Python Forum

Hi Python experts,

I am trying to get embeddings from a pre-trained language model with the function below. However, it takes forever to run when I have one thousand articles, each with at least 500 words. Could anyone suggest how to modify the code to speed it up? Thanks!

from transformers import BertTokenizer, BertModel
import torch
def get_bert_embeddings(text_list, batch_size=32):
    embeddings = []
    
    # Process texts in batches
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i + batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Use the embeddings from the [CLS] token
        batch_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.extend(batch_embeddings)
    return embeddings

Maybe you could post some example text and example output?

(May-25-2024, 07:14 AM) Pedroski55 wrote: Maybe you could post some example text and example output?

Hi,
the example dataset can be created by the following code:

import pandas as pd
# create an example data set
ex = pd.DataFrame()
ex['content2'] = ["波士頓一座充滿歷史和文化魅力的城市擁有眾多著名景點若只有一天時間遊覽波士頓以下行程可以幫助你在有限時間內充分體驗這座城市的魅力早上從波士頓公園開始這是美國最古老的公園漫步於這片綠地可以感受到城市的悠閒氛圍接著前往隔壁的公共花園這裡有美麗的天鵝船和各種季節性的花卉展示是拍照的好地方從公共花園出發沿著自由之路步行這條紅磚路線串聯了波士頓的重要歷史地標首先是馬薩諸塞州議會大廈接著是老州議會大廈和法尼爾廳這些建築見證了美國革命的重要時刻中午可以到昆西市場享用午餐這裡匯集了各種美食不妨試試當地的龍蝦卷或蛤蜊濃湯品味波士頓的地道風味用餐後繼續沿著自由之路前行到北區參觀保羅里維爾故居和老北教堂這裡是保羅里維爾騎馬報信的出發地下午可以參觀科學博物館這裡有豐富的互動展品適合各年齡層的遊客若對藝術感興趣不妨到波士頓美術館館內收藏了大量來自世界各地的藝術品讓人目不暇給傍晚時分前往波士頓港乘坐遊船欣賞城市天際線和海港風光夜晚的波士頓燈火輝煌令人陶醉在遊船上欣賞夕陽西下為這一天畫上完美句點晚餐可以選擇在北區的義大利餐廳這裡有眾多傳統義大利菜讓人食指大動最後若時間允許可以到查爾斯河畔散步享受夜晚的寧靜與美麗波士頓一日遊雖然時間緊湊但通過這樣的安排能夠讓你充分感受到這座城市的歷史底蘊和現代魅力希望這段旅程能給你留下美好的回憶"]
# duplicate the row 1000 times
ex = pd.concat([ex] * 1000)
# get the embeddings
embeddings = get_bert_embeddings(list(ex['content2']), batch_size=2)

vocab.txt is missing, so the tokenizer can't start. You need a Chinese vocab.txt; it must be somewhere on Google!

The tokenizer used in your function get_bert_embeddings() is never initialised. It should be created like this:

Quote:tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

This is from tokenization_bert.py

line 29:

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

line 32:

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n")
        vocab[token] = index
    return vocab

line 111:

if not os.path.isfile(vocab_file):
    raise ValueError(
        f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
        " model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
    )

Where is the pre-trained model? The error message says:

Quote:Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

I downloaded the modules you use, but vocab.txt is not on my computer.

Maybe it is possible to start with an empty vocab.txt?

I will try that!
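
On the empty vocab.txt idea: going by the load_vocab excerpt above, the vocabulary is just a mapping from each line of the file to its line index, so an empty file would give an empty mapping and no token could be looked up. A minimal sketch of that behaviour, re-implementing load_vocab on throwaway files (vocab_from_lines is a hypothetical helper and the file contents are made up for illustration):

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Map each line of vocab_file to its 0-based line index (mirrors the excerpt above)."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        vocab[token.rstrip("\n")] = index
    return vocab


def vocab_from_lines(lines):
    """Hypothetical helper: write lines to a temporary vocab.txt and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocab.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("".join(line + "\n" for line in lines))
        return load_vocab(path)


# Made-up toy vocabulary, not the real bert-base-chinese one.
toy_vocab = vocab_from_lines(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "波", "士", "頓"])
empty_vocab = vocab_from_lines([])

print(toy_vocab["波"])   # id of a known character
print(len(empty_vocab))  # an empty file yields an empty vocabulary
```

In practice BertTokenizer.from_pretrained("bert-base-chinese") downloads the matching vocab.txt, so none has to be created by hand.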

Sorry for the missing information. Please use the following setup:

model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval() # Set the model to evaluation mode

Have a look here:

from transformers import BertTokenizer, BertModel
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

#tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

def get_bert_embeddings(text_list, batch_size=2):
    embeddings = []
     
    # Process texts in batches
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i + batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
         
        # Use the embeddings from the [CLS] token
        batch_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.extend(batch_embeddings)
    return embeddings

result_embeddings = get_bert_embeddings(list(ex['content2']), batch_size=2)

Just gives the error:

Output:
AttributeError: 'MaskedLMOutput' object has no attribute 'last_hidden_state'

Where did you get the function get_bert_embeddings(text_list, batch_size=2) from??
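
The error comes from the model class rather than from the function. AutoModelForMaskedLM loads BERT with a masked-language-modelling head, and its output carries logits instead of last_hidden_state; the bare BertModel (or AutoModel) returns last_hidden_state directly. A small sketch of the difference, assuming torch and transformers are installed and using a tiny randomly initialised BertConfig (made-up sizes, so nothing is downloaded and the numbers themselves are meaningless):

```python
import torch
from transformers import BertConfig, BertForMaskedLM, BertModel

# Tiny, randomly initialised config: illustrative sizes only, not bert-base-chinese.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
input_ids = torch.tensor([[2, 5, 6, 7, 3]])  # one fake sequence of 5 token ids

base_model = BertModel(config).eval()
mlm_model = BertForMaskedLM(config).eval()

with torch.no_grad():
    base_out = base_model(input_ids=input_ids)
    mlm_out = mlm_model(input_ids=input_ids, output_hidden_states=True)

# The bare encoder exposes last_hidden_state: (batch, seq_len, hidden_size).
cls_from_base = base_out.last_hidden_state[:, 0, :]

# The masked-LM output has logits over the vocabulary instead. The last entry
# of hidden_states is the final encoder layer, and it is only populated
# because output_hidden_states=True was passed.
cls_from_mlm = mlm_out.hidden_states[-1][:, 0, :]

print(hasattr(mlm_out, "last_hidden_state"))
print(cls_from_base.shape, cls_from_mlm.shape, mlm_out.logits.shape)
```

For the function in this thread, loading the model with BertModel.from_pretrained("bert-base-chinese") (or AutoModel) should keep outputs.last_hidden_state[:, 0, :] working unchanged.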

Is this what you need? (It contains vocab.txt for traditional Chinese.)

I played with this some more, using just one text:

#! /usr/bin/python3
import pandas as pd
#create an example data set
ex = pd.DataFrame()
ex['content2'] = ["波士頓 一座 充滿 歷史 和 文化 魅力 的 城市 擁有 眾多 著名景點 若 只有 一天 時間 遊覽 波士頓 以下 行程 可以 幫助 你 在 有限 時間 內 充分 體驗 這座 城市 的 魅力 早上 從 波士頓 公園 開始 這是 美國 最 古老 的 公園 漫步 於 這片 綠地 可以 感受 到 城市 的 悠閒 氛圍 接著 前往 隔壁 的 公共 花園 這裡 有 美麗 的 天鵝 船 和 各種 季節性 的 花卉 展示 是 拍照 的 好 地方 從 公共 花園 出發 沿著 自由 之路 步行 這條 紅磚 路線 串聯 了 波士頓 的 重要 歷史 地標 首先 是 馬薩諸塞州 議會 大廈 接著 是 老州 議會 大廈 和 法 尼爾 廳 這些 建築 見證 了 美國 革命 的 重要 時刻 中午 可以 到 昆西 市場 享用 午餐 這裡 匯集 了 各種 美食 不妨 試試 當地 的 龍蝦 卷 或 蛤蜊 濃湯 品味 波士頓 的 地道 風味 用餐 後 繼續 沿著 自由 之路 前行 到 北區 參觀 保羅 里 維爾 故居 和 老北 教堂 這裡 是 保羅 里 維爾 騎馬 報信 的 出發地 下午 可以 參觀 科學 博物館 這裡 有 豐富 的 互動 展品 適合 各 年齡層 的 遊客 若 對 藝術 感興趣 不妨 到 波士頓 美術館 館內 收藏 了 大量 來自 世界各地 的 藝術品 讓 人 目不暇給 傍晚 時分 前往 波士頓 港 乘坐 遊船 欣賞 城市 天際線 和 海港 風光 夜晚 的 波士頓 燈火輝煌 令人 陶醉 在 遊船 上 欣賞 夕陽西下 為 這 一天 畫上 完美 句點 晚餐 可以 選擇 在 北區 的 義大利 餐廳 這裡 有眾 多 傳統 義大利 菜 讓 人 食指大動 最後 若 時間 允許 可以 到 查爾斯 河畔 散步 享受 夜晚 的 寧靜 與 美麗 波士頓 一日遊 雖然 時間 緊湊 但 通過 這樣 的 安排 能夠 讓 你 充分 感受 到 這座 城市 的 歷史 底蘊 和 現代 魅力 希望 這段 旅程 能給 你 留下 美好 的 回憶"]
text_list = list(ex['content2'])
from transformers import BertTokenizer, BertModel
import torch
from timeit import default_timer as timer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

def get_bert_embeddings(text_list, batch_size=2):
    embeddings = []
     
    # Process texts in batches
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i + batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
         
        # Use the embeddings from the [CLS] token
        batch_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.extend(batch_embeddings)
    return embeddings

# returns a list of NumPy arrays, one [CLS] vector per text
start = timer()
result_embeddings = get_bert_embeddings(text_list, batch_size=2)
end = timer()
print(f'time elapsed = {end - start} seconds')

Gives:

Output:pedro@pedro-HP:~/myPython/transformers$ ./word_sorting_maybe.py
time elapsed = 0.6069911359954858 seconds
pedro@pedro-HP:~/myPython/transformers$

The text has 556 words.

No idea what this actually does, but it was interesting!
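
On the original speed question: the timing above covers a single text, so it says little about 1000 articles. Two things usually matter most. First, running the model on a GPU if one is available (model.to("cuda"), inputs.to("cuda"), and .cpu().numpy() on the result). Second, since padding=True pads every batch to its longest text, grouping texts of similar length into the same batch avoids wasted padding. A minimal sketch of the second idea in plain Python; the helper names are hypothetical, character count is used as a rough proxy for token count, and the embedding step is replaced by a stand-in so that it runs without a model:

```python
def length_sorted_batches(texts, batch_size):
    """Yield (original_indices, batch_texts), shortest texts first."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [texts[i] for i in indices]


def embed_in_original_order(texts, embed_batch, batch_size=32):
    """Run embed_batch on length-sorted batches, then undo the sorting."""
    results = [None] * len(texts)
    for indices, batch in length_sorted_batches(texts, batch_size):
        for i, vector in zip(indices, embed_batch(batch)):
            results[i] = vector
    return results


def fake_embed_batch(batch):
    """Stand-in for the tokenizer + model call: one 'vector' per text."""
    return [[len(text)] for text in batch]


texts = ["bbbb", "a", "ccccccc", "dd", "eee"]
batches = list(length_sorted_batches(texts, batch_size=2))
vectors = embed_in_original_order(texts, fake_embed_batch, batch_size=2)
print(batches)
print(vectors)
```

With the real model, embed_batch would wrap the tokenizer call and model(**inputs) from get_bert_embeddings. When every article is truncated to 512 tokens anyway, the batches are already uniform, so the gain from sorting is small and the GPU is what matters.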

Posters in this thread: veda, Pedroski55, Larz60+