Hi python experts,
I tried to get embeddings from a pre-trained language model with the function listed below. However, it takes forever to execute when I have one thousand articles, each of which has at least 500 words. Could anyone suggest how to modify the code to speed up the process? Thanks!
from transformers import BertTokenizer, BertModel
import torch

# Checkpoint name assumed here; use whichever pre-trained model you load.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(text_list, batch_size=32):
    embeddings = []
    # Process texts in batches
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i + batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt',
                           padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the embeddings from the [CLS] token
        batch_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.extend(batch_embeddings)
    return embeddings
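For context, the usual first win for code like this is running inference on a GPU (when one is available) and keeping the forward pass inside `torch.no_grad()` with the model in `eval()` mode. Below is a minimal sketch of that device-handling pattern; it uses a stand-in `torch.nn.Linear` module instead of BERT so the structure is visible without downloading a checkpoint, and the dimensions (768) are just illustrative:

```python
import torch

# Pick the fastest available device; falls back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the BERT model: any torch.nn.Module is moved the same way.
model = torch.nn.Linear(768, 768).to(device)
model.eval()  # disable dropout etc. for deterministic inference

def embed_batches(tensor_batches):
    embeddings = []
    with torch.no_grad():  # skip autograd bookkeeping: faster, less memory
        for batch in tensor_batches:
            out = model(batch.to(device))   # move inputs to the same device
            embeddings.append(out.cpu())    # bring results back before stacking
    return torch.cat(embeddings)

# Example: four batches of 8 vectors each -> one (32, 768) tensor
batches = [torch.randn(8, 768) for _ in range(4)]
result = embed_batches(batches)
print(result.shape)  # torch.Size([32, 768])
```

With the real model, the same pattern means calling `model.to(device)` once after loading and `inputs = {k: v.to(device) for k, v in inputs.items()}` inside the loop; increasing `batch_size` as far as memory allows also cuts per-batch Python overhead.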
Larz60+ write May-24-2024, 07:05 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time. Please use BBCode tags on future posts.