Hi,
The provided code was designed to extract word embeddings after transcript extraction, where the transcript is formatted in a specific way.
I changed the code to extract the word embeddings from a given text transcript.
extract_word_embedding_fn:

```python
def extract_word_embedding_fn(self, idx, output_filename, tokenizer, bert):
    ds = self.config['dataset_name']
    assert ds in [constants.MELD, constants.C_EXPR_DB], ds

    # transcripts have already been extracted...
    transcript: str = self.per_trial_info[idx]['video_transcript']

    # not used. we failed to use extract_trascript_fn.
    # we access the transcript directly.
    # input_path = join(self.config['output_root_directory'],
    #                   *self.per_trial_info[idx]['punctuation_path'][-2:])
    input_path = None
    output_path = join(self.config['output_root_directory'],
                       self.config['word_embedding_folder'],
                       output_filename + ".csv")
    ensure_dir(output_path)

    bert.cuda()
    extract_word_embedding(input_path, output_path, tokenizer,
                           bert, transcript=transcript)

    self.per_trial_info[idx]['processing_record'][
        'embedding_path'] = output_path.split(os.sep)
    self.per_trial_info[idx]['embedding_path'] = output_path.split(os.sep)


def extract_word_embedding(input_path, output_path, tokenizer, bert,
                           max_length=256, transcript: str = None):
    if not os.path.isfile(output_path):
        if input_path is None:
            assert transcript is not None, transcript
            df_try = ['']
            assert len(transcript) > 0, f"{len(transcript)} | {transcript}"
        else:
            assert transcript is None
            df_try = pd.read_csv(input_path, header=None, sep=";")

        if (not len(df_try) == 1) or (transcript is not None):
            from torch.utils.data import TensorDataset, DataLoader

            if input_path is not None:
                df = pd.read_csv(input_path, header=None, sep=";", skiprows=1)
                str_words = [str(word) for word in df.values[:, 2]]
                num_tokens = len(df)
                paragraph = [" ".join(str_words)][0]
            else:
                assert transcript is not None
                num_tokens = len(transcript.split(' '))
                paragraph = copy.deepcopy(transcript)

            token_ids, token_masks, paragraph = tokenize(
                paragraph, tokenizer, max_length=max_length)
            dataset = TensorDataset(token_ids, token_masks)
            data_loader = DataLoader(dataset, batch_size=32, shuffle=False)
            token_vecs_sum = calculate_token_embeddings(data_loader, bert)
            bert_features = exclude_padding(token_vecs_sum, token_masks)

            # The indices help to restore the bert feature for a one-to-one
            # correspondence to the input tokenizers.
            idx_intact, idx_target, idx_non_sub_words, idx_grouped_sub_words \
                = get_sub_word_idx(paragraph, tokenizer)
            average_merged_bert_features = average_merge_embeddings(
                num_tokens, idx_intact, idx_target, bert_features,
                idx_non_sub_words, idx_grouped_sub_words)
            msg = f"{len(average_merged_bert_features)} | {num_tokens}"
            assert len(average_merged_bert_features) == num_tokens, msg

            # needs adaptation as no df is defined.
            combined_df = np.c_[df.values, average_merged_bert_features]
            # error
            combined_df = compress_single_quote(combined_df)
            # error
            combined_df.to_csv(output_path, sep=";", index=False)
```
There is an issue with average_merge_embeddings.
After simplifying the averaging, another issue shows up in compress_single_quote. This last one also seems to add a new header '["start", "end", "word", "confidence", *np.arange(768)]'.
Here is standalone code that does all of the above for a simple example.
```python
import more_itertools as mit
import copy
import numpy as np
from torch.utils.data import TensorDataset, DataLoader

from abaw5_preprocessing.base.speech import tokenize
from abaw5_preprocessing.base.speech import calculate_token_embeddings
from abaw5_preprocessing.base.speech import get_sub_word_idx
from abaw5_preprocessing.base.speech import average_merge_embeddings
from abaw5_preprocessing.base.speech import compress_single_quote
from abaw5_preprocessing.base.speech import exclude_padding


def merge_features(length: int, bert_features) -> np.ndarray:
    # merge adjacent features into single to form one-to-one alignment
    # between words and feature vectors.
    average_merged_matrix = np.zeros((length, 768), dtype=np.float32)
    n = bert_features.shape[0]
    l = list(range(n))
    blocks = [list(c) for c in mit.divide(length, l)]
    assert len(blocks) == length, f"{len(blocks)} | {length}"

    for i, block in enumerate(blocks):
        assert block != [], block
        average = np.mean(bert_features[block], axis=0)
        average_merged_matrix[i] = average

    return average_merged_matrix


from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased',
                                 output_hidden_states=True)
bert.eval()
bert.cuda()
tokenizer = bert_tokenizer


def extract_word_embedding(transcript):
    max_length = 256
    num_tokens = len(transcript.split(' '))
    paragraph = copy.deepcopy(transcript)
    token_ids, token_masks, paragraph = tokenize(
        paragraph, tokenizer, max_length=max_length)
    dataset = TensorDataset(token_ids, token_masks)
    data_loader = DataLoader(dataset, batch_size=32, shuffle=False)
    token_vecs_sum = calculate_token_embeddings(data_loader, bert)
    bert_features = exclude_padding(token_vecs_sum, token_masks)

    # The indices help to restore the bert feature for a one-to-one
    # correspondence to the input tokenizers.
    idx_intact, idx_target, idx_non_sub_words, idx_grouped_sub_words \
        = get_sub_word_idx(paragraph, tokenizer)

    # average_merged_bert_features = average_merge_embeddings(
    #     num_tokens, idx_intact, idx_target, bert_features,
    #     idx_non_sub_words, idx_grouped_sub_words)
    average_merged_bert_features = merge_features(num_tokens, bert_features)
    msg = f"{len(average_merged_bert_features)} | {num_tokens}"
    assert len(average_merged_bert_features) == num_tokens, msg

    # num_tokens, 1 + 768
    combined_df = np.c_[transcript.split(' '), average_merged_bert_features]
    # error
    combined_df = compress_single_quote(combined_df)


if __name__ == "__main__":
    extract_word_embedding("Okay, how about Sunday?")
```
For now, I changed the storage format of the averaged embedding to np.save('bert_features_example.npy', average_merged_bert_features) instead of CSV as you did, since compress_single_quote is not working.
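(Since only the CSV writing is blocked, one alternative I considered is keeping the words and their vectors together in a single .npz file so the alignment is not lost. A minimal sketch with stand-in data, nothing repo-specific assumed:)

```python
import numpy as np

words = "Okay, how about Sunday?".split(' ')
# stand-in for the averaged BERT features, one 768-d vector per word
features = np.random.rand(len(words), 768).astype(np.float32)

# store words and vectors together so the word-to-vector alignment is kept
np.savez('bert_features_example.npz', words=np.array(words), features=features)

loaded = np.load('bert_features_example.npz')
print(list(loaded['words']), loaded['features'].shape)
```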
Any idea why your averaging fails here?
Thanks
Hi, the functions here were designed for English speech in the ABAW3 video data using Vosk. I am not sure if they work with other videos.
Vosk, when fed with audio or video, produces the transcription with time stamps for each spoken word. My code then repairs the transcript (adding punctuation, fixing wrongly added/deleted words), tokenizes it, feeds it to BERT to produce the token-level embeddings, gets the mapping between tokens and words, "averages" the embeddings of the grouped tokens to get the exact "word" embedding, and then populates it according to the time stamps.
It seems that you are only interested in extracting word embeddings from a given text? Without considering the time stamps, there are many ways to obtain sentence-level or word-level embeddings. One may be explained here. More can be found via Google or ChatGPT.
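For example, a minimal sketch for plain text (this is not the pipeline above; it assumes the HuggingFace fast tokenizer, whose word_ids() maps each sub-word token back to its word, and note that punctuation counts as its own "word" rather than following a whitespace split):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # fast tokenizer by default
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def word_embeddings(text: str) -> torch.Tensor:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    word_ids = enc.word_ids(0)                              # token -> word index, None for [CLS]/[SEP]
    n_words = max(w for w in word_ids if w is not None) + 1
    vecs = [hidden[[i for i, w in enumerate(word_ids) if w == j]].mean(dim=0)
            for j in range(n_words)]                        # average sub-word tokens per word
    return torch.stack(vecs)                                # (n_words, 768)

print(word_embeddings("Okay, how about Sunday?").shape)     # e.g. torch.Size([6, 768])
```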