Extract word embedding directly from text #15

Open
sbelharbi opened this issue Jul 8, 2024 · 1 comment

sbelharbi commented Jul 8, 2024

Hi,
the provided code is designed to extract word embeddings only after the transcripts have been extracted and stored in a specific format.
I changed the code so that it extracts the word embeddings directly from a given text transcript.

extract_word_embedding_fn:

    def extract_word_embedding_fn(self, idx, output_filename, tokenizer, bert):

        ds = self.config['dataset_name']
        assert ds in [constants.MELD, constants.C_EXPR_DB], ds

        # transcripts have already been extracted...
        transcript: str = self.per_trial_info[idx]['video_transcript']
        # not used. we failed to use extract_trascript_fn.
        # we access the transcript directly.

        # input_path = join(self.config['output_root_directory'],
        #                   *self.per_trial_info[idx]['punctuation_path'][-2:])

        input_path = None

        output_path = join(self.config['output_root_directory'],
                           self.config['word_embedding_folder'],
                           output_filename + ".csv")
        ensure_dir(output_path)

        bert.cuda()
        extract_word_embedding(input_path, output_path, tokenizer,
                               bert, transcript=transcript)

        self.per_trial_info[idx]['processing_record'][
            'embedding_path'] = output_path.split(os.sep)
        self.per_trial_info[idx]['embedding_path'] = output_path.split(os.sep)

extract_word_embedding:

def extract_word_embedding(input_path, output_path, tokenizer, bert,
                           max_length=256, transcript: str = None):
    if not os.path.isfile(output_path):

        if input_path is None:
            assert transcript is not None, transcript
            df_try = ['']
            assert len(transcript) > 0, f"{len(transcript)} | {transcript}"
        else:
            assert transcript is None
            df_try = pd.read_csv(input_path, header=None, sep=";")


        if len(df_try) != 1 or transcript is not None:
            from torch.utils.data import TensorDataset, DataLoader

            if input_path is not None:

                df = pd.read_csv(input_path, header=None, sep=";", skiprows=1)
                str_words = [str(word) for word in df.values[:, 2]]
                num_tokens = len(df)
                paragraph = " ".join(str_words)
            else:
                assert transcript is not None
                num_tokens = len(transcript.split(' '))
                paragraph = copy.deepcopy(transcript)

            token_ids, token_masks, paragraph = tokenize(
                paragraph, tokenizer, max_length=max_length)

            dataset = TensorDataset(token_ids, token_masks)
            data_loader = DataLoader(dataset, batch_size=32, shuffle=False)

            token_vecs_sum = calculate_token_embeddings(data_loader, bert)
            bert_features = exclude_padding(token_vecs_sum, token_masks)

            # The indices help to restore the bert feature for a one-to-one
            # correspondence to the input tokenizers.
            idx_intact, idx_target, idx_non_sub_words, idx_grouped_sub_words\
                = get_sub_word_idx(paragraph, tokenizer)

            average_merged_bert_features = average_merge_embeddings(
                num_tokens, idx_intact, idx_target, bert_features,
                idx_non_sub_words, idx_grouped_sub_words)
            msg = f"{len(average_merged_bert_features)} | {num_tokens}"
            assert len(average_merged_bert_features) == num_tokens, msg


            combined_df = np.c_[df.values, average_merged_bert_features]  # needs adaptation as no df is defined.
            # error
            combined_df = compress_single_quote(combined_df)
            # error
            combined_df.to_csv(output_path, sep=";", index=False)

There is an issue with average_merge_embeddings.
After simplifying the averaging, another issue shows up in compress_single_quote. That function also seems to add a new header: '["start", "end", "word", "confidence", *np.arange(768)]'.

Here is a standalone script that does all of the above for a simple example.

import copy

import more_itertools as mit
import numpy as np
from torch.utils.data import TensorDataset, DataLoader

from abaw5_preprocessing.base.speech import tokenize
from abaw5_preprocessing.base.speech import calculate_token_embeddings
from abaw5_preprocessing.base.speech import get_sub_word_idx
from abaw5_preprocessing.base.speech import average_merge_embeddings
from abaw5_preprocessing.base.speech import compress_single_quote
from abaw5_preprocessing.base.speech import exclude_padding

def merge_features(length: int, bert_features) -> np.ndarray:
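    # Split the token features into `length` contiguous, near-equal blocks
    # and average each block, so that every word gets one feature vector.
    # This is a positional approximation, not a true token-to-word alignment.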
    average_merged_matrix = np.zeros((length, 768), dtype=np.float32)
    n = bert_features.shape[0]
    token_indices = list(range(n))
    blocks = [list(c) for c in mit.divide(length, token_indices)]
    assert len(blocks) == length, f"{len(blocks)} | {length}"

    for i, block in enumerate(blocks):
        assert block != [], block
        average = np.mean(bert_features[block], axis=0)
        average_merged_matrix[i] = average

    return average_merged_matrix

from transformers import BertTokenizer, BertModel
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased',
                                 output_hidden_states=True)
bert.eval()
bert.cuda()
tokenizer = bert_tokenizer


def extract_word_embedding(transcript):
    max_length = 256

    num_tokens = len(transcript.split(' '))
    paragraph = copy.deepcopy(transcript)

    token_ids, token_masks, paragraph = tokenize(
        paragraph, tokenizer, max_length=max_length)

    dataset = TensorDataset(token_ids, token_masks)
    data_loader = DataLoader(dataset, batch_size=32, shuffle=False)

    token_vecs_sum = calculate_token_embeddings(data_loader, bert)
    bert_features = exclude_padding(token_vecs_sum, token_masks)

    # The indices help to restore the bert feature for a one-to-one
    # correspondence to the input tokenizers.

    idx_intact, idx_target, idx_non_sub_words, idx_grouped_sub_words \
        = get_sub_word_idx(paragraph, tokenizer)

    # average_merged_bert_features = average_merge_embeddings(
    #     num_tokens, idx_intact, idx_target, bert_features,
    #     idx_non_sub_words, idx_grouped_sub_words)

    average_merged_bert_features = merge_features(num_tokens, bert_features)

    msg = f"{len(average_merged_bert_features)} | {num_tokens}"
    assert len(average_merged_bert_features) == num_tokens, msg

    # num_tokens, 1 + 768
    combined_df = np.c_[transcript.split(' '), average_merged_bert_features]

    # error
    combined_df = compress_single_quote(combined_df)


if __name__ == "__main__":
    extract_word_embedding("Okay, how about Sunday?")

For now I have changed the storage format of the averaged embeddings to np.save('bert_features_example.npy', average_merged_bert_features) instead of the CSV you use, since compress_single_quote is not working.
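
If a CSV output is still wanted for the text-only case, one option is to skip compress_single_quote and write the table with pandas directly. The sketch below is an assumption about what a suitable layout could look like, not the repository's format: the helper name `save_word_embeddings_csv` and the column names (`word` followed by `0`, `1`, ...) are hypothetical, and the `;` separator simply mirrors the `sep=";"` used in the code above.

```python
import numpy as np
import pandas as pd


def save_word_embeddings_csv(words, features, output_path):
    """Write one row per word: the word followed by its embedding values.

    Hypothetical helper, not part of abaw5_preprocessing.
    """
    features = np.asarray(features, dtype=np.float32)
    if features.ndim != 2 or len(words) != features.shape[0]:
        raise ValueError(
            f"expected one feature row per word, got {len(words)} words "
            f"and features of shape {features.shape}")

    # Build the numeric columns first and add the words as a separate column,
    # so the floats keep a numeric dtype instead of sharing one array with
    # the strings.
    columns = [str(i) for i in range(features.shape[1])]
    df = pd.DataFrame(features, columns=columns)
    df.insert(0, "word", list(words))
    df.to_csv(output_path, sep=";", index=False)
    return df
```

Reading the file back with `pd.read_csv(path, sep=";")` should then give a table with one `word` column plus one column per embedding dimension.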

Any idea why your averaging fails here?
Thanks.
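
One thing that may contribute to the length mismatch, offered as a guess rather than a diagnosis of average_merge_embeddings: `transcript.split(' ')` counts whitespace-separated words, while BERT's tokenizer splits punctuation into separate tokens before any sub-word splitting. The sketch below approximates that punctuation split with a regular expression (the regex and both helper names are illustrative, not the real tokenizer) to show the two counts diverging for the example sentence.

```python
import re


def whitespace_words(text):
    """Words as counted by num_tokens in the script above."""
    return text.split(" ")


def punctuation_split(text):
    """Rough stand-in for BERT's pre-tokenization.

    Runs of word characters stay together and every other non-space
    character becomes its own token. Illustrative only: the real tokenizer
    also lowercases the text and applies WordPiece sub-word splitting.
    """
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Okay, how about Sunday?"
print(len(whitespace_words(sentence)), whitespace_words(sentence))
print(len(punctuation_split(sentence)), punctuation_split(sentence))
```

For this sentence the whitespace split yields 4 items and the punctuation-aware split yields 6, so any step that expects one group of tokens per whitespace word has to fold the punctuation tokens back into their neighbouring words.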

Owner

sucv commented Jul 16, 2024

Hi, the functions here were designed for English speech in the ABAW3 video data, transcribed with Vosk. I am not sure whether they work with other videos.

When fed audio or video, Vosk produces a transcription with a time stamp for each spoken word. My code then repairs the transcript (adds punctuation, tokenizes, adds or deletes erroneous entries), feeds it to BERT to produce token-level embeddings, gets the mapping between tokens and words, averages the embeddings of the grouped tokens to get the exact word embedding, and then populates it according to the time stamps.
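
The token-to-word averaging step described above can be sketched independently of the repository code. This is not the implementation of average_merge_embeddings; it is a minimal version that assumes a per-token list of word indices is already available, in the shape that Hugging Face fast tokenizers return from `BatchEncoding.word_ids()` (`None` for special tokens such as [CLS] and [SEP]). The function name `average_by_word` is hypothetical.

```python
import numpy as np


def average_by_word(token_vectors, word_ids, num_words):
    """Average token vectors that belong to the same word.

    token_vectors: array of shape (num_tokens, hidden_size)
    word_ids:      one entry per token, either a word index or None
    num_words:     number of words expected in the output
    """
    token_vectors = np.asarray(token_vectors, dtype=np.float32)
    if len(word_ids) != token_vectors.shape[0]:
        raise ValueError("need exactly one word id per token vector")

    sums = np.zeros((num_words, token_vectors.shape[1]), dtype=np.float32)
    counts = np.zeros(num_words, dtype=np.int64)
    for vec, wid in zip(token_vectors, word_ids):
        if wid is None:  # special or padding token: not part of any word
            continue
        sums[wid] += vec
        counts[wid] += 1

    # Fail loudly if a word received no tokens (for example after truncation).
    if np.any(counts == 0):
        missing = np.flatnonzero(counts == 0).tolist()
        raise ValueError(f"no tokens found for word indices {missing}")
    return (sums / counts[:, None]).astype(np.float32)
```

With a fast tokenizer, the word indices would typically come from encoding the pre-split words with `is_split_into_words=True`, so that word index i refers to the i-th input word.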

It seems that you are only interested in extracting word embeddings from a given text? Without considering the time stamps, there are many ways to obtain sentence-level or word-level embeddings. One is explained here. More can be found via Google or ChatGPT.
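
As one example of the sentence-level route mentioned above, a common approach is to mean-pool the token vectors of a sentence while ignoring padding positions. The sketch below uses plain NumPy arrays standing in for the model output and the attention mask; the function name `masked_mean_pool` is hypothetical, and this is one of several reasonable pooling choices rather than a recommendation from this repository.

```python
import numpy as np


def masked_mean_pool(token_vectors, attention_mask):
    """Mean of the token vectors over positions where the mask is 1.

    token_vectors:  array of shape (batch, seq_len, hidden_size)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens
                    and 0 for padding
    Returns an array of shape (batch, hidden_size).
    """
    token_vectors = np.asarray(token_vectors, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]
    summed = (token_vectors * mask).sum(axis=1)
    # Clip so that a fully masked row returns zeros instead of dividing by 0.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts
```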
