Frozen Embeddings with InversionFromLogitsModel is incorrect #70

themurtazanazir · 2024-10-01T08:29:30Z

In tokenize_data.py:

def embed_dataset_batch(model: InversionModel, batch: Dict) -> Dict:
    assert "input_ids" in batch.keys(), f"invalid keys {batch.keys()}"
    assert hasattr(model, "call_embedding_model")

    input_ids = batch["input_ids"]
    inputs_str = model.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
    emb_input_ids = model.embedder_tokenizer(
        inputs_str,
        max_length=model.config.max_seq_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    ).to(next(model.parameters()).device)

    with torch.no_grad():
        batch["frozen_embeddings"] = model.call_embedding_model(**emb_input_ids)
    return batch

Here the token IDs produced by model.embedder_tokenizer are passed to call_embedding_model.

But in models/inversion_from_logits.py:

    def call_embedding_model(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        embedder = self.embedder

        inputs_str = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        emb_input_ids = self.embedder_tokenizer(
            inputs_str,
            max_length=self.config.max_seq_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        ).to(next(self.parameters()).device)

        model_output = embedder(**emb_input_ids)
        return self._process_embedder_output(model_output, emb_input_ids.attention_mask)

This function expects token IDs from model.tokenizer, not from model.embedder_tokenizer.

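As a result, IDs from the embedder's vocabulary are decoded with model.tokenizer, and the resulting gibberish text is re-tokenized and sent to the embedder.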
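To make the mismatch concrete, here is a minimal sketch using two toy tokenizers that share the same words but assign them different IDs. `ToyTokenizer` and `decode_like_call_embedding_model` are hypothetical names invented for this illustration; they are not part of the repository and only mimic the decode step quoted above.

```python
from typing import List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocab: List[str]) -> None:
        self.id_to_token = dict(enumerate(vocab))
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)


# Stand-ins for model.tokenizer and model.embedder_tokenizer:
# same words, different ID assignments.
tokenizer = ToyTokenizer(["the", "cat", "sat", "down"])
embedder_tokenizer = ToyTokenizer(["down", "sat", "the", "cat"])


def decode_like_call_embedding_model(input_ids: List[int]) -> str:
    """Mimics the first step of call_embedding_model: decode with model.tokenizer."""
    return tokenizer.decode(input_ids)


text = "the cat sat"
model_ids = tokenizer.encode(text)              # IDs in model.tokenizer's vocabulary
embedder_ids = embedder_tokenizer.encode(text)  # IDs in the embedder's vocabulary

# What embed_dataset_batch effectively does: embedder IDs decoded with model.tokenizer.
buggy_text = decode_like_call_embedding_model(embedder_ids)
# Passing model.tokenizer IDs instead round-trips to the original text.
fixed_text = decode_like_call_embedding_model(model_ids)

print(repr(buggy_text))  # 'sat down cat'
print(repr(fixed_text))  # 'the cat sat'
```

One possible fix, which this thread does not confirm, would be for embed_dataset_batch to pass batch["input_ids"] (and a matching attention mask) straight to call_embedding_model and let that method handle the decode and re-tokenize step.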


themurtazanazir · 2024-10-01T08:29:55Z

I can raise a PR if needed.

jxmorris12 · 2024-10-01T15:11:51Z

@themurtazanazir thank you for finding this – a pull request would be amazing!
