
DPR: Max length for query tokenizer model encoding #625

Closed
psorianom opened this issue Nov 14, 2020 · 3 comments
@psorianom

Hi all!

Thank you for the work on DPR training (multi-GPU support especially 👍)!

I am currently trying to train a DPR model and I am running into a problem. When the input dicts are transformed into Sample objects, and more specifically when the query is encoded with the tokenizer, the defaults are max_length=self.max_seq_len_query (with self.max_seq_len_query=64) and truncation_strategy="do_not_truncate". This raises a ValueError whenever the query is longer than 64 tokens, here:

        except ValueError:
            cur_tensor = torch.tensor(
                [sample[t_name] for sample in features], dtype=torch.float32
            )

error:

ValueError: expected sequence of length 64 at dim 1 (got 83)
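For context, the error itself simply comes from torch.tensor being handed per-sample sequences of unequal length, which can be reproduced in isolation:

    import torch

    # One query encoded to 64 ids and one to 83 ids (because it was not
    # truncated): stacking them into a single batch tensor fails.
    features = [[0] * 64, [0] * 83]
    torch.tensor(features, dtype=torch.long)
    # ValueError: expected sequence of length 64 at dim 1 (got 83)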

I locally changed the truncation_strategy to longest_first and I no longer get this error. Still, I wonder what the correct approach would be. I believe truncating may be the best option (maybe with a warning to the user), since increasing the max_seq_len_query parameter would only be a case-by-case workaround and would still crash for any user whose queries exceed the new limit.
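For illustration (this is not FARM's exact internal call, just a sketch with a plain Hugging Face tokenizer and its current truncation argument), the difference between the two settings looks like this:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/dpr-question_encoder-single-nq-base"
    )
    long_query = "why does the query tokenizer crash " * 20  # well over 64 tokens

    # Without truncation, the encoding keeps every token of the query,
    # so long queries produce sequences longer than max_seq_len_query.
    ids_full = tokenizer.encode(long_query)
    print(len(ids_full))  # grows with the query, > 64 here

    # With longest_first truncation, every query is capped at 64 ids and
    # the whole batch can be stacked into one tensor.
    ids_trunc = tokenizer.encode(long_query, max_length=64, truncation="longest_first")
    print(len(ids_trunc))  # 64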

I am using Linux Mint 20 with a GPU, on a cloned FARM at commit 2fabc31 (the latest as of this writing).

Thank you !

@Timoeller
Contributor

Nice! Thanks for using the new DPR functionality. @tholor could you look into this? It seems to be a bug on our side.

@tholor
Member

tholor commented Nov 16, 2020

Implemented a fix in #629.

In terms of user notification: we already have the "truncation stats" logged by the DataSilo, which tell you how many of your samples got truncated. We could extend these stats for the BiAdaptiveModel case, where we need two separate sets of stats (one for queries and one for passages).

@tholor tholor closed this as completed Nov 17, 2020
@psorianom
Author

Oh ok, I missed the DataSilo logs. I will take a look at them. Thanks for the fix!
