
DPR: Max length for query tokenizer model encoding #625

Closed
psorianom opened this issue Nov 14, 2020 · 3 comments
@psorianom

Hi all!

Thank you for the work on DPR training (multi-GPU support especially 👍)!

I am currently trying to train a DPR model and I am running into a problem. When the input dicts are transformed into Sample objects, and more specifically when the query is encoded with the tokenizer, the defaults are max_length=self.max_seq_len_query (with self.max_seq_len_query=64) and truncation_strategy="do_not_truncate". This raises a ValueError whenever the query is longer than 64 tokens, here:

        except ValueError:
            cur_tensor = torch.tensor(
                [sample[t_name] for sample in features], dtype=torch.float32
            )

error:

ValueError: expected sequence of length 64 at dim 1 (got 83)
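For context, the error itself simply comes from torch.tensor being handed per-sample sequences of unequal length, which can be reproduced in isolation:

    import torch

    # One query encoded to 64 ids and one to 83 ids (because it was not
    # truncated): stacking them into a single batch tensor fails.
    features = [[0] * 64, [0] * 83]
    torch.tensor(features, dtype=torch.long)
    # ValueError: expected sequence of length 64 at dim 1 (got 83)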

I locally changed the truncation_strategy to longest_first and I no longer get this error. Still, I wonder what the correct approach would be. I believe truncating may be the best option (maybe with a warning to the user), since increasing the max_seq_len_query parameter would only be a case-by-case workaround and would still crash for any user whose queries exceed the new limit.
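For illustration (this is not FARM's exact internal call, just a sketch with a plain Hugging Face tokenizer and its current truncation argument), the difference between the two settings looks like this:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/dpr-question_encoder-single-nq-base"
    )
    long_query = "why does the query tokenizer crash " * 20  # well over 64 tokens

    # Without truncation, the encoding keeps every token of the query,
    # so long queries produce sequences longer than max_seq_len_query.
    ids_full = tokenizer.encode(long_query)
    print(len(ids_full))  # grows with the query, > 64 here

    # With longest_first truncation, every query is capped at 64 ids and
    # the whole batch can be stacked into one tensor.
    ids_trunc = tokenizer.encode(long_query, max_length=64, truncation="longest_first")
    print(len(ids_trunc))  # 64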

I am using Linux Mint 20 with a GPU, on a cloned FARM at commit 2fabc31 (the latest as of this writing).

Thank you !

@Timoeller
Contributor

Nice! Thanks for using the new DPR functionality. @tholor could you look into this? It seems to be a bug on our side.

@tholor
Member

tholor commented Nov 16, 2020

Implemented a fix in #629.

In terms of user notification: we already have the "truncation stats" logged by the DataSilo, which tell you how many of your samples got truncated. We could extend these stats for the BiAdaptiveModel case, where we need two separate sets of stats (one for queries and one for passages).

@tholor tholor closed this as completed Nov 17, 2020
@psorianom
Author

Oh ok, I missed the DataSilo logs. I will take a look at them. Thanks for the fix!
