The tsets.jsonl file is quite revealing: you have some matches with extreme repetition, especially token id 194284. I'm not sure what it is without having access to the model internals, but it seems to be some word that matches a lot of different GND subjects (2628 to be exact). It could be a common name like "Smith"; if you have a lot of names like "Smith, A." and "Smith, B." (even as altLabels) in GND, then the analyzer in Annif will likely discard the initials (because they are too short to be considered words) and MLLM will just see a lot of concepts all having the same label "smith", which are potential matches every time the word "Smith" appears in the document text.
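To illustrate the suspected effect, here is a minimal sketch of how dropping short tokens can collapse distinct labels into one. This is not Annif's actual analyzer: the `normalize` function, the `MIN_TOKEN_LENGTH` threshold and the concept IDs are all hypothetical and only serve to show the mechanism.

```python
import re
from collections import defaultdict

# Assumed minimum token length, for illustration only;
# the real analyzer's threshold may differ.
MIN_TOKEN_LENGTH = 3


def normalize(label):
    """Lowercase the label, split it into words and drop very short tokens."""
    tokens = re.findall(r"\w+", label.lower())
    return tuple(t for t in tokens if len(t) >= MIN_TOKEN_LENGTH)


# Hypothetical concept IDs and labels, not taken from GND.
labels = {
    "gnd:1": "Smith, A.",
    "gnd:2": "Smith, B.",
    "gnd:3": "Smith, John",
    "gnd:4": "Smithsonian Institution",
}

# Group concepts by their normalized token sequence.
index = defaultdict(list)
for concept_id, label in labels.items():
    index[normalize(label)].append(concept_id)

# Keep only the token sequences shared by more than one concept.
collisions = {tokens: ids for tokens, ids in index.items() if len(ids) > 1}
print(collisions)
```

With these made-up labels, "Smith, A." and "Smith, B." both reduce to the single token "smith", so every occurrence of that word in a document would be a potential match for both concepts. Running a similar grouping over the real vocabulary labels could help find which term is behind token id 194284.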
I'll see if anything can be done to speed up the slow ambiguity calculation, but this is a symptom of matching gone wrong in other ways as well.
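As a rough sketch of why extreme repetition makes this kind of calculation slow, and of one possible way around it: the code below assumes that the ambiguity of a match is the number of other matches whose token set contains it. That definition, both function names and token id 7 are assumptions made for illustration; this is not necessarily how Annif computes the feature, nor the method used in any particular fix.

```python
from collections import Counter


def ambiguity_naive(tsets):
    """Quadratic version: compare every token set against every other one."""
    return [
        sum(1 for j, other in enumerate(tsets) if j != i and other >= ts)
        for i, ts in enumerate(tsets)
    ]


def ambiguity_grouped(tsets):
    """Count identical token sets once, then compare only the distinct ones."""
    counts = Counter(tsets)  # frozensets are hashable
    superset_total = {
        ts: sum(n for other, n in counts.items() if other >= ts)
        for ts in counts
    }
    # Subtract 1 so that a match does not count itself.
    return [superset_total[ts] - 1 for ts in tsets]


# 1000 identical single-token matches plus two others (token id 7 is made up).
tsets = [frozenset({194284})] * 1000 + [frozenset({194284, 7}), frozenset({7})]

result = ambiguity_grouped(tsets)
print(result[0], result[-2], result[-1])
```

With 1002 matches the naive version does about a million set comparisons, while the grouped version compares only the three distinct token sets. Both return the same values under the assumed definition.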
osma linked a pull request that will close this issue (Dec 20, 2024)
Hi @RietdorfC, I've now implemented a new, hopefully much faster method for calculating the ambiguity feature in PR #825. Could you please test the code in that branch? I'm especially interested in the following:
- Does the code run in your environment?
- Does it reduce the train and suggest time for MLLM?
Dear Osma, dear annif-team,
As discussed in the annif-users-group (https://groups.google.com/g/annif-users/c/8d3AL4LAzBQ), I have added the debugging lines and run the suggest operation with an MLLM model trained on the full GND vocabulary set we use (1.4M subjects), on a document with a long processing time (305.72 sec.). Please find the zipped tsets.jsonl file attached to this issue.
Best regards
Clemens
tsets.zip