Is your feature request related to a problem? Please describe.
The ModularTokenizer is very useful for various biologics tasks and model development. However, it is hard to tell from its Readme which vocabularies are supported, and there are no instructions on how to extend the tokenizer.
Describe the solution you'd like
I suggest the Readme start from the tokenizer user's perspective: open with a simple code snippet showing how to use it (see the sketch below for the kind of example I mean), then describe which vocabularies are supported and how they were created.
For the latter, note that while the amino acid token vocabulary is small, constant, and well known, the gene and cell type vocabularies are large and vary between sources. A user would wonder how these vocabularies were created and from which trusted sources they were taken.
Only then would I add the existing sections on the internals of the tokenizer, how to extend it, etc.
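To make the first suggestion concrete, below is a sketch of the kind of opening snippet the Readme could start with. This is only an illustration: the import path and the `load()`/`encode_list()` names and signatures are assumptions for the sake of the example, not the confirmed ModularTokenizer API.

```python
# Hypothetical usage sketch -- the import path and the load()/encode_list()
# names below are illustrative assumptions, not the confirmed API.
from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer

# Load a pretrained multi-vocabulary tokenizer from its config directory
tokenizer = ModularTokenizer.load(path="path/to/pretrained_tokenizer")

# Encode a mixed input: each sub-sequence is tagged with the vocabulary
# ("AA" = amino acids here) that should be used to tokenize it
encoded = tokenizer.encode_list(
    typed_input_list=[("AA", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")],
    max_len=64,
)
print(encoded.ids)
```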
Additionally, I would highlight the tokenizer in the main readme and point to its internal readme.