Is your feature request related to a problem? Please describe.
The ModularTokenizer is very useful for various biologics tasks and model development. However, it is hard to tell from its Readme which vocabularies are supported, and there are no instructions on how to extend the tokenizer.
Describe the solution you'd like
I suggest the Readme start from the tokenizer user's perspective: open with a simple code snippet showing how to use it (see the sketch below for the kind of example I mean), then describe which vocabularies are supported and how they were created.
For the latter, note that while the amino acid token vocabulary is small, constant, and well known, the gene and cell type vocabularies are large and vary between sources. A user would wonder how these vocabularies were created and from which trusted sources they were taken.
Only then would I add the existing sections on the internals of the tokenizer, how to extend it, etc.
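To make the first suggestion concrete, below is a sketch of the kind of opening snippet the Readme could start with. This is only an illustration: the import path and the `load()`/`encode_list()` names and signatures are assumptions for the sake of the example, not the confirmed ModularTokenizer API.

```python
# Hypothetical usage sketch -- the import path and the load()/encode_list()
# names below are illustrative assumptions, not the confirmed API.
from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer

# Load a pretrained multi-vocabulary tokenizer from its config directory
tokenizer = ModularTokenizer.load(path="path/to/pretrained_tokenizer")

# Encode a mixed input: each sub-sequence is tagged with the vocabulary
# ("AA" = amino acids here) that should be used to tokenize it
encoded = tokenizer.encode_list(
    typed_input_list=[("AA", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")],
    max_len=64,
)
print(encoded.ids)
```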
Additionally, I would highlight the tokenizer in the main readme and point to its internal readme.