
Evo2 #694

Open
wants to merge 85 commits into main

Conversation

@jstjohn (Collaborator) commented Feb 19, 2025

Description

This provides an implementation of Evo2 that supports pre-training, fine-tuning, and preprocessing of training data from FASTA files.
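
For context on the preprocessing path: Evo2 tokenizes DNA at the byte level, so preprocessing essentially reads FASTA records and maps each nucleotide character to its byte value, appending an end-of-document token (see the tokenizer tests below). A minimal sketch of that idea in plain Python, with no BioNeMo dependencies; the `read_fasta` helper and the EOD id of 0 are illustrative placeholders, not the package's actual API:

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file (illustrative helper)."""
    header, chunks = None, []
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def byte_tokenize(sequence: str, eod_id: int = 0) -> List[int]:
    """Map each nucleotide to its byte value and append an end-of-document token."""
    return list(sequence.encode("utf-8")) + [eod_id]


# Example: "ACGT" -> [65, 67, 71, 84, 0]
print(byte_tokenize("ACGT"))

# Illustrative use over a FASTA file (path is a placeholder):
# for header, seq in read_fasta(Path("genome.fa")):
#     ids = byte_tokenize(seq)
```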

Known issues

  • 1M context dataset pre-training depends on a nearly finished commit to Megatron-LM.
  • Verification of accuracy has been completed for the 7B-parameter, 8k-context setting. Analysis of the other settings is in progress.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

cspades and others added 30 commits November 16, 2024 10:46
… debt in tokenizer and config, remove unused args in infer.py.
…d add transcript splicing script for preprocessing.
@codecov-commenter commented Feb 19, 2025

❌ 14 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --------------- | ------ | ------ | ------- |
| 919             | 14     | 905    | 12      |
View the top 3 failed test(s) by shortest run time
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_appends_eod_token
Stack Traces | 0.001s run time
@pytest.fixture
    def tokenizer() -> Evo2Tokenizer:
>       return Evo2Tokenizer(Evo2PreprocessingConfig())

.../evo2/data/test_tokenizer.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../evo2/data/tokenizer.py:34: in __init__
    self.tokenizer: TokenizerSpec = get_nmt_tokenizer(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

library = 'byte-level', model_name = None, tokenizer_model = None
vocab_file = None, merges_file = None, special_tokens = {}, use_fast = False
bpe_dropout = 0.0, r2l = False, legacy = False, delimiter = None
trust_remote_code = False, chat_template = None, vocab_size = None

    def get_nmt_tokenizer(
        library: str = 'sentencepiece',
        model_name: Optional[str] = None,
        tokenizer_model: Optional[str] = None,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        special_tokens: Optional[Dict[str, str]] = None,
        use_fast: Optional[bool] = False,
        bpe_dropout: Optional[float] = 0.0,
        r2l: Optional[bool] = False,
        legacy: Optional[bool] = False,
        delimiter: Optional[str] = None,
        trust_remote_code: Optional[bool] = False,
        chat_template: Optional[Dict] = None,
        vocab_size: Optional[int] = None,
    ):
        """
        Args:
            model_name: if using a pretrained model from NeMo, HuggingFace, or Megatron
            tokenizer_model: tokenizer model file of sentencepiece
            special_tokens: dict of special tokens
            vocab_file: path to vocab file
            use_fast: (only for HuggingFace AutoTokenizer) set to True to use fast HuggingFace tokenizer
            bpe_dropout: (experimental) BPE dropout tries to corrupt the standard segmentation procedure
                of BPE to help model better learn word compositionality and become robust to segmentation errors.
                It has empirically been shown to improve inference time BLEU scores.
            r2l: Whether to return subword IDs from right to left
        """
        import omegaconf
        from omegaconf import OmegaConf
    
        if isinstance(special_tokens, (omegaconf.listconfig.ListConfig, omegaconf.dictconfig.DictConfig)):
            special_tokens = OmegaConf.to_container(special_tokens)
        if special_tokens is None:
            special_tokens_dict = {}
        else:
            special_tokens_dict = special_tokens
    
        if (library != 'byte-level') and (
            model_name is None and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
        ):
            raise ValueError("No Tokenizer path provided or file does not exist!")
    
        if library == 'huggingface':
            from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
    
            logging.info(f'Getting HuggingFace AutoTokenizer with pretrained_model_name: {model_name}')
            return AutoTokenizer(
                pretrained_model_name=model_name,
                vocab_file=vocab_file,
                merges_file=merges_file,
                **special_tokens_dict,
                use_fast=use_fast,
                trust_remote_code=trust_remote_code,
            )
        elif library == 'sentencepiece':
            from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
    
            logging.info(f'Getting SentencePiece with model: {tokenizer_model}')
    
            return SentencePieceTokenizer(
                model_path=tokenizer_model,
                special_tokens=special_tokens,
                legacy=legacy,
                chat_template=chat_template,
            )
        elif library == 'byte-level':
            from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelTokenizer
    
            logging.info(f'Using byte-level tokenization')
>           return ByteLevelTokenizer(special_tokens_dict)
E           TypeError: Can't instantiate abstract class ByteLevelTokenizer without an implementation for abstract methods 'ids_to_text', 'ids_to_tokens', 'text_to_ids', 'text_to_tokens', 'tokens_to_ids', 'tokens_to_text'

.../local/lib/python3.12.../modules/common/tokenizer_utils.py:215: TypeError
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_handles_long_dna_sequence
Stack Traces | 0.001s run time
(Same trace as test_tokenizer_appends_eod_token above: the Evo2Tokenizer fixture fails in get_nmt_tokenizer with the identical ByteLevelTokenizer TypeError.)
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_pads_sequence_to_required_length
Stack Traces | 0.001s run time
(Same trace as test_tokenizer_appends_eod_token above.)
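
For context on the failure: the TypeError means that, under the installed NeMo version, ByteLevelTokenizer is abstract because it does not implement the six methods required by the tokenizer interface. Purely as an illustration of what those six methods do for byte-level DNA text (a standalone sketch, not NeMo's ByteLevelTokenizer and not the fix applied in this PR):

```python
from typing import List


class SimpleByteLevelTokenizer:
    """Illustrative byte-level tokenizer implementing the six abstract methods
    named in the TypeError above (not NeMo's actual ByteLevelTokenizer)."""

    vocab_size = 256  # one id per possible byte value

    def text_to_tokens(self, text: str) -> List[str]:
        # Each UTF-8 byte becomes a single-character token.
        return [chr(b) for b in text.encode("utf-8")]

    def tokens_to_text(self, tokens: List[str]) -> str:
        return "".join(tokens)

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [ord(t) for t in tokens]

    def ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [chr(i) for i in ids]

    def text_to_ids(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def ids_to_text(self, ids: List[int]) -> str:
        return bytes(ids).decode("utf-8")


tok = SimpleByteLevelTokenizer()
assert tok.ids_to_text(tok.text_to_ids("ACGT")) == "ACGT"
```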


Comment on lines -49 to +50
-    ["download_bionemo_data", "--source", "ngc", "single_cell/testdata-20240506"],
+    ["download_bionemo_data", "--source", "pbss", "single_cell/testdata-20240506"],
Collaborator

Is this intentional?

Collaborator (Author)

I don't think it is; we should switch this back as part of moving everything into NGC before we merge.

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
…r than the old code

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
@jstjohn (Collaborator, Author) commented Feb 20, 2025

/build-ci

jwilber and others added 3 commits February 20, 2025 13:56
…697)

Point megatron to commit with our fix merged.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>