
Evo2 #694

Open
wants to merge 85 commits into main

Conversation

@jstjohn (Collaborator) commented Feb 19, 2025

Description

This provides an implementation of Evo2 that supports pre-training, fine-tuning, and preprocessing of training data from FASTA files.
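
For context on the preprocessing path: Evo2 tokenizes DNA at the byte level, so preprocessing essentially reads FASTA records and maps each nucleotide character to its byte value, appending an end-of-document token (see the tokenizer tests below). A minimal sketch of that idea in plain Python, with no BioNeMo dependencies; the `read_fasta` helper and the EOD id of 0 are illustrative placeholders, not the package's actual API:

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file (illustrative helper)."""
    header, chunks = None, []
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def byte_tokenize(sequence: str, eod_id: int = 0) -> List[int]:
    """Map each nucleotide to its byte value and append an end-of-document token."""
    return list(sequence.encode("utf-8")) + [eod_id]


# Example: "ACGT" -> [65, 67, 71, 84, 0]
print(byte_tokenize("ACGT"))

# Illustrative use over a FASTA file (path is a placeholder):
# for header, seq in read_fasta(Path("genome.fa")):
#     ids = byte_tokenize(seq)
```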

Known issues

  • 1M context dataset pre-training depends on a nearly finished commit to Megatron-LM.
  • Verification of accuracy has been completed for the 7B-parameter, 8k-context setting. Analysis of the other settings is in progress.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

cspades and others added 30 commits November 16, 2024 10:46
… debt in tokenizer and config, remove unused args in infer.py.
…d add transcript splicing script for preprocessing.
@codecov-commenter commented Feb 19, 2025

❌ 14 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --------------- | ------ | ------ | ------- |
| 919             | 14     | 905    | 12      |
View the top 3 failed test(s) by shortest run time
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_appends_eod_token
Stack Traces | 0.001s run time
@pytest.fixture
    def tokenizer() -> Evo2Tokenizer:
>       return Evo2Tokenizer(Evo2PreprocessingConfig())

.../evo2/data/test_tokenizer.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../evo2/data/tokenizer.py:34: in __init__
    self.tokenizer: TokenizerSpec = get_nmt_tokenizer(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

library = 'byte-level', model_name = None, tokenizer_model = None
vocab_file = None, merges_file = None, special_tokens = {}, use_fast = False
bpe_dropout = 0.0, r2l = False, legacy = False, delimiter = None
trust_remote_code = False, chat_template = None, vocab_size = None

    def get_nmt_tokenizer(
        library: str = 'sentencepiece',
        model_name: Optional[str] = None,
        tokenizer_model: Optional[str] = None,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        special_tokens: Optional[Dict[str, str]] = None,
        use_fast: Optional[bool] = False,
        bpe_dropout: Optional[float] = 0.0,
        r2l: Optional[bool] = False,
        legacy: Optional[bool] = False,
        delimiter: Optional[str] = None,
        trust_remote_code: Optional[bool] = False,
        chat_template: Optional[Dict] = None,
        vocab_size: Optional[int] = None,
    ):
        """
        Args:
            model_name: if using a pretrained model from NeMo, HuggingFace, or Megatron
            tokenizer_model: tokenizer model file of sentencepiece
            special_tokens: dict of special tokens
            vocab_file: path to vocab file
            use_fast: (only for HuggingFace AutoTokenizer) set to True to use fast HuggingFace tokenizer
            bpe_dropout: (experimental) BPE dropout tries to corrupt the standard segmentation procedure
                of BPE to help model better learn word compositionality and become robust to segmentation errors.
                It has empirically been shown to improve inference time BLEU scores.
            r2l: Whether to return subword IDs from right to left
        """
        import omegaconf
        from omegaconf import OmegaConf
    
        if isinstance(special_tokens, (omegaconf.listconfig.ListConfig, omegaconf.dictconfig.DictConfig)):
            special_tokens = OmegaConf.to_container(special_tokens)
        if special_tokens is None:
            special_tokens_dict = {}
        else:
            special_tokens_dict = special_tokens
    
        if (library != 'byte-level') and (
            model_name is None and (tokenizer_model is None or not os.path.isfile(tokenizer_model))
        ):
            raise ValueError("No Tokenizer path provided or file does not exist!")
    
        if library == 'huggingface':
            from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
    
            logging.info(f'Getting HuggingFace AutoTokenizer with pretrained_model_name: {model_name}')
            return AutoTokenizer(
                pretrained_model_name=model_name,
                vocab_file=vocab_file,
                merges_file=merges_file,
                **special_tokens_dict,
                use_fast=use_fast,
                trust_remote_code=trust_remote_code,
            )
        elif library == 'sentencepiece':
            from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
    
            logging.info(f'Getting SentencePiece with model: {tokenizer_model}')
    
            return SentencePieceTokenizer(
                model_path=tokenizer_model,
                special_tokens=special_tokens,
                legacy=legacy,
                chat_template=chat_template,
            )
        elif library == 'byte-level':
            from nemo.collections.common.tokenizers.bytelevel_tokenizers import ByteLevelTokenizer
    
            logging.info(f'Using byte-level tokenization')
>           return ByteLevelTokenizer(special_tokens_dict)
E           TypeError: Can't instantiate abstract class ByteLevelTokenizer without an implementation for abstract methods 'ids_to_text', 'ids_to_tokens', 'text_to_ids', 'text_to_tokens', 'tokens_to_ids', 'tokens_to_text'

.../local/lib/python3.12.../modules/common/tokenizer_utils.py:215: TypeError
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_handles_long_dna_sequence
Stack Traces | 0.001s run time
(Same trace as test_tokenizer_appends_eod_token above: the Evo2Tokenizer fixture fails in get_nmt_tokenizer with the identical ByteLevelTokenizer TypeError.)
sub-packages/bionemo-evo2/tests/bionemo/evo2/data/test_tokenizer.py::test_tokenizer_pads_sequence_to_required_length
Stack Traces | 0.001s run time
(Same trace as test_tokenizer_appends_eod_token above.)
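
For context on the failure: the TypeError means that, under the installed NeMo version, ByteLevelTokenizer is abstract because it does not implement the six methods required by the tokenizer interface. Purely as an illustration of what those six methods do for byte-level DNA text (a standalone sketch, not NeMo's ByteLevelTokenizer and not the fix applied in this PR):

```python
from typing import List


class SimpleByteLevelTokenizer:
    """Illustrative byte-level tokenizer implementing the six abstract methods
    named in the TypeError above (not NeMo's actual ByteLevelTokenizer)."""

    vocab_size = 256  # one id per possible byte value

    def text_to_tokens(self, text: str) -> List[str]:
        # Each UTF-8 byte becomes a single-character token.
        return [chr(b) for b in text.encode("utf-8")]

    def tokens_to_text(self, tokens: List[str]) -> str:
        return "".join(tokens)

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [ord(t) for t in tokens]

    def ids_to_tokens(self, ids: List[int]) -> List[str]:
        return [chr(i) for i in ids]

    def text_to_ids(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def ids_to_text(self, ids: List[int]) -> str:
        return bytes(ids).decode("utf-8")


tok = SimpleByteLevelTokenizer()
assert tok.ids_to_text(tok.text_to_ids("ACGT")) == "ACGT"
```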


Comment on lines -49 to +50
-    ["download_bionemo_data", "--source", "ngc", "single_cell/testdata-20240506"],
+    ["download_bionemo_data", "--source", "pbss", "single_cell/testdata-20240506"],
Collaborator

Is this intentional?

Collaborator (Author)

I don't think it is; we should switch this back as part of moving everything into NGC before we merge.

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
…r than the old code

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
@jstjohn (Collaborator, Author) commented Feb 20, 2025

/build-ci

jwilber and others added 3 commits February 20, 2025 13:56
…697)

Point megatron to commit with our fix merged.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>