
[BUG] Fix build train valid test datasets #8826

Merged

Conversation

@JunnYu (Member) commented Jul 29, 2024

PR types

Bug fixes

PR changes

APIs

Description

Syncs the updated code from https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/collections/nlp/data/language_modeling/megatron/gpt_dataset.py#L74-L80.
When computing the number of samples, the raw (pre-expansion) count must be used, not the expanded count.
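A minimal sketch of the failure mode, with assumed names and numbers (the ~0.5% over-provisioning factor follows the Megatron/NeMo convention; the actual PaddleNLP values may differ): each component dataset's sample count is padded and rounded up, so the sum of the per-dataset counts exceeds the raw requested size, and sizing the blended index by that sum over-draws from the largest dataset.

import math

# Assumed illustrative numbers: a requested sample count and blending weights.
requested_samples = 3727
demo_weights = [0.7, 0.2, 0.1]

# Each component dataset is over-provisioned by ~0.5% and rounded up
# (Megatron/NeMo convention; an assumption for this sketch).
per_dataset = [int(math.ceil(requested_samples * w * 1.005)) for w in demo_weights]

expanded_size = sum(per_dataset)  # buggy size: larger than requested_samples
raw_size = requested_samples      # fixed size: the raw requested count
print(expanded_size, raw_size)    # e.g. 3747 3727

With the blended index built from the expanded size, `dataset_sample_index` can exceed a component dataset's real length, as the reproduction with `build_blending_indices` below demonstrates.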

import numpy as np

def build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, size, verbose):
    """
    Given multiple datasets and a weighting array, build samples such that it follows those weights.

    Parameters:
    - dataset_index: NumPy array to store the dataset index for each sample.
    - dataset_sample_index: NumPy array to store the sample index within each dataset.
    - weights: array-like of weights for each dataset.
    - num_datasets: Integer, the number of datasets.
    - size: Integer, the total number of samples to generate.
    - verbose: Boolean, whether to print verbose output.
    """
    if verbose:
        print("> building indices for blendable datasets ...")

    # Initialize a buffer for the number of samples drawn from each dataset so far.
    current_samples = np.zeros(num_datasets, dtype=np.int64)

    # For each sample:
    for sample_idx in range(size):
        # Determine which dataset has the largest sampling error, i.e. the
        # largest gap between its target share (weight * samples so far)
        # and the samples actually drawn from it.
        sample_idx_double = max(sample_idx, 1)
        max_error_index = 0
        max_error = weights[0] * sample_idx_double - current_samples[0]
        for dataset_idx in range(1, num_datasets):
            error = weights[dataset_idx] * sample_idx_double - current_samples[dataset_idx]
            if error > max_error:
                max_error = error
                max_error_index = dataset_idx

        # Populate the indices.
        dataset_index[sample_idx] = max_error_index
        dataset_sample_index[sample_idx] = current_samples[max_error_index]
        # Update the per-dataset sample count.
        current_samples[max_error_index] += 1

    # Print the achieved ratios against the requested weights.
    if verbose:
        print(" > sample ratios:")
        for dataset_idx in range(num_datasets):
            ratio = current_samples[dataset_idx] / size
            print(f"   dataset {dataset_idx}, input: {weights[dataset_idx]}, achieved: {ratio}")

weights = [6.76142772e-01, 4.65481872e-03, 1.50378956e-02, 1.98387035e-04,
    5.06176985e-03, 6.97636962e-04, 3.13866567e-03, 3.20165998e-02,
    3.60524667e-03, 6.90657464e-03, 2.26846735e-02, 7.34873296e-03,
    3.92887512e-05, 3.91225911e-03, 1.01479806e-02, 2.12045055e-03,
    7.26073523e-03, 1.52476576e-02, 4.77683574e-03, 6.46679117e-02,
    4.21797692e-02, 6.46304351e-02, 5.13122989e-03, 1.99474356e-03,
    5.01820438e-06, 3.14517551e-05, 7.83280489e-05, 7.54838022e-05,
    1.34179804e-04, 7.24675664e-05]

num_datasets = len(weights)
verbose = False

# Expanded size: dataset 0 contains 2548 samples in total.
expanded_size = 4347
dataset_index = np.zeros(expanded_size, dtype=np.uint8)
dataset_sample_index = np.zeros(expanded_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, expanded_size, verbose)
print("after expansion:", dataset_sample_index[dataset_index == 0], "exceeds dataset 0's total number of samples: 2548")

# Raw size: the requested indices stay within dataset 0's 2548 samples.
raw_size = 3727
dataset_index = np.zeros(raw_size, dtype=np.uint8)
dataset_sample_index = np.zeros(raw_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, raw_size, verbose)
print("before expansion:", dataset_sample_index[dataset_index == 0], "stays within dataset 0's total number of samples: 2548")

# Output:
# after expansion: [   0    1    2 ... 2936 2937 2938] exceeds dataset 0's total number of samples: 2548
# before expansion: [   0    1    2 ... 2516 2517 2518] stays within dataset 0's total number of samples: 2548
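The arithmetic behind the overflow: the greedy loop assigns dataset 0 roughly weights[0] × size samples. With the expanded size, 0.6761 × 4347 ≈ 2939 (indices 0..2938), which exceeds the dataset's actual 2548 samples; with the raw size, 0.6761 × 3727 ≈ 2520, so the loop draws indices 0..2518, which stays in bounds.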


paddle-bot bot commented Jul 29, 2024

Thanks for your contribution!


codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 55.50%. Comparing base (ee4944e) to head (f9457d6).
Report is 228 commits behind head on develop.

Files with missing lines             Patch %   Lines
paddlenlp/data/causal_dataset.py     0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8826      +/-   ##
===========================================
+ Coverage    55.44%   55.50%   +0.06%     
===========================================
  Files          631      631              
  Lines        98542    98544       +2     
===========================================
+ Hits         54632    54699      +67     
+ Misses       43910    43845      -65     


@JunnYu JunnYu requested a review from ZHUI July 31, 2024 03:24
@ZHUI ZHUI merged commit fe7e2fe into PaddlePaddle:develop Jul 31, 2024
10 of 12 checks passed
DrownFish19 pushed a commit to DrownFish19/PaddleNLP that referenced this pull request Aug 2, 2024