
[BUG] Fix build train valid test datasets #8826

Merged

Conversation

@JunnYu (Member) commented Jul 29, 2024

PR types

Bug fixes

PR changes

APIs

Description

Syncs the updated code from https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/collections/nlp/data/language_modeling/megatron/gpt_dataset.py#L74-L80.
When computing the number of samples, the raw (pre-expansion) count must be used, not the expanded count.
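A minimal sketch of the failure mode, with assumed names and numbers (the ~0.5% over-provisioning factor follows the Megatron/NeMo convention; the actual PaddleNLP values may differ): each component dataset's sample count is padded and rounded up, so the sum of the per-dataset counts exceeds the raw requested size, and sizing the blended index by that sum over-draws from the largest dataset.

import math

# Assumed illustrative numbers: a requested sample count and blending weights.
requested_samples = 3727
demo_weights = [0.7, 0.2, 0.1]

# Each component dataset is over-provisioned by ~0.5% and rounded up
# (Megatron/NeMo convention; an assumption for this sketch).
per_dataset = [int(math.ceil(requested_samples * w * 1.005)) for w in demo_weights]

expanded_size = sum(per_dataset)  # buggy size: larger than requested_samples
raw_size = requested_samples      # fixed size: the raw requested count
print(expanded_size, raw_size)    # e.g. 3747 3727

With the blended index built from the expanded size, `dataset_sample_index` can exceed a component dataset's real length, as the reproduction with `build_blending_indices` below demonstrates.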

import numpy as np

def build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, size, verbose):
    """
    Given multiple datasets and a weighting array, build samples such that it follows those weights.

    Parameters:
    - dataset_index: NumPy array to store the dataset index for each sample.
    - dataset_sample_index: NumPy array to store the sample index within each dataset.
    - weights: array-like of weights for each dataset.
    - num_datasets: Integer, the number of datasets.
    - size: Integer, the total number of samples to generate.
    - verbose: Boolean, whether to print verbose output.
    """
    if verbose:
        print("> building indices for blendable datasets ...")

    # Initialize a buffer for the number of samples drawn from each dataset so far.
    current_samples = np.zeros(num_datasets, dtype=np.int64)

    # For each sample:
    for sample_idx in range(size):
        # Determine which dataset has the largest sampling error, i.e. the
        # largest gap between its target share (weight * samples so far)
        # and the samples actually drawn from it.
        sample_idx_double = max(sample_idx, 1)
        max_error_index = 0
        max_error = weights[0] * sample_idx_double - current_samples[0]
        for dataset_idx in range(1, num_datasets):
            error = weights[dataset_idx] * sample_idx_double - current_samples[dataset_idx]
            if error > max_error:
                max_error = error
                max_error_index = dataset_idx

        # Populate the indices.
        dataset_index[sample_idx] = max_error_index
        dataset_sample_index[sample_idx] = current_samples[max_error_index]
        # Update the per-dataset sample count.
        current_samples[max_error_index] += 1

    # Print the achieved ratios against the requested weights.
    if verbose:
        print(" > sample ratios:")
        for dataset_idx in range(num_datasets):
            ratio = current_samples[dataset_idx] / size
            print(f"   dataset {dataset_idx}, input: {weights[dataset_idx]}, achieved: {ratio}")

weights = [6.76142772e-01, 4.65481872e-03, 1.50378956e-02, 1.98387035e-04,
    5.06176985e-03, 6.97636962e-04, 3.13866567e-03, 3.20165998e-02,
    3.60524667e-03, 6.90657464e-03, 2.26846735e-02, 7.34873296e-03,
    3.92887512e-05, 3.91225911e-03, 1.01479806e-02, 2.12045055e-03,
    7.26073523e-03, 1.52476576e-02, 4.77683574e-03, 6.46679117e-02,
    4.21797692e-02, 6.46304351e-02, 5.13122989e-03, 1.99474356e-03,
    5.01820438e-06, 3.14517551e-05, 7.83280489e-05, 7.54838022e-05,
    1.34179804e-04, 7.24675664e-05]

num_datasets = len(weights)
verbose = False

# Expanded size: dataset 0 contains 2548 samples in total.
expanded_size = 4347
dataset_index = np.zeros(expanded_size, dtype=np.uint8)
dataset_sample_index = np.zeros(expanded_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, expanded_size, verbose)
print("after expansion:", dataset_sample_index[dataset_index == 0], "exceeds dataset 0's total number of samples: 2548")

# Raw size: the requested indices stay within dataset 0's 2548 samples.
raw_size = 3727
dataset_index = np.zeros(raw_size, dtype=np.uint8)
dataset_sample_index = np.zeros(raw_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, raw_size, verbose)
print("before expansion:", dataset_sample_index[dataset_index == 0], "stays within dataset 0's total number of samples: 2548")

# Output:
# after expansion: [   0    1    2 ... 2936 2937 2938] exceeds dataset 0's total number of samples: 2548
# before expansion: [   0    1    2 ... 2516 2517 2518] stays within dataset 0's total number of samples: 2548
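The arithmetic behind the overflow: the greedy loop assigns dataset 0 roughly weights[0] × size samples. With the expanded size, 0.6761 × 4347 ≈ 2939 (indices 0..2938), which exceeds the dataset's actual 2548 samples; with the raw size, 0.6761 × 3727 ≈ 2520, so the loop draws indices 0..2518, which stays in bounds.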


paddle-bot bot commented Jul 29, 2024

Thanks for your contribution!


codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 55.50%. Comparing base (ee4944e) to head (f9457d6).
Report is 228 commits behind head on develop.

Files with missing lines             Patch %   Lines
paddlenlp/data/causal_dataset.py     0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8826      +/-   ##
===========================================
+ Coverage    55.44%   55.50%   +0.06%     
===========================================
  Files          631      631              
  Lines        98542    98544       +2     
===========================================
+ Hits         54632    54699      +67     
+ Misses       43910    43845      -65     


@JunnYu JunnYu requested a review from ZHUI July 31, 2024 03:24
@ZHUI ZHUI merged commit fe7e2fe into PaddlePaddle:develop Jul 31, 2024
10 of 12 checks passed
DrownFish19 pushed a commit to DrownFish19/PaddleNLP that referenced this pull request Aug 2, 2024