support sft mapdataset #8840

Merged (5 commits) on Aug 5, 2024

@greycooker (Contributor) commented Jul 30, 2024

PR types

New Features

PR changes

Add SFTMMapIndexedDataset and SFTMMapIndexedDatasetBuilder

Description

Support offline SFT dataset.
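
For readers unfamiliar with the idea, here is a minimal sketch of what an offline, memory-mapped indexed dataset can look like. It is not the implementation in this PR: `write_examples`, `ToyMMapDataset`, and the `.bin` / `.idx.npy` file layout are hypothetical names chosen for illustration, and only NumPy is assumed. Token sequences are tokenized once, written to a flat binary file, and read back lazily by offset.

```python
import os
import tempfile

import numpy as np


def write_examples(prefix, examples, dtype=np.int32):
    """Hypothetical builder: append each token sequence to <prefix>.bin
    and record cumulative element offsets in <prefix>.idx.npy."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as data_file:
        for tokens in examples:
            arr = np.asarray(tokens, dtype=dtype)
            data_file.write(arr.tobytes(order="C"))
            offsets.append(offsets[-1] + arr.size)
    np.save(prefix + ".idx.npy", np.asarray(offsets, dtype=np.int64))


class ToyMMapDataset:
    """Hypothetical reader: memory-maps the data file and slices it by offset."""

    def __init__(self, prefix, dtype=np.int32):
        self._offsets = np.load(prefix + ".idx.npy")
        # The file is mapped read-only; pages are loaded only when accessed.
        self._data = np.memmap(prefix + ".bin", dtype=dtype, mode="r")

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        start, end = self._offsets[i], self._offsets[i + 1]
        # Copy the slice so callers get a plain in-memory ndarray.
        return np.array(self._data[start:end])


def roundtrip_demo():
    """Build a toy dataset in a temporary directory and read it back."""
    prefix = os.path.join(tempfile.mkdtemp(), "toy_sft")
    write_examples(prefix, [[1, 2, 3], [4, 5]])
    return ToyMMapDataset(prefix)
```

An SFT variant would presumably store several aligned fields per example (for instance, a loss mask next to the token IDs), but that layout is defined by the PR's code, not by this sketch.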

paddle-bot bot commented Jul 30, 2024

Thanks for your contribution!

codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 20.58824% with 162 lines in your changes missing coverage. Please review.

Project coverage is 55.35%. Comparing base (ee4944e) to head (e988cf5).
Report is 10 commits behind head on develop.

Current head e988cf5 differs from pull request most recent head ecb62b6

Please upload reports for the commit ecb62b6 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| paddlenlp/data/indexed_dataset.py | 20.58% | 162 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8840      +/-   ##
===========================================
- Coverage    55.44%   55.35%   -0.10%     
===========================================
  Files          631      631              
  Lines        98542    98782     +240     
===========================================
+ Hits         54632    54676      +44     
- Misses       43910    44106     +196     


@@ -68,6 +69,20 @@ def make_dataset(path, impl, skip_warmup=False):
    return None


def make_sft_dataset(path, impl, dataclass, skip_warmup=False):
Member

Let's just support mmap only, then; there is no need for the check.

Contributor Author

Fixed. It now raises an error directly if impl is not mmap.

Member

Suggestion: make_sft_dataset(path, dataclass, skip_warmup=False, impl="mmap")
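
A minimal sketch of how the suggested signature could combine with the mmap-only behavior agreed on in this thread. The `ValueError` type, the error message, and the placeholder return value are assumptions for illustration; the merged code may differ.

```python
def make_sft_dataset(path, dataclass, skip_warmup=False, impl="mmap"):
    """Sketch of the suggested signature: `impl` moves last and defaults to "mmap"."""
    if impl != "mmap":
        # Assumption: any non-mmap implementation is rejected outright,
        # as described in the reply above. The exception type is a guess.
        raise ValueError(f"SFT datasets only support impl='mmap', got {impl!r}")
    # Placeholder: the real function would build the mmap-backed SFT dataset here.
    return {"path": path, "dataclass": dataclass, "skip_warmup": skip_warmup}
```

With `impl` as a trailing keyword argument, existing callers can drop it entirely, and an unsupported value fails early instead of silently returning `None`.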

@@ -548,13 +574,259 @@ def exists(path):
    return os.path.exists(index_file_path(path)) and os.path.exists(data_file_path(path))


class SFT_MMapIndexedDataset(paddle.io.Dataset):
Member

Use CamelCase for the class name here, without underscores.

paddlenlp/data/indexed_dataset.py
def make_builder(out_file, impl, save_dtype, loss_mask_file=None):
    if impl == "mmap":
        return MMapIndexedDatasetBuilder(out_file, dtype=save_dtype, loss_mask_file=loss_mask_file)
    else:
        return IndexedDatasetBuilder(out_file, dtype=save_dtype)


class SFT_MMapIndexedDatasetBuilder(object):
Member

The same applies to this name.

@wawltor (Collaborator) left a comment

LGTM

@wawltor wawltor merged commit c4d1abf into PaddlePaddle:develop Aug 5, 2024
10 of 12 checks passed
@greycooker greycooker deleted the support_sft_mapdataset branch August 29, 2024 02:41
3 participants