docs: add warning about deduplication #102

Merged (3 commits) on Aug 26, 2024

Conversation

Contributor

@kjappelbaum commented Aug 25, 2024

[Screenshot: 2024-08-25 at 19:55:26]

Summary by Sourcery

Add a warning in the documentation about the limitations of deduplication in pretraining datasets, advising users on potential data leakage issues.

Documentation:

  • Add a warning about the potential issues with deduplication in pretraining datasets, highlighting that datasets are only deduplicated based on the CIF string, which may lead to data leakage if structures with slightly translated positions are present.
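As a rough illustration of the leakage concern (a minimal sketch, not MatText's actual pipeline): two entries whose CIF strings differ only by a shift of the atomic positions survive exact-string deduplication, but collapse when a different representation is used as the key. The record layout, the simplified coordinate strings, and the `deduplicate` helper are all hypothetical; composition is used as the alternative key only because it appears as a representation in the commands in this PR.

```python
from typing import Callable, Dict, Hashable, List

Record = Dict[str, str]


def deduplicate(records: List[Record], key: Callable[[Record], Hashable]) -> List[Record]:
    """Keep the first record seen for each distinct key value."""
    seen = set()
    unique = []
    for record in records:
        value = key(record)
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Hypothetical toy records: the first two describe the same structure,
# with all positions shifted by 0.1 along the first axis.
records = [
    {"cif": "Na 0.0 0.0 0.0\nCl 0.5 0.5 0.5", "composition": "NaCl"},
    {"cif": "Na 0.1 0.0 0.0\nCl 0.6 0.5 0.5", "composition": "NaCl"},
    {"cif": "K 0.0 0.0 0.0\nCl 0.5 0.5 0.5", "composition": "KCl"},
]

# Exact string matching treats the shifted copy as a new structure.
by_cif = deduplicate(records, key=lambda r: r["cif"])

# Keying on another representation collapses the shifted copy.
by_composition = deduplicate(records, key=lambda r: r["composition"])

print(len(by_cif), len(by_composition))
```

Composition is a very coarse key (it would also merge distinct polymorphs), so which representation is appropriate depends on the use case.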

@kjappelbaum requested a review from @n0w0f on August 25, 2024 at 17:55
Contributor

sourcery-ai bot commented Aug 25, 2024

Reviewer's Guide by Sourcery

This pull request adds a warning about deduplication in the benchmarking documentation. It explains that the pretraining datasets in MatText are only deduplicated based on the CIF string, which may lead to data leakage issues in certain use cases. The PR also includes minor formatting and whitespace changes throughout the document.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Added a warning about deduplication in pretraining datasets | Explained that pretraining datasets are only deduplicated based on the CIF string; warned about potential data leakage issues; suggested using other representations for deduplication if needed | docs/benchmarking.md |
| Made minor formatting and whitespace adjustments | Removed trailing whitespace; adjusted line breaks and indentation; fixed typos and improved readability | docs/benchmarking.md |

Contributor

@sourcery-ai bot left a comment

Hey @kjappelbaum - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 3 issues found


Comment on lines -43 to 50
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
```


Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.
You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`.


Contributor

nitpick (documentation): Remove trailing whitespace

There are several instances of trailing whitespace throughout the document. Consider removing all trailing whitespace for consistency.

Contributor Author

can you make this into a suggestion I can apply?

Contributor

Certainly! Here's a suggestion you can apply to remove the trailing whitespace:

Suggested change
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
```
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.
You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`.
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.

You can apply this suggestion directly to the diff hunk. Let me know if you need further assistance!

Comment on lines 15 to 17
```bash
python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32
```
Contributor

suggestion (documentation): Standardize code block formatting

Some code blocks use triple backticks with a language specifier (e.g., ```bash), while others use just triple backticks. Consider standardizing all code blocks to include the language specifier for consistency and improved syntax highlighting.
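
One way to locate such unlabeled fences is a small scan over the Markdown source. This is a simplified sketch, assuming backtick fences only (no tilde fences and no nested fences); `find_unlabeled_fences` is a hypothetical helper, not part of Sourcery or MatText.

```python
from typing import List


def find_unlabeled_fences(markdown: str) -> List[int]:
    """Return 1-based line numbers of opening fences without a language specifier."""
    unlabeled = []
    inside_block = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("```"):
            continue
        if inside_block:
            # This fence closes the current block.
            inside_block = False
        else:
            # This fence opens a block; flag it if nothing follows the backticks.
            inside_block = True
            if stripped.strip("`") == "":
                unlabeled.append(number)
    return unlabeled


# Toy input: the first block is labeled `bash`, the second has no label.
sample = "```bash\npython main.py -cn=pretrain\n```\n\n```\npython main.py -cn=benchmark\n```\n"
unlabeled_lines = find_unlabeled_fences(sample)
print(unlabeled_lines)
```

Running it over a docs file would list the line numbers where a language specifier such as `bash` could be added.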

docs/benchmarking.md (outdated review thread, resolved)
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@kjappelbaum linked an issue on Aug 25, 2024 that may be closed by this pull request
Collaborator

@n0w0f left a comment

Looks great to me. Thanks @kjappelbaum

@n0w0f merged commit d435e66 into main on Aug 26, 2024
1 of 2 checks passed
@n0w0f deleted the deduplication-docs branch on August 26, 2024 at 11:19
Development

Successfully merging this pull request may close these issues.

Add docs on duplicate structures in pre-train dataset
2 participants