docs: add warning about deduplication #102

kjappelbaum · 2024-08-25T17:55:58Z

Summary by Sourcery

Add a warning in the documentation about the limitations of deduplication in pretraining datasets, advising users on potential data leakage issues.

Documentation:

Add a warning about the potential issues with deduplication in pretraining datasets, highlighting that datasets are only deduplicated based on the CIF string, which may lead to data leakage if structures with slightly translated positions are present.
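To make the leakage scenario concrete, here is a minimal sketch (not MatText code). It assumes a toy CIF-like serialization and uses a SHA-256 hash of the text as a stand-in for exact string matching; the helper names `string_key`, `shift_fractional`, and `to_toy_cif` are hypothetical.

```python
import hashlib


def string_key(cif_text: str) -> str:
    """Exact-match key: hash of the raw text (assumed stand-in for string-based deduplication)."""
    return hashlib.sha256(cif_text.encode("utf-8")).hexdigest()


def shift_fractional(sites, delta):
    """Translate every fractional coordinate by delta and wrap back into [0, 1)."""
    return [
        (element, tuple(round((c + d) % 1.0, 6) for c, d in zip(xyz, delta)))
        for element, xyz in sites
    ]


def to_toy_cif(sites):
    """Serialize sites as CIF-like atom lines (toy format, not a full CIF file)."""
    return "\n".join(f"{el} {x:.4f} {y:.4f} {z:.4f}" for el, (x, y, z) in sites)


# Two descriptions of the same periodic structure: the second is the first
# with every atom shifted by 0.01 along the a-axis.
original = [("Na", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))]
translated = shift_fractional(original, (0.01, 0.0, 0.0))

keys = {string_key(to_toy_cif(sites)) for sites in (original, translated)}
print(len(keys))  # number of entries that survive string-based deduplication
```

A rigid translation of all atoms in a periodic cell leaves the crystal unchanged, so both entries describe the same structure, yet their strings differ and exact matching keeps both. If one copy lands in the pretraining set and the other in a downstream test set, that is the leakage the warning refers to.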

sourcery-ai · 2024-08-25T17:56:08Z

Reviewer's Guide by Sourcery

This pull request adds a warning about deduplication in the benchmarking documentation. It explains that the pretraining datasets in MatText are only deduplicated based on the CIF string, which may lead to data leakage issues in certain use cases. The PR also includes minor formatting and whitespace changes throughout the document.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Added a warning about deduplication in pretraining datasets | Explained that pretraining datasets are only deduplicated based on the CIF string; warned about potential data leakage issues; suggested using other representations for deduplication if needed | `docs/benchmarking.md` |
| Made minor formatting and whitespace adjustments | Removed trailing whitespace; adjusted line breaks and indentation; fixed typos and improved readability | `docs/benchmarking.md` |
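The suggestion to deduplicate on another representation could look roughly like the sketch below. The `deduplicate` helper and the record fields `cif` and `composition` are assumptions made for illustration, not part of the MatText API.

```python
from typing import Callable, Hashable, Iterable


def deduplicate(records: Iterable[dict], key_fn: Callable[[dict], Hashable]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = key_fn(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: two translated copies of one structure plus a distinct one.
records = [
    {"cif": "Na 0.00 0.00 0.00\nCl 0.50 0.50 0.50", "composition": "NaCl"},
    {"cif": "Na 0.01 0.00 0.00\nCl 0.51 0.50 0.50", "composition": "NaCl"},
    {"cif": "K 0.00 0.00 0.00\nCl 0.50 0.50 0.50", "composition": "KCl"},
]

by_cif = deduplicate(records, lambda r: r["cif"])
by_composition = deduplicate(records, lambda r: r["composition"])
print(len(by_cif), len(by_composition))
```

Passing the key as a function keeps the choice of representation open. Note the trade-off: a coarser key such as composition removes translated copies but also merges genuinely different polymorphs that share a formula, so the right key depends on the use case.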


sourcery-ai

Hey @kjappelbaum - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Documentation: 3 issues found


sourcery-ai · 2024-08-25T17:57:00Z

docs/benchmarking.md

-python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint  
+python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
 ```


-Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. 
+Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.
 You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`.




nitpick (documentation): Remove trailing whitespace

There are several instances of trailing whitespace throughout the document. Consider removing all trailing whitespace for consistency.

kjappelbaum · 2024-08-25

Can you make this into a suggestion I can apply?

sourcery-ai · 2024-08-25

Certainly! Here's a suggestion you can apply to remove the trailing whitespace:

Suggested change

python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint

```

Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`.

You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`.

You can apply this suggestion directly to the diff hunk. Let me know if you need further assistance!
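For cleaning a whole file rather than one hunk, the trailing-whitespace removal discussed in this thread can be sketched as below. This is an illustrative helper, not something the PR ships; the function name `strip_trailing_whitespace` is hypothetical, and the sketch assumes LF line endings.

```python
import re

# Spaces or tabs immediately before the end of any line.
TRAILING_WS = re.compile(r"[ \t]+$", flags=re.MULTILINE)


def strip_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line, leaving newlines intact."""
    return TRAILING_WS.sub("", text)


sample = (
    "python main.py -cn=benchmark  \n"
    "Here, the base config is `benchmark_example.yaml`. \n"
)
cleaned = strip_trailing_whitespace(sample)
print(repr(cleaned))
```

One caveat when applying this to Markdown: two trailing spaces at the end of a line encode a hard line break, so a blanket cleanup can change rendering where such breaks were intentional.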

sourcery-ai · 2024-08-25T17:57:00Z

docs/benchmarking.md

 ```bash
 python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32
 ```


suggestion (documentation): Standardize code block formatting

Some code blocks use triple backticks with a language specifier (e.g., ```bash), while others use just triple backticks. Consider standardizing all code blocks to include the language specifier for consistency and improved syntax highlighting.
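A small check along these lines could flag the fences that lack a language specifier. It is a sketch under simplifying assumptions: only three-backtick fences are considered, fences are assumed to alternate strictly between opening and closing, and the name `fences_missing_language` is hypothetical.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language specifier."""
    missing = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True  # this fence opens a new block
            if stripped == FENCE:
                missing.append(number)
    return missing


sample = "\n".join([
    FENCE + "bash",
    "python main.py -cn=pretrain",
    FENCE,
    "",
    FENCE,
    "python main.py -cn=benchmark",
    FENCE,
])
print(fences_missing_language(sample))
```

In the sample, the first block opens with a `bash` specifier and passes, while the second block opens with a bare fence and is reported by its line number.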

docs/benchmarking.md

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

n0w0f

Looks great to me. Thanks @kjappelbaum

kjappelbaum added 2 commits August 23, 2024 21:35

docs: add warning about deduplication

cbe5fed

add warning in docs

be44bf4

kjappelbaum requested a review from n0w0f August 25, 2024 17:55

sourcery-ai bot reviewed Aug 25, 2024

Update docs/benchmarking.md

e7e8a27

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

kjappelbaum linked an issue Aug 25, 2024 that may be closed by this pull request

Add docs on duplicate structures in pre-train dataset #101

Closed

n0w0f approved these changes Aug 26, 2024

n0w0f merged commit d435e66 into main Aug 26, 2024
1 of 2 checks passed

n0w0f deleted the deduplication-docs branch August 26, 2024 11:19
