-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add warning about deduplication #102
Conversation
Reviewer's Guide by SourceryThis pull request adds a warning about deduplication in the benchmarking documentation. It explains that the pretraining datasets in MatText are only deduplicated based on the CIF string, which may lead to data leakage issues in certain use cases. The PR also includes minor formatting and whitespace changes throughout the document. File-Level Changes
Tips
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kjappelbaum - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟡 Documentation: 3 issues found
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint | ||
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint | ||
``` | ||
|
||
|
||
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. | ||
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. | ||
You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick (documentation): Remove trailing whitespace
There are several instances of trailing whitespace throughout the document. Consider removing all trailing whitespace for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this into a suggestion I can apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly! Here's a suggestion you can apply to remove the trailing whitespace:
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint | |
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint | |
``` | |
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. | |
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. | |
You can define the parameters for the experiment hence at `\conf\model\benchmark_example.yaml`. | |
Here, for the benchmarking pipeline(`-cn=benchmark`) the base config is `benchmark_example.yaml`. |
You can apply this suggestion directly to the diff hunk. Let me know if you need further assistance!
```bash | ||
python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32 | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (documentation): Standardize code block formatting
Some code blocks use triple backticks with a language specifier (e.g., ```bash), while others use just triple backticks. Consider standardizing all code blocks to include the language specifier for consistency and improved syntax highlighting.
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me. Thanks @kjappelbaum
Summary by Sourcery
Add a warning in the documentation about the limitations of deduplication in pretraining datasets, advising users on potential data leakage issues.
Documentation: