Automate NLTK datapackage punkt_tab download #803

Merged
juhoinkinen merged 4 commits into main from automate-nltkdata-download
Sep 30, 2024

@juhoinkinen juhoinkinen commented Sep 19, 2024

Based on the update-dependencies-v1.2 branch (PR #796), because that branch upgrades NLTK to a version that needs the punkt_tab package instead of punkt (see nltk/nltk#3283).

The download is performed automatically when instantiating an Analyzer object, if the punkt_tab data is not found. The README.md has been adjusted for this (i.e. removed the instructions to run python -m nltk.downloader punkt_tab). NB: Adjust Wiki pages too on release.
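
A minimal sketch of this check-then-download behaviour, under stated assumptions and not the actual code of this PR: the helper name `ensure_nltk_data` and its `find`/`download` parameters are hypothetical and exist only so the logic can be exercised without network access. `nltk.data.find` and `nltk.download` are the standard NLTK calls; the resource path `tokenizers/punkt_tab` is assumed to be where NLTK looks for this package.

```python
import logging

logger = logging.getLogger(__name__)


def ensure_nltk_data(resource="tokenizers/punkt_tab", package="punkt_tab",
                     find=None, download=None):
    """Download an NLTK data package only if it is not already installed.

    Returns True if a download was performed, False if the data was found.
    `find` and `download` are hypothetical injection points for testing;
    by default the real NLTK functions are used.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource)  # raises LookupError when the data is missing
        return False
    except LookupError:
        logger.warning("NLTK data %s not found, downloading it now", package)
        if not download(package):  # nltk.download returns False on failure
            raise RuntimeError(f"Could not download NLTK data package {package}")
        return True
```

Calling such a helper from the analyzer constructor would make the first instantiation trigger the download and leave later ones as a cheap lookup.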

In the Dockerfile, this data download step is retained, so the Docker image is ready to use without any download at runtime (otherwise the data would be re-downloaded every time Annif is used in a new container, because the NLTK data storage is not mounted to a volume).

The CI/CD pipeline also still has this explicit download step.

BTW, since this is being modified anyway, it would be a good time to consider changing the location of the NLTK data storage, which is discussed here. However, I think the current ~/nltk_data/ is not bad.

@juhoinkinen juhoinkinen added this to the 1.2 milestone Sep 19, 2024

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.63%. Comparing base (53f16b1) to head (45be777).
Report is 38 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| annif/analyzer/analyzer.py | 91.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #803      +/-   ##
==========================================
- Coverage   99.65%   99.63%   -0.02%     
==========================================
  Files          93       93              
  Lines        6889     6909      +20     
==========================================
+ Hits         6865     6884      +19     
- Misses         24       25       +1     


@juhoinkinen

@CodiumAI-Agent /review

@CodiumAI-Agent

CodiumAI-Agent commented Sep 19, 2024

PR Reviewer Guide 🔍

(Review updated until commit f01d351)

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Key issues to review

Error Handling
The error handling for missing NLTK data in the __init__ method could be improved by re-raising the exception after logging, to avoid silent failures in environments where automatic downloads are disabled or fail.

Dependency Change
The change from keras.backend to keras.ops for the mean operation needs verification to ensure compatibility and expected behavior across different Keras versions.

@CodiumAI-Agent

Persistent review updated to latest commit c047aea

@CodiumAI-Agent

Persistent review updated to latest commit f01d351

@osma osma changed the base branch from main to update-dependencies-v1.2 September 23, 2024 10:39

sonarcloud bot commented Sep 24, 2024

@juhoinkinen

One line is missing test coverage, but that line "should never be run", and the logic involved is very simple.


@osma osma left a comment

LGTM

Base automatically changed from update-dependencies-v1.2 to main September 30, 2024 08:30
@juhoinkinen juhoinkinen marked this pull request as ready for review September 30, 2024 08:31
@juhoinkinen juhoinkinen merged commit 20ae1e4 into main Sep 30, 2024
12 of 17 checks passed
@juhoinkinen juhoinkinen deleted the automate-nltkdata-download branch September 30, 2024 08:32