Automate Article Collection and README Updates from Blog Source #1208

sikehish · 2024-10-26T22:24:43Z

This pull request introduces a script for fetching articles from a specified URL(https://blog.dagworks.io/archive), processing the data, and updating the README file with the fetched articles. It addresses issue #1046.

Changes

Added a script to fetch articles using Selenium and BeautifulSoup.
Implemented a function to get a cutoff date from user input.
Included functionality to update a README file with newly fetched articles.
Added error handling for date input.
Included a requirements file for dependencies.

How I tested this

Verified that the script correctly fetches articles from the specified URL and logs the output.
Tested the cutoff date input to ensure it correctly formats and validates the date.
Confirmed that the README file updates with the new articles correctly and maintains the existing content.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

ellipsis-dev

❌ Changes requested. Reviewed everything up to 34f80c6 in 18 seconds

More details

Looked at 166 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 6 drafted comments based on config settings.

1. scripts/update_blogs_in_learning_resources.py:20

Draft comment:
Consider using an HTTP library like requests instead of Selenium for fetching articles if JavaScript execution is not required. This can improve performance and reduce resource usage.
Reason this comment was not posted:
Confidence changes required: 50%
The script uses Selenium to scrape a webpage, which can be resource-intensive and slow. It would be more efficient to use an HTTP library like requests if the page content can be accessed without JavaScript execution.

2. scripts/update_blogs_in_learning_resources.py:110

Draft comment:
Consider creating a backup of the README file before modifying it to prevent data loss in case of an error.
Reason this comment was not posted:
Confidence changes required: 50%
The script directly modifies the README file without creating a backup. It's a good practice to create a backup before making changes to important files.

3. scripts/update_blogs_in_learning_resources.py:45

Draft comment:
Avoid using hardcoded blog links if they can be fetched dynamically. This ensures the script always has the most up-to-date information.
Reason this comment was not posted:
Confidence changes required: 50%
The script uses hardcoded blog links which might not be necessary if the links can be fetched dynamically. This could lead to outdated information if the hardcoded links change.

4. scripts/update_blogs_in_learning_resources.py:20

Draft comment:
The fetch_articles function is handling both fetching and processing articles. Consider separating the fetching logic from the processing logic to adhere to the Single Responsibility Principle.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is suggesting a refactor to improve adherence to the Single Responsibility Principle, which is a good practice. The function indeed handles two distinct tasks: fetching data and processing it. Separating these tasks could make the code cleaner and easier to maintain. This aligns with the rules for suggesting code quality refactors.
The function is not overly complex, and the current implementation might be sufficient for its purpose. The suggested refactor could be seen as unnecessary if the function's current form is clear and maintainable.
While the function is not overly complex, adhering to the Single Responsibility Principle is generally beneficial for future maintenance and scalability. The suggestion is reasonable and aligns with best practices.
The comment is valid as it suggests a reasonable refactor to adhere to the Single Responsibility Principle. It should be kept as it provides actionable advice for improving code quality.

5. scripts/update_blogs_in_learning_resources.py:77

Draft comment:
Ensure function names follow a consistent naming pattern. For example, use get_cutoff_date consistently with other functions like fetch_articles and update_readme.
Reason this comment was not posted:
Comment did not seem useful.

6. scripts/update_blogs_in_learning_resources.py:1

Draft comment:
Consider adding this script to the Sphinx documentation under docs/ to provide users with guidance on its purpose and usage.
Reason this comment was not posted:
Confidence changes required: 50%
The script should be documented in the Sphinx documentation under docs/ to help users understand its purpose and usage.

Workflow ID: wflow_xRPaLBAeGbNhfbbM

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2024-10-26T22:25:12Z

scripts/update_blogs_in_learning_resources.py

+        return datetime.strptime(date_str, "%Y-%m-%d").date()
+    except ValueError:
+        print("Invalid date format. Please use YYYY-MM-DD.")
+        return get_cutoff_date()


Using recursion for input validation can lead to a stack overflow. Consider using a loop to repeatedly prompt for input until a valid date is entered.

scripts/update_blogs_in_learning_resources.py

skrawcz · 2024-10-26T22:46:18Z

Awesome, looking forward to trying this tomorrow / Monday!

Otherwise one request -- could you include a "print to standard out" option please?

sikehish · 2024-10-27T05:48:12Z

Awesome, looking forward to trying this tomorrow / Monday!

Otherwise one request -- could you include a "print to standard out" option please?

Hi @skrawcz. I've added that option. I've also added a docstring in the python script file so that it would be easier for the users to get started with it.

skrawcz · 2024-10-28T04:51:32Z

@sikehish you need to:

squash these changes down to a single commit (this will make rebasing simpler)
rebase.

Then I can merge this. Otherwise tested locally and it works, thank you!

sikehish · 2024-10-29T06:40:09Z

Hi @skrawcz
I tried squashing and rebasing all commits...Is this alright?
I've created a backup branch where I've cherry picked all the commits relevant to this PR(the 6-7 commits relevant to the issue). I can create another PR from the other branch and close this if this looks messed up.

author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182901 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182885 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182883 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182877 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182862 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182842 +0530 parent 7b9f52a author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100 committer sikehish <hisham0502@gmail.com> 1730182748 +0530 Update materialization.rst Fix mutate docstring Docstring raised Pytest escape error due to /* so we wrap it in quotes. Improve pipe_output first node naming Current name convention is very prone to name clashes with user naming. We assign the same naming convention using namespace.raw to indicate the first node. Provide dockerx setup for docker builds (DAGWorks-Inc#1194) This change adds a script for multi-platform docker builds as well as a github workflow to automatically build them when a new sf-hamilton-ui version has been published. Squashed commits: * Created buildx_and_push.sh script in ui directory to create multi-platform docker builds * Fixed content related issue in buildx_and_push.sh * buildx_and_push.sh: Added functionality to fecth the latest version from PyPi * buildx_and_push.sh: Added check_buildx_installed * buildx_and_push.sh: Added check_buildx_installed * buildx_and_push.sh: Enhanced error handling(checking if jq exists, curl response handling and docker buildx error handling) * Adding build args to buildx_and_push.sh * buildx_and_push.sh: Changes made to test a new workflow * Created a new workflow: hamilton-ui-build-and-push * buildx_and_push.sh: echo statement added to debug * buildx_and_push.sh: cd'ing to the directory(ui) where the shell script is located to prevent context related errors in workflow. * hamilton-ui-build-and-push.yml: Added Docker Hub login step * buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork. * hamilton-ui-build-and-push.yml: Replaced previous version from cache with version tag from Dockerhub image. * buildx_and_push.sh: Changed dockerhub username for testing * hamilton-ui-build-and-push.yml: Minor change in the docker registry URL(version) * hamilton-ui-build-and-push.yml: Minor change in the "Fetch highest version from Docker Hub" step's shell script * hamilton-ui-build-and-push.yml: Replacing deprecated set-output with GITHUB_OUTPUT * hamilton-ui-build-and-push.yml: Conditional execution of steps implemented * Undid change in dockerhub username * Update ui/buildx_and_push.sh * Update ui/buildx_and_push.sh * chore: fix pre-commit whitespace issues --------- Co-authored-by: Stefan Krawczyk <stefan@dagworks.io> Fix `keep_dot` propagation in `Driver` display functions Bumps hamilton version from 1.81.0 to 1.81.1 fix: caching `SQLiteMetadataStore.get_run_ids()` (DAGWorks-Inc#1205) * fixed .get_run_ids() and standardized .get_run() + tests * fixed docstrings formatting errors --------- Co-authored-by: zilto <tjean@DESKTOP-V6JDCS2> Bumps hamilton version from 1.81.1 to 1.81.2 buildx_and_push.sh: Changes made to test a new workflow buildx_and_push.sh: echo statement added to debug buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork. buildx_and_push.sh: Changed dockerhub username for testing Undid change in dockerhub username Created a new update external blogs script which updates External Blogs section in learning_resources.md with the latest blogs(with a date cutoff) Added docstring in update_blogs_in_learning_resources.py chore: fix pre-commit whitespace issues update_blogs_in_learning_resources.py: Added a "print to standard out" option (using --print flag) update_blogs_in_learning_resources.py: Argument parser with expaned help text update_blogs_in_learning_resources.py: Expanded docstring Fixes whitespace and small docs typo buildx_and_push.sh: Changes made to test a new workflow buildx_and_push.sh: echo statement added to debug hamilton-ui-build-and-push.yml: Added Docker Hub login step buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork. buildx_and_push.sh: Changed dockerhub username for testing hamilton-ui-build-and-push.yml: Minor change in the "Fetch highest version from Docker Hub" step's shell script hamilton-ui-build-and-push.yml: Replacing deprecated set-output with GITHUB_OUTPUT chore: fix pre-commit whitespace issues Update ui/buildx_and_push.sh Update ui/buildx_and_push.sh Fixes whitespace and small docs typo

skrawcz · 2024-10-31T04:30:05Z

Thanks @sikehish !

ellipsis-dev bot reviewed Oct 26, 2024

View reviewed changes

sikehish mentioned this pull request Oct 26, 2024

Documentation: update references to external blogs on Hamilton #1046

Closed

skrawcz approved these changes Oct 28, 2024

View reviewed changes

skrawcz linked an issue Oct 28, 2024 that may be closed by this pull request

Documentation: update references to external blogs on Hamilton #1046

Closed

sikehish force-pushed the main2 branch from 6a33259 to 8a2870e Compare October 29, 2024 06:24

skrawcz force-pushed the main2 branch from 8a2870e to 63429cc Compare October 30, 2024 18:42

skrawcz merged commit fd28dc0 into DAGWorks-Inc:main Oct 31, 2024
24 checks passed

skrawcz added hacktoberfest Hacktoberfest issues hacktoberfest2024 hacktoberfest 2024 labels Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automate Article Collection and README Updates from Blog Source #1208

Automate Article Collection and README Updates from Blog Source #1208

sikehish commented Oct 26, 2024

ellipsis-dev bot left a comment

ellipsis-dev bot Oct 26, 2024

skrawcz commented Oct 26, 2024

sikehish commented Oct 27, 2024 •

edited

Loading

skrawcz commented Oct 28, 2024

sikehish commented Oct 29, 2024 •

edited

Loading

skrawcz commented Oct 31, 2024

Automate Article Collection and README Updates from Blog Source #1208

Automate Article Collection and README Updates from Blog Source #1208

Conversation

sikehish commented Oct 26, 2024

Changes

How I tested this

Checklist

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot Oct 26, 2024

Choose a reason for hiding this comment

skrawcz commented Oct 26, 2024

sikehish commented Oct 27, 2024 • edited Loading

skrawcz commented Oct 28, 2024

sikehish commented Oct 29, 2024 • edited Loading

skrawcz commented Oct 31, 2024

sikehish commented Oct 27, 2024 •

edited

Loading

sikehish commented Oct 29, 2024 •

edited

Loading