Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate Article Collection and README Updates from Blog Source #1208

Merged
merged 1 commit into from
Oct 31, 2024

Conversation

sikehish
Copy link
Contributor

This pull request introduces a script for fetching articles from a specified URL(https://blog.dagworks.io/archive), processing the data, and updating the README file with the fetched articles. It addresses issue #1046.

Changes

  • Added a script to fetch articles using Selenium and BeautifulSoup.
  • Implemented a function to get a cutoff date from user input.
  • Included functionality to update a README file with newly fetched articles.
  • Added error handling for date input.
  • Included a requirements file for dependencies.

How I tested this

  • Verified that the script correctly fetches articles from the specified URL and logs the output.
  • Tested the cutoff date input to ensure it correctly formats and validates the date.
  • Confirmed that the README file updates with the new articles correctly and maintains the existing content.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 34f80c6 in 18 seconds

More details
  • Looked at 166 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. scripts/update_blogs_in_learning_resources.py:20
  • Draft comment:
    Consider using an HTTP library like requests instead of Selenium for fetching articles if JavaScript execution is not required. This can improve performance and reduce resource usage.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The script uses Selenium to scrape a webpage, which can be resource-intensive and slow. It would be more efficient to use an HTTP library like requests if the page content can be accessed without JavaScript execution.
2. scripts/update_blogs_in_learning_resources.py:110
  • Draft comment:
    Consider creating a backup of the README file before modifying it to prevent data loss in case of an error.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The script directly modifies the README file without creating a backup. It's a good practice to create a backup before making changes to important files.
3. scripts/update_blogs_in_learning_resources.py:45
  • Draft comment:
    Avoid using hardcoded blog links if they can be fetched dynamically. This ensures the script always has the most up-to-date information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The script uses hardcoded blog links which might not be necessary if the links can be fetched dynamically. This could lead to outdated information if the hardcoded links change.
4. scripts/update_blogs_in_learning_resources.py:20
  • Draft comment:
    The fetch_articles function is handling both fetching and processing articles. Consider separating the fetching logic from the processing logic to adhere to the Single Responsibility Principle.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is suggesting a refactor to improve adherence to the Single Responsibility Principle, which is a good practice. The function indeed handles two distinct tasks: fetching data and processing it. Separating these tasks could make the code cleaner and easier to maintain. This aligns with the rules for suggesting code quality refactors.
    The function is not overly complex, and the current implementation might be sufficient for its purpose. The suggested refactor could be seen as unnecessary if the function's current form is clear and maintainable.
    While the function is not overly complex, adhering to the Single Responsibility Principle is generally beneficial for future maintenance and scalability. The suggestion is reasonable and aligns with best practices.
    The comment is valid as it suggests a reasonable refactor to adhere to the Single Responsibility Principle. It should be kept as it provides actionable advice for improving code quality.
5. scripts/update_blogs_in_learning_resources.py:77
  • Draft comment:
    Ensure function names follow a consistent naming pattern. For example, use get_cutoff_date consistently with other functions like fetch_articles and update_readme.
  • Reason this comment was not posted:
    Comment did not seem useful.
6. scripts/update_blogs_in_learning_resources.py:1
  • Draft comment:
    Consider adding this script to the Sphinx documentation under docs/ to provide users with guidance on its purpose and usage.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The script should be documented in the Sphinx documentation under docs/ to help users understand its purpose and usage.

Workflow ID: wflow_xRPaLBAeGbNhfbbM


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

return datetime.strptime(date_str, "%Y-%m-%d").date()
except ValueError:
print("Invalid date format. Please use YYYY-MM-DD.")
return get_cutoff_date()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using recursion for input validation can lead to a stack overflow. Consider using a loop to repeatedly prompt for input until a valid date is entered.

scripts/update_blogs_in_learning_resources.py Show resolved Hide resolved
@skrawcz
Copy link
Collaborator

skrawcz commented Oct 26, 2024

Awesome, looking forward to trying this tomorrow / Monday!

Otherwise one request -- could you include a "print to standard out" option please?

@sikehish
Copy link
Contributor Author

sikehish commented Oct 27, 2024

Awesome, looking forward to trying this tomorrow / Monday!

Otherwise one request -- could you include a "print to standard out" option please?

Hi @skrawcz. I've added that option. I've also added a docstring in the python script file so that it would be easier for the users to get started with it.

@skrawcz skrawcz linked an issue Oct 28, 2024 that may be closed by this pull request
@skrawcz
Copy link
Collaborator

skrawcz commented Oct 28, 2024

@sikehish you need to:

  1. squash these changes down to a single commit (this will make rebasing simpler)
  2. rebase.

Then I can merge this. Otherwise tested locally and it works, thank you!

@sikehish
Copy link
Contributor Author

sikehish commented Oct 29, 2024

Hi @skrawcz
I tried squashing and rebasing all commits...Is this alright?
I've created a backup branch where I've cherry picked all the commits relevant to this PR(the 6-7 commits relevant to the issue). I can create another PR from the other branch and close this if this looks messed up.

author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182901 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182885 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182883 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182877 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182862 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182842 +0530

parent 7b9f52a
author Dimitry Foures <dimitry.foures@gmail.com> 1729160952 +0100
committer sikehish <hisham0502@gmail.com> 1730182748 +0530

Update materialization.rst

Fix mutate docstring

Docstring raised Pytest escape error due to /* so we wrap it in quotes.

Improve pipe_output first node naming

Current name convention is very prone to name clashes with user naming.

We assign the same naming convention using namespace.raw to indicate the
first node.

Provide dockerx setup for docker builds (DAGWorks-Inc#1194)

This change adds a script for multi-platform docker builds as
well as a github workflow to automatically build them when a new sf-hamilton-ui version
has been published.

Squashed commits:

* Created buildx_and_push.sh script in ui directory to create multi-platform docker builds

* Fixed content related issue in buildx_and_push.sh

* buildx_and_push.sh: Added functionality to fecth the latest version from PyPi

* buildx_and_push.sh: Added check_buildx_installed

* buildx_and_push.sh: Added check_buildx_installed

* buildx_and_push.sh: Enhanced error handling(checking if jq exists, curl response handling and docker buildx error handling)

* Adding build args to buildx_and_push.sh

* buildx_and_push.sh: Changes made to test a new workflow

* Created a new workflow: hamilton-ui-build-and-push

* buildx_and_push.sh: echo statement added to debug

* buildx_and_push.sh: cd'ing to the directory(ui) where the shell script is located to prevent context related errors in workflow.

* hamilton-ui-build-and-push.yml: Added Docker Hub login step

* buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork.

* hamilton-ui-build-and-push.yml: Replaced previous version from cache with version tag from Dockerhub image.

* buildx_and_push.sh: Changed dockerhub username for testing

* hamilton-ui-build-and-push.yml: Minor change in the docker registry URL(version)

* hamilton-ui-build-and-push.yml: Minor change in the "Fetch highest version from Docker Hub" step's shell script

* hamilton-ui-build-and-push.yml: Replacing deprecated set-output with GITHUB_OUTPUT

* hamilton-ui-build-and-push.yml: Conditional execution of steps implemented

* Undid change in dockerhub username

* Update ui/buildx_and_push.sh

* Update ui/buildx_and_push.sh

* chore: fix pre-commit whitespace issues

---------

Co-authored-by: Stefan Krawczyk <stefan@dagworks.io>

Fix `keep_dot` propagation in `Driver` display functions

Bumps hamilton version from 1.81.0 to 1.81.1

fix: caching `SQLiteMetadataStore.get_run_ids()` (DAGWorks-Inc#1205)

* fixed .get_run_ids() and standardized .get_run() + tests

* fixed docstrings formatting errors

---------

Co-authored-by: zilto <tjean@DESKTOP-V6JDCS2>

Bumps hamilton version from 1.81.1 to 1.81.2

buildx_and_push.sh: Changes made to test a new workflow

buildx_and_push.sh: echo statement added to debug

buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork.

buildx_and_push.sh: Changed dockerhub username for testing

Undid change in dockerhub username

Created a new update external blogs script which updates External Blogs section in learning_resources.md with the latest blogs(with a date cutoff)

Added docstring in update_blogs_in_learning_resources.py

chore: fix pre-commit whitespace issues

update_blogs_in_learning_resources.py: Added a "print to standard out" option (using --print flag)

update_blogs_in_learning_resources.py: Argument parser with expaned help text

update_blogs_in_learning_resources.py: Expanded docstring

Fixes whitespace and small docs typo

buildx_and_push.sh: Changes made to test a new workflow

buildx_and_push.sh: echo statement added to debug

hamilton-ui-build-and-push.yml: Added Docker Hub login step

buildx_and_push.sh: added dagworks dockerhub username; workflow worked on the fork.

buildx_and_push.sh: Changed dockerhub username for testing

hamilton-ui-build-and-push.yml: Minor change in the "Fetch highest version from Docker Hub" step's shell script

hamilton-ui-build-and-push.yml: Replacing deprecated set-output with GITHUB_OUTPUT

chore: fix pre-commit whitespace issues

Update ui/buildx_and_push.sh

Update ui/buildx_and_push.sh

Fixes whitespace and small docs typo
@skrawcz skrawcz merged commit fd28dc0 into DAGWorks-Inc:main Oct 31, 2024
24 checks passed
@skrawcz skrawcz added hacktoberfest Hacktoberfest issues hacktoberfest2024 hacktoberfest 2024 labels Oct 31, 2024
@skrawcz
Copy link
Collaborator

skrawcz commented Oct 31, 2024

Thanks @sikehish !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest Hacktoberfest issues hacktoberfest2024 hacktoberfest 2024
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Documentation: update references to external blogs on Hamilton
3 participants