Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scripts/cron_watcher.py #6366

Merged
merged 2 commits into from
Apr 21, 2022
Merged

Conversation

cclauss
Copy link
Contributor

@cclauss cclauss commented Apr 2, 2022

A script that checks to see that our cron jobs have competed their respective tasks.

Currently it checks to see if the ol-dump and ol-cdumps for the previous month are stored at

Long term, the script should cover more cron jobs:
Daily Cron-audit task (Python) sentry (who watches the watchers)
If dump and cdump for last YYYY-MM were not uploaded to archive.org
Or if sitemaps updated for this YYYY-MM were not on transferred to ol-www
Or if partner dumps for this YYYY-MM were not uploaded to archive.org
Or if there have been no imports in last 48 hours (i.e. 2 days)
Or if DD>17 for YYYY-MM and bwb batchname doesn’t exist in import psql table
Then send daily email with failures only or slack failures

Sentry cron-jobs:

https://sentry.archive.org/organizations/ia-ux/projects/ol-cron-jobs

Technical

Testing

  1. Run the script. Should print True and return 0
  2. Pass in garbage for yyyy_mm. Should print False and return 1.
  3. Change the URL. Should print False and return 1.

Screenshot

Stakeholders

@cclauss cclauss requested review from cdrini and mekarpeles April 2, 2022 14:30
@cclauss cclauss added this to the Next (proposed) milestone Apr 2, 2022
@jimman2003
Copy link
Contributor

jimman2003 commented Apr 2, 2022

I believe that we should use the ia library to fetch the newest dump, eliminating the need for html scraping

@jimman2003
Copy link
Contributor

jimman2003 commented Apr 2, 2022

I suggest we rewrite find_last_months_dumps_on_ia as

from internetarchive import search_items
from datetime import date, timedelta
last_day_of_last_month = date.today().replace(day=1) - timedelta(days=1)
yyyy_mm = f"{last_day_of_last_month:%Y-%m}"
def find_last_months_dumps_on_ia(yyyy_mm: str = yyyy_mm) -> bool:
    """
    Return True if both ol_dump_yyyy and ol_cdump_yyyy files have been saved on the
    Internet Archive.
    """
    prefixes = (f"ol_dump_{yyyy_mm}", f"ol_cdump_{yyyy_mm}")
    found=0
    for item in search_items("collection:ol_exports"):
        if item["identifier"].startswith(prefixes):
            found+=1
        if found >= 2:
            break
    return found>=2

@cclauss
Copy link
Contributor Author

cclauss commented Apr 2, 2022

Unfortunately, the current internetarchive is based on requests so it only supports old style synchronous calls. Given that we eventually want to add at least 4 more network tests, I was planning to use asyncio.

@jimman2003
Copy link
Contributor

ok, maybe we should talk to api directly then (with httpx)?

@mekarpeles
Copy link
Member

I think these calls should be negligible in cost and using the IA tool seems right

@mekarpeles mekarpeles self-assigned this Apr 4, 2022
@mekarpeles mekarpeles added the Priority: 2 Important, as time permits. [managed] label Apr 4, 2022
@cclauss
Copy link
Contributor Author

cclauss commented Apr 7, 2022

&& || ; in cron?

@mekarpeles mekarpeles merged commit d552919 into internetarchive:master Apr 21, 2022
@cclauss cclauss deleted the cronwatcher branch April 22, 2022 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Module: Data dumps Priority: 2 Important, as time permits. [managed]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants