Feature: add thoth-archiving-network collection to OL import list #9413

Conversation

scottbarnes
Collaborator

@scottbarnes scottbarnes commented Jun 8, 2024

Closes #9328

This changes the logic for the OL imports from IA to include items from the thoth-archiving-network collection.

Technical

thoth-archiving-network items lack:

  • a MARC record;
  • a repub_state:* field;
  • a scanningcenter:* field; and
  • a scandate:* field.

Additionally, the indexdate parameter seems to stop thoth-archiving-network items from showing up in the results.

The updated query will match items that are in the thoth-archiving-network collection and meet the remaining requirements, without requiring a value for scanningcenter, scandate, or scanner (the latter I left off just to be safe). It also does not require an indexdate, whose exact purpose I confess I could not determine.

The query might have become a bit too clever in its string formatting. As such, rewriting it to be a bit more DRY may make it easier to read.

The fully constructed query, putting aside the changing dates, would look like:

(
    collection:thoth-archiving-network
    OR
    (
        (repub_state:4 OR repub_state:19 OR repub_state:20 OR repub_state:22)
        AND scanningcenter:* AND scanner:* AND scandate:* AND format:marc AND indexdate:2024-06-07*
    )
)
AND mediatype:texts AND !collection:opensource AND !collection:additional_collections AND !collection:litigationworks AND !noindex:true AND !is_dark:true AND format:pdf AND addeddate:[2024-04-08 TO 2024-06-08]
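One way to make the string formatting easier to read is to build the query from named pieces. The following is only a sketch of that idea, not the code in this PR: `build_import_query`, `REPUB_STATES`, and `EXCLUDED_COLLECTIONS` are hypothetical names, and the three dates are passed in explicitly rather than computed.

```python
from datetime import date

# Hypothetical constants for illustration; not names from the Open Library code.
REPUB_STATES = (4, 19, 20, 22)
EXCLUDED_COLLECTIONS = ("opensource", "additional_collections", "litigationworks")


def build_import_query(index_day: date, added_start: date, added_end: date) -> str:
    """Assemble the query shown above from its two branches and shared filters."""
    repub = " OR ".join(f"repub_state:{state}" for state in REPUB_STATES)

    # Branch for regular scanned items: these still need the scan fields,
    # a MARC record, and an indexdate.
    scanned_branch = (
        f"({repub}) AND scanningcenter:* AND scanner:* AND scandate:* "
        f"AND format:marc AND indexdate:{index_day.isoformat()}*"
    )

    # Filters that apply to both the Thoth branch and the scanned branch.
    shared = ["mediatype:texts"]
    shared += [f"!collection:{name}" for name in EXCLUDED_COLLECTIONS]
    shared += [
        "!noindex:true",
        "!is_dark:true",
        "format:pdf",
        f"addeddate:[{added_start.isoformat()} TO {added_end.isoformat()}]",
    ]

    return (
        f"(collection:thoth-archiving-network OR ({scanned_branch})) AND "
        + " AND ".join(shared)
    )
```

Calling `build_import_query(date(2024, 6, 7), date(2024, 4, 8), date(2024, 6, 8))` yields the query above on a single line.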

Testing

The original query would be something like this, which returns 974 docs:
https://archive.org/advancedsearch.php?q=mediatype%3Atexts+AND+%28repub_state%3A4+OR+repub_state%3A19+OR+repub_state%3A20+OR+repub_state%3A22%29+AND+scanningcenter%3A*+AND+scanner%3A*+AND+scandate%3A*+AND+%21collection%3Aopensource+AND+%21collection%3Aadditional_collections+AND+%21collection%3Alitigationworks+AND+%21noindex%3Atrue+AND+%21is_dark%3Atrue+AND+format%3Apdf+AND+indexdate%3A2024-06-07*+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D+AND+format%3Amarc&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=5000&page=1&output=json&callback=callback&save=yes

With the change, the query returns 995 results and becomes:
https://archive.org/advancedsearch.php?q=%28collection%3Athoth-archiving-network+OR+%28%28repub_state%3A4+OR+repub_state%3A19+OR+repub_state%3A20+OR+repub_state%3A22%29+AND+scanningcenter%3A%2A+AND+scanner%3A%2A+AND+scandate%3A%2A+AND+format%3Amarc+AND+indexdate%3A2024-06-07%2A%29%29+AND+mediatype%3Atexts+AND+%21collection%3Aopensource+AND+%21collection%3Aadditional_collections+AND+%21collection%3Alitigationworks+AND+%21noindex%3Atrue+AND+%21is_dark%3Atrue+AND+format%3Apdf+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D&fl=identifier%2Cformat&service=metadata__unlimited&rows=100000&output=json&callback=callback&save=yes

Further, there are 21 results searching for collection:thoth-archiving-network for the relevant dates:
https://archive.org/advancedsearch.php?q=collection%3Athoth-archiving-network+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D+format%3Apdf&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes

974 + 21 = 995. Searching the updated query results shows the thoth identifiers in there, e.g. 6f0a56be-6b4f-43c7-a775-4167cbea9504 and b0f368a7-40d5-45f1-9f01-45a84f14da85.
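For anyone wanting to reproduce these searches, the test URLs can be generated rather than hand-encoded. This is a minimal sketch using only the standard library; `advancedsearch_url` is a hypothetical helper, and it sets only the `q`, `fl[]`, `rows`, and `output` parameters that appear in the URLs above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs in this PR description.
ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def advancedsearch_url(query: str, rows: int = 5000) -> str:
    """Return an advancedsearch.php URL asking for identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),
        ("output", "json"),
    ]
    # urlencode percent-encodes ':' and '*' and turns spaces into '+'.
    return f"{ADVANCED_SEARCH}?{urlencode(params)}"


print(advancedsearch_url("collection:thoth-archiving-network AND format:pdf", rows=50))
```

The helper does no fetching, so the counts reported above still have to be read from the responses of the real endpoint.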

As an additional observation, it's not clear to me that scanningcenter:*, scanner:*, or scandate:* change the non-Thoth results.

Screenshot

Stakeholders

@mekarpeles

@mekarpeles added the "Priority: 1" and "Needs: Submitter Input" labels Jun 10, 2024
scottbarnes and others added 2 commits June 10, 2024 16:55
This changes the logic for the OL imports from IA to include items from
the thoth-archiving-network.

Co-authored-by: Drini Cami <cdrini@gmail.com>
Co-authored-by: Mek <michael.karpeles@gmail.com>
@scottbarnes scottbarnes force-pushed the feature/9328/add-thoth-archiving-network-to-ol-import-list branch from a2c4cf4 to 6f26fbb Compare June 10, 2024 23:55
@github-actions bot removed the "Needs: Submitter Input" label Jun 10, 2024
@scottbarnes scottbarnes force-pushed the feature/9328/add-thoth-archiving-network-to-ol-import-list branch from 63851bc to 42c26cb Compare June 11, 2024 00:00
@scottbarnes
Collaborator Author

@mekarpeles, the substantive changes can be seen at 6f26fbb; the most recent commit is merely one for linting, which obscured the substantive changes.

I double-checked the code by running the old and new queries against two days: one without any Thoth imports and one with.

  • June 6th (no Thoth imports): the old and new queries both return 4,827 results.
  • June 5th (three Thoth imports): the old query returns 106 results and the new query returns 109, the difference being that the three expected Thoth imports are now included.

19d21671-246e-4387-9ac7-7e1a938240f9 is a Thoth ocaid that was imported on June 5th.
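The comparison above amounts to a set difference between the identifiers returned by the two queries. The following is a rough sketch of that check, assuming the identifiers have already been extracted from each response; `compare_results` is a hypothetical helper and the identifiers in the usage lines are made-up placeholders, not real query output.

```python
def compare_results(old_ids, new_ids):
    """Return (added, removed): identifiers only in the new or only in the old results."""
    old_set, new_set = set(old_ids), set(new_ids)
    return new_set - old_set, old_set - new_set


# Placeholder identifiers for illustration only.
old = ["scanned-item-1", "scanned-item-2"]
new = ["scanned-item-1", "scanned-item-2", "thoth-item-1"]

added, removed = compare_results(old, new)
print(sorted(added))    # identifiers the new query picks up
print(sorted(removed))  # should be empty if nothing was lost
```

On a day with no Thoth imports both sets should be empty; on a day with Thoth imports, `added` should contain only the Thoth identifiers and `removed` should stay empty.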

@mekarpeles mekarpeles merged commit 7272e09 into internetarchive:master Jun 11, 2024
4 checks passed
@scottbarnes scottbarnes deleted the feature/9328/add-thoth-archiving-network-to-ol-import-list branch June 11, 2024 19:41
Labels
Priority: 1 Do this week, receiving emails, time sensitive, . [managed]
Development

Successfully merging this pull request may close these issues.

Consider adding OA collection thoth-archiving-network to OL Import List
2 participants