Feature: add thoth-archiving-network collection to OL import list #9413
Closes #9328
This changes the logic for the OL imports from IA to include items from the thoth-archiving-network collection.
Technical
thoth-archiving-network items lack a repub_state field, a scanningcenter field, and a scandate field. Additionally, the indexdate parameter seems to stop thoth-archiving-network items from showing up in the results.
The updated query will match things that are in the thoth-archiving-network collection and meet the requirements aside from having a value for scanningcenter, scandate, or scanner (the latter I left off just to be safe). It also does not require an indexdate, which I confess I could not determine the exact purpose of.
The query might have become a bit too clever in its string formatting. As such, rewriting it to be a bit more DRY may make it easier to read.
The fully constructed query, putting aside the changing dates, would look like:
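As a rough sketch, and not the code in this PR, the query can be assembled from named parts. The helper name build_query and the two constants are hypothetical; the clauses themselves are taken from the updated URL in the Testing section, and the dates are passed in as parameters.

```python
# Hypothetical names for illustration; the PR's actual code may be organised differently.
REPUB_STATES = (4, 19, 20, 22)
EXCLUDED_COLLECTIONS = ("opensource", "additional_collections", "litigationworks")


def build_query(index_day: str, start: str, end: str) -> str:
    """Assemble the search query string from named clauses."""
    repub = " OR ".join(f"repub_state:{state}" for state in REPUB_STATES)

    # Clauses required only of regular scanned items (thoth items lack these fields).
    scanned = " AND ".join([
        f"({repub})",
        "scanningcenter:*",
        "scanner:*",
        "scandate:*",
        "format:marc",
        f"indexdate:{index_day}*",
    ])

    # Clauses every item must satisfy, whether it is a thoth item or not.
    common = ["mediatype:texts"]
    common += [f"!collection:{name}" for name in EXCLUDED_COLLECTIONS]
    common += [
        "!noindex:true",
        "!is_dark:true",
        "format:pdf",
        f"addeddate:[{start} TO {end}]",
    ]

    return (
        f"(collection:thoth-archiving-network OR ({scanned})) AND "
        + " AND ".join(common)
    )


if __name__ == "__main__":
    print(build_query("2024-06-07", "2024-04-08", "2024-06-08"))
```

With the dates used in the Testing section, the printed string should correspond to the URL-decoded q parameter of the updated query there.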
Testing
The original query would be something like this, which returns 974 docs:
https://archive.org/advancedsearch.php?q=mediatype%3Atexts+AND+%28repub_state%3A4+OR+repub_state%3A19+OR+repub_state%3A20+OR+repub_state%3A22%29+AND+scanningcenter%3A*+AND+scanner%3A*+AND+scandate%3A*+AND+%21collection%3Aopensource+AND+%21collection%3Aadditional_collections+AND+%21collection%3Alitigationworks+AND+%21noindex%3Atrue+AND+%21is_dark%3Atrue+AND+format%3Apdf+AND+indexdate%3A2024-06-07*+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D+AND+format%3Amarc&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=5000&page=1&output=json&callback=callback&save=yes
With the change, the query becomes the following, which returns 995 results:
https://archive.org/advancedsearch.php?q=%28collection%3Athoth-archiving-network+OR+%28%28repub_state%3A4+OR+repub_state%3A19+OR+repub_state%3A20+OR+repub_state%3A22%29+AND+scanningcenter%3A%2A+AND+scanner%3A%2A+AND+scandate%3A%2A+AND+format%3Amarc+AND+indexdate%3A2024-06-07%2A%29%29+AND+mediatype%3Atexts+AND+%21collection%3Aopensource+AND+%21collection%3Aadditional_collections+AND+%21collection%3Alitigationworks+AND+%21noindex%3Atrue+AND+%21is_dark%3Atrue+AND+format%3Apdf+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D&fl=identifier%2Cformat&service=metadata__unlimited&rows=100000&output=json&callback=callback&save=yes
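One way to confirm which items the change adds is to diff the identifiers returned by the two queries above. The sketch below is a minimal, offline illustration: it assumes each response has been saved as text in the JSONP form produced by output=json&callback=callback, and that the body has the Solr-style shape response.docs[].identifier. The function names are hypothetical.

```python
import json


def parse_jsonp(text: str) -> dict:
    """Strip a JSONP wrapper such as callback({...}) and parse the JSON inside."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


def identifiers(payload: dict) -> set[str]:
    """Collect the identifier of every doc (assumes response.docs[].identifier)."""
    return {doc["identifier"] for doc in payload["response"]["docs"]}


def added_by_change(original_text: str, updated_text: str) -> set[str]:
    """Identifiers present in the updated results but not in the original ones."""
    return identifiers(parse_jsonp(updated_text)) - identifiers(parse_jsonp(original_text))


if __name__ == "__main__":
    # Made-up payloads purely to show the call; real input would be the saved responses.
    original = 'callback({"response": {"docs": [{"identifier": "example-item-1"}]}})'
    updated = (
        'callback({"response": {"docs": ['
        '{"identifier": "example-item-1"}, {"identifier": "example-item-2"}]}})'
    )
    print(added_by_change(original, updated))
```

If the change behaves as intended, the difference should contain only items from the thoth-archiving-network collection.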
Further, there are 21 results searching for collection:thoth-archiving-network for the relevant dates:
https://archive.org/advancedsearch.php?q=collection%3Athoth-archiving-network+AND+addeddate%3A%5B2024-04-08+TO+2024-06-08%5D+format%3Apdf&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes
974 + 21 = 995. Searching the updated query results shows the thoth identifiers in there, e.g. 6f0a56be-6b4f-43c7-a775-4167cbea9504 and b0f368a7-40d5-45f1-9f01-45a84f14da85.
As an additional observation, it's not clear to me that scanningcenter:*, scanner:*, or scandate:* changes the non-thoth results.
Screenshot
Stakeholders
@mekarpeles