Clean up records with future publication dates #4568

Open
SaravgiYash opened this issue Feb 10, 2021 · 19 comments
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Can it be closed? Data Cleanup Lead: @scottbarnes Issues overseen by Scott (Community Imports) Needs: Feedback A proposed feature or bug resolution needs community feedback prior to forging ahead. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@SaravgiYash
Contributor

SaravgiYash commented Feb 10, 2021

Evidence / Screenshot (if possible)

Many works have the wrong year of publication (e.g. 9999, 2049, 2040).

See: https://openlibrary.org/search?q=publish_year%3A%5B2025+TO+*%5D

Relevant url?

https://openlibrary.org/search?q=mark&mode=everything&sort=new
https://openlibrary.org/works/OL21132031W/Classical_Music_Picture_Book?edition=
https://openlibrary.org/works/OL21486637W/Making_Sense_of_Politics?edition=

Details

  • Logged in (Y/N)? Y
  • Browser type/version? Chrome
  • Operating system? Windows 10
  • Environment (prod/dev/local)? prod

Proposal

Use first_publish_year:[2025 TO *] in solr, e.g. https://openlibrary.org/search.json?q=first_publish_year%3A%5B2025+TO+*%5D, to find future dates

  • Run a json solr query to get all ~12k records with future dates
  • Write a small script to fetch and update these records by removing the future dates
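
A minimal sketch of the first step, assuming the public `search.json` endpoint accepts the same `first_publish_year:[2025 TO *]` query shown above. The `fields`, `page`, and `limit` parameters and the helper names `future_search_url` and `work_keys` are illustrative choices, not something specified in this issue. The functions only build the URL and read an already-decoded response, so fetching is left to whatever HTTP client the script uses.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://openlibrary.org/search.json"


def future_search_url(first_year: int, page: int = 1, limit: int = 100) -> str:
    """Build a search.json URL for works first published in first_year or later."""
    params = {
        "q": f"first_publish_year:[{first_year} TO *]",
        "fields": "key,first_publish_year",  # assumption: only these fields are needed
        "page": page,
        "limit": limit,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def work_keys(response: dict) -> list:
    """Pull the work keys out of one decoded search.json response page."""
    return [doc["key"] for doc in response.get("docs", [])]
```

Paging would then mean calling `future_search_url(2025, page=n)` for increasing `n` until a page comes back with no `docs`.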

Stakeholders

@SaravgiYash SaravgiYash added Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Bug Something isn't working. [managed] labels Feb 10, 2021
@Bhavna777

I would like to work on this issue.

@Bhavna777

Can I start?

@SaravgiYash
Contributor Author

I would like to work on this issue.

Okay, but it would be better to take one issue at a time: file a PR for the issue you are already working on, and then start on this one.

@Bhavna777

OK, thank you.

@mekarpeles mekarpeles added Lead: @seabelis Issuses overseen by Lisa (Staff: Lead Community Librarian) [managed] and removed Needs: Lead labels Feb 17, 2021
@Bhavna777

@Yashs911 Can you please help me solve this issue?
I'm new to Open Library, so I could not find the file I should change 😁

@SaravgiYash
Contributor Author

@Bhavna777 Actually, I don't know the root cause, so I don't know where we should start. Based on internetarchive/openlibrary-librarians#1 and some other issues linked to this one, I suggest that we hide publication years >= 2021 for the time being.

@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Mar 2, 2021
@seabelis
Collaborator

seabelis commented Mar 2, 2021

Added to librarians repo for manual correction. internetarchive/openlibrary-librarians#53

@SaravgiYash
Contributor Author

SaravgiYash commented Mar 3, 2021

@seabelis Actually, this issue is not limited to https://openlibrary.org/search?q=mark&mode=everything&sort=new: many books on OL have the wrong publication year, so I was wondering whether it would be possible to hide publication years > 2021.

@Bhavna777

@seabelis Actually, this issue is not limited to https://openlibrary.org/search?q=mark&mode=everything&sort=new: many books on OL have the wrong publication year, so I was wondering whether it would be possible to hide publication years > 2021.

But that will create a problem in the upcoming years.

@SaravgiYash
Contributor Author

But that will create a problem in the upcoming years.

By 2021 I meant that we can use a current-year function.
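
A minimal sketch of that idea, assuming the comparison is made against the year of today's date rather than a hard-coded 2021. The function name `is_future_year` is hypothetical, and the `today` parameter exists only so the check can be exercised with a fixed date.

```python
from datetime import date
from typing import Optional


def is_future_year(publish_year: int, today: Optional[date] = None) -> bool:
    """Return True when publish_year is later than the current calendar year."""
    today = today or date.today()
    return publish_year > today.year
```

With this, the cutoff moves forward on its own each year, which addresses the concern about the check going stale.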

@seabelis
Collaborator

seabelis commented Mar 4, 2021

I'm not the person to decide, but I'd prefer deleting the incorrect data to hiding it.

@mekarpeles
Member

@scottbarnes can you confirm whether this can be closed now re: 9999?

@mekarpeles mekarpeles added Needs: Feedback A proposed feature or bug resolution needs community feedback prior to forging ahead. [managed] Can it be closed? labels Oct 17, 2023
@mekarpeles mekarpeles changed the title from "Wrong Year of Publication" to "Clean up future publication dates" Oct 17, 2023
@mekarpeles
Member

mekarpeles commented Oct 17, 2023

I'm re-purposing this issue to clean up works that have future dates.

https://openlibrary.org/query.json?type=/type/edition&publish_date~=9999*&limit=1000

or first_publish_year:[2025 TO *] in solr, e.g.:

https://openlibrary.org/search?q=first_publish_year%3A%5B2025+TO+*%5D&mode=everything&sort=new

Proposal

  • Run a json solr query to get all ~12k records with future dates
  • Write a small script to fetch and update these records by removing the future dates
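
The "removing the future dates" step could look roughly like the sketch below. It assumes an edition is handled as a plain dict whose `publish_date` is a free-text string, and that any four-digit number in it is a year. The function name `strip_future_publish_date` is hypothetical, and saving the result back to Open Library is deliberately left out.

```python
import re
from typing import Optional

YEAR_RE = re.compile(r"\b(\d{4})\b")


def strip_future_publish_date(edition: dict, current_year: int) -> Optional[dict]:
    """Return a copy of edition without publish_date if it names a future year.

    Returns None when there is nothing to change, so callers can skip the save.
    """
    publish_date = edition.get("publish_date") or ""
    years = [int(y) for y in YEAR_RE.findall(publish_date)]
    if not years or max(years) <= current_year:
        return None
    updated = dict(edition)  # shallow copy; the input record is left untouched
    del updated["publish_date"]
    return updated
```

Returning a copy rather than mutating in place keeps the original record available for logging what was removed.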

@mekarpeles mekarpeles changed the title from "Clean up future publication dates" to "Clean up records with future publication dates" Oct 17, 2023
@mekarpeles mekarpeles added Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Priority: 2 Important, as time permits. [managed] and removed Lead: @seabelis Issuses overseen by Lisa (Staff: Lead Community Librarian) [managed] Can it be closed? Priority: 3 Issues that we can consider at our leisure. [managed] labels Oct 17, 2023
@mekarpeles mekarpeles added this to the Sprint 2023-10 milestone Oct 17, 2023
@mekarpeles mekarpeles added Data Cleanup Affects: Data Issues that affect book/author metadata or user/account data. [managed] labels Oct 17, 2023
@scottbarnes
Collaborator

It may be helpful to keep a record of items we've so modified in case we later want to go back and, for example, reimport them or otherwise modify them further, and this way it will be easy to identify the ones from which we've removed publish_date.
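
One simple way to keep such a record is to append a CSV row per modified edition, as sketched here. The two-column layout (edition key, removed `publish_date`) and the function name `log_removed_date` are assumptions for illustration only.

```python
import csv
from typing import TextIO


def log_removed_date(log: TextIO, edition_key: str, old_publish_date: str) -> None:
    """Append one row recording which edition lost which publish_date value."""
    csv.writer(log).writerow([edition_key, old_publish_date])
```

A cleanup script might open the file once with `open("removed_publish_dates.csv", "a", newline="")` (a hypothetical filename) and call `log_removed_date` just before saving each edit, so the old values can be recovered for a later reimport.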

@cdrini cdrini assigned cdrini, scottbarnes and hornc and unassigned cdrini and scottbarnes Oct 18, 2023
@cdrini
Collaborator

cdrini commented Oct 18, 2023

@hornc notes that he is planning on removing all the 9999 dates in a bulk process. I believe this would tackle the bulk of the problem; let's see.

@cdrini
Collaborator

cdrini commented Oct 18, 2023

There are about 5,868 editions with publish year 9999, and another 15,707 with publish years after 2025 but not 9999. Flipping through them it's unclear why exactly they have these weird dates and whether they should be deleted 😕 I think fixing the 9999 set is a good first stab. Would you be able to keep a list of the editions your script edits, and upload it to the issue? We might want to do further investigation on these editions later, and having a way to find them would be useful!
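
For reference, a split like the one above (9999 versus other years after 2025) can be sketched as follows. This is not how the counts in this comment were produced. It assumes free-text `publish_date` strings in which any four-digit number is a year, and the names `bucket` and `tally` are hypothetical.

```python
import re
from collections import Counter

YEAR_RE = re.compile(r"\b(\d{4})\b")


def bucket(publish_date: str, cutoff_year: int = 2025) -> str:
    """Classify a publish_date string as '9999', 'future', or 'ok'."""
    years = [int(y) for y in YEAR_RE.findall(publish_date)]
    if 9999 in years:
        return "9999"
    if years and max(years) > cutoff_year:
        return "future"
    return "ok"


def tally(publish_dates) -> Counter:
    """Count how many dates fall into each bucket."""
    return Counter(bucket(d) for d in publish_dates)
```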

@hornc
Collaborator

hornc commented Oct 18, 2023

One cause of the 9999 problem relates to MARC imports and the existing issue #2711. I started cleanup and noticed that a number of 9999 dates originate from Harvard MARC records where the 9999 is in the 008 field, but there is (often) a correct publication date in 260$c.

https://openlibrary.org/books/OL45340001M/%CA%BBAlimi_aman_jo_Islami_manshur?m=history

and

https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.12.20150123.full.mrc:583443956:436

I'll see if there is a way to easily add the correct dates as I go, and look at patching the MARC import hole.

See PR: #8448
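
To illustrate the fallback described above (this is a sketch, not the code in the PR): in MARC 21, Date 1 occupies character positions 07-10 of the 008 field. The sketch assumes the 9999 sits in Date 1 and that a usable year can be found as the first four-digit run in 260$c. All function names are hypothetical.

```python
import re
from typing import Optional


def date1_from_008(field_008: str) -> str:
    """Return Date 1, i.e. character positions 07-10 of the MARC 21 008 field."""
    return field_008[7:11]


def year_from_260c(subfield_c: str) -> Optional[str]:
    """Return the first four-digit run in 260$c, e.g. '[2010]' -> '2010'."""
    match = re.search(r"\d{4}", subfield_c)
    return match.group(0) if match else None


def pick_publish_year(field_008: str, subfield_260c: Optional[str]) -> Optional[str]:
    """Prefer 008 Date 1 unless it is 9999 or non-numeric, then fall back to 260$c."""
    date1 = date1_from_008(field_008)
    if date1.isdigit() and date1 != "9999":
        return date1
    if subfield_260c:
        return year_from_260c(subfield_260c)
    return None
```

Returning `None` when neither field yields a year leaves the date empty instead of importing a placeholder.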

@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] and removed Priority: 2 Important, as time permits. [managed] labels Nov 6, 2023
@hornc
Collaborator

hornc commented Nov 20, 2023

@mekarpeles I believe all the 9999 dates have been removed from Open Library.

@hornc
Collaborator

hornc commented Nov 21, 2023

@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] labels Mar 8, 2024
@hornc hornc removed their assignment Apr 22, 2024

7 participants