Re-architect openlibrary.solr / update_work for easier expansion #8618

Merged
merged 28 commits into from
Dec 14, 2023

Conversation

cdrini
Collaborator

@cdrini cdrini commented Dec 12, 2023

Re-architect openlibrary.solr.

Important

I highly recommend reviewing this one commit by commit. Many of the commits only move things around, so they should be smooth sailing.

  • Normalized the AbstractSolrBuilder classes: classes that use @property methods to convert db objects to solr objects.
  • Normalized the openlibrary.solr.updater.AbstractSolrUpdater classes, which fetch various adjacent data before passing it to the solr builders.
    • Not crazy about this one; I think it could potentially be dissolved. This was just the easiest stepping stone from where we were to something a little more organized.
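
To make the two layers above concrete, here is a minimal sketch of the pattern. Every name in it (`SketchSolrBuilder`, `SketchAuthorBuilder`, `SketchAuthorUpdater`, `fetch_work_count`) is hypothetical and chosen only for illustration. It is not the code from openlibrary.solr, just an assumption about the general shape: a builder exposes each solr field as a @property, and an updater gathers adjacent data before handing it to the builder.

```python
from typing import Any, Callable, Optional


class SketchSolrBuilder:
    """Hypothetical base class: every public @property becomes a solr field."""

    def build(self) -> dict[str, Any]:
        doc: dict[str, Any] = {}
        for name in dir(type(self)):
            if name.startswith("_"):
                continue
            # Only attributes declared as properties are treated as solr fields.
            if isinstance(getattr(type(self), name), property):
                value = getattr(self, name)
                # Skip empty values so the resulting document stays compact.
                if value not in (None, [], ""):
                    doc[name] = value
        return doc


class SketchAuthorBuilder(SketchSolrBuilder):
    """Hypothetical builder that converts an author db record to a solr doc."""

    def __init__(self, author: dict[str, Any], work_count: int):
        self._author = author
        self._work_count = work_count

    @property
    def key(self) -> str:
        return self._author["key"]

    @property
    def name(self) -> Optional[str]:
        return self._author.get("name")

    @property
    def alternate_names(self) -> list[str]:
        return self._author.get("alternate_names", [])

    @property
    def work_count(self) -> int:
        return self._work_count

    @property
    def type(self) -> str:
        return "author"


class SketchAuthorUpdater:
    """Hypothetical updater: fetches adjacent data, then delegates to the builder."""

    def __init__(self, fetch_work_count: Callable[[str], int]):
        # The fetcher is injected so the sketch runs without a database or solr.
        self._fetch_work_count = fetch_work_count

    def update_key(self, author: dict[str, Any]) -> dict[str, Any]:
        work_count = self._fetch_work_count(author["key"])
        doc = SketchAuthorBuilder(author, work_count).build()
        return {"add": {"doc": doc}}


updater = SketchAuthorUpdater(fetch_work_count=lambda key: 3)
request = updater.update_key({"key": "/authors/OL18319A", "name": "Mark Twain"})
print(request)
```

In this sketch, adding a new solr field only requires adding one more @property to the builder, which is one plausible reading of "easier expansion" in the title.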

Note:

  • The naming is a bit confusing here, since solr_builder is also used for doing full solr reindexes.

Technical

Possible risks:

  • No longer using ia_loaded_id from metadata
  • No longer sorting editions by publish year before processing
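
One way to reason about the second risk: a field derived from the whole set of editions can be computed without depending on input order. The sketch below illustrates that idea only; it is not the code in this PR, and `first_publish_year` is a hypothetical helper named after the field that appears in the sample output.

```python
from typing import Iterable, Optional


def first_publish_year(editions: Iterable[dict]) -> Optional[int]:
    """Return the earliest publish year across editions, regardless of order."""
    years = [
        year
        for edition in editions
        for year in edition.get("publish_year", [])
    ]
    # min() with a default avoids a ValueError when no edition has a year.
    return min(years, default=None)


editions = [
    {"key": "/books/OL24197475M", "publish_year": [1921]},
    {"key": "/books/OL22782947M", "publish_year": [1875]},
    {"key": "/books/OL7037695M", "publish_year": [1907]},
]

# The result is the same whichever order the editions arrive in.
assert first_publish_year(editions) == first_publish_year(reversed(editions))
```

Fields that pick "the first matching edition" instead of an aggregate would still be order-sensitive, so those are the ones worth checking during testing.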

Testing

  • ✅ Indexing single records works; see below
  • ✅ Making edits on localhost results in edits being available to search
PYTHONPATH=. python ./openlibrary/solr/update.py --ol-url 'localhost:8080' --ol-config './config/openlibrary.yml' --solr-next --update pprint '/authors/OL18319A'
{"add": {
    "doc": {
        "alternate_names": [
            "Mark TWAIN",
            "M. Twain",
            "TWAIN",
            "Twain",
            "Twain, Mark (pseud)",
            "Twain, Mark (Spirit)",
            "Twain, Mark, 1835-1910",
            "Mark (Samuel L. Clemens) Twain",
            "Samuel Langhorne Clemens (Mark Twain)",
            "Samuel Langhorne Clemens",
            "mark twain"
        ],
        "birth_date": "30 November 1835",
        "death_date": "21 April 1910",
        "key": "/authors/OL18319A",
        "name": "Mark Twain Bro",
        "top_subjects": [
            "Voyages around the world",
            "Quotations",
            "Protected DAISY",
            "Mississippi River",
            "Mark Twain (1835-1910)",
            "Joan of Arc, Saint (1412-1431)",
            "In library",
            "Description and travel",
            "Christian Science",
            "American Quotations"
        ],
        "top_work": "The wit & wisdom of Mark Twain",
        "type": "author",
        "work_count": 3
    }
}
"commit": {}}
PYTHONPATH=. python ./openlibrary/solr/update.py --ol-url 'localhost:8080' --ol-config './config/openlibrary.yml' --solr-next --update pprint '/works/OL53924W'
{"delete": [
    "/works/ia:completeworksofm10twai",
    "/works/ia:completeworksofm21twai",
    "/works/ia:completeworksofm01twaiiala",
    "/works/ia:completeworksofm00twai"
]
"add": {
    "doc": {
        "alternative_subtitle": [
            "the prince and the pauper.",
            "a tramp abroad."
        ],
        "alternative_title": [
            "The complete works of Mark Twain.: a tramp abroad.",
            "The complete works of Mark Twain [pseud.]",
            "The complete works of Mark Twain.: the prince and the pauper.",
            "The Complete works of Mark Twain.",
            "The complete works of Mark Twain [pseud.]. violet glue flow blubber truck",
            "The complete works of Mark Twain",
            "Haeske Collection.",
            "The complete works of Mark Twain green"
        ],
        "author_alternative_name": [
            "Twain, Mark (Spirit)",
            "TWAIN",
            "Samuel Langhorne Clemens (Mark Twain)",
            "Mark TWAIN",
            "mark twain",
            "Twain, Mark (pseud)",
            "Samuel Langhorne Clemens",
            "Twain",
            "Twain, Mark, 1835-1910",
            "M. Twain",
            "Mark (Samuel L. Clemens) Twain"
        ],
        "author_facet": [
            "OL18319A Mark Twain Mary"
        ],
        "author_key": [
            "OL18319A"
        ],
        "author_name": [
            "Mark Twain Mary"
        ],
        "ebook_access": "public",
        "ebook_count_i": 4,
        "edition_count": 6,
        "edition_key": [
            "OL22782947M",
            "OL24197430M",
            "OL24197450M",
            "OL7037695M",
            "OL13569660M",
            "OL24197475M"
        ],
        "editions": [
            {
                "key": "/books/OL22782947M",
                "type": "edition",
                "title": "The complete works of Mark Twain [pseud.]",
                "alternative_title": [
                    "The complete works of Mark Twain [pseud.]"
                ],
                "language": [
                    "eng"
                ],
                "publisher": [
                    "Harper"
                ],
                "publish_date": [
                    "1875"
                ],
                "publish_year": [
                    1875
                ],
                "ebook_access": "no_ebook",
                "has_fulltext": false,
                "public_scan_b": false
            },
            {
                "key": "/books/OL24197430M",
                "type": "edition",
                "title": "The complete works of Mark Twain [pseud.]. violet glue flow blubber truck",
                "alternative_title": [
                    "The complete works of Mark Twain [pseud.]. violet glue flow blubber truck"
                ],
                "language": [
                    "eng"
                ],
                "publisher": [
                    "Harper & Bros."
                ],
                "publish_date": [
                    "1909"
                ],
                "publish_year": [
                    1909
                ],
                "ia": [
                    "completeworksofm10twai"
                ],
                "ia_collection": [
                    "americana",
                    "internetarchivebooks"
                ],
                "ia_box_id": [
                    "IA111102"
                ],
                "ebook_access": "public",
                "has_fulltext": true,
                "public_scan_b": true
            },
            {
                "key": "/books/OL24197450M",
                "type": "edition",
                "title": "The complete works of Mark Twain.",
                "subtitle": "the prince and the pauper.",
                "alternative_title": [
                    "The complete works of Mark Twain.: the prince and the pauper."
                ],
                "language": [
                    "und"
                ],
                "publisher": [
                    "Harper"
                ],
                "publish_date": [
                    "1909"
                ],
                "publish_year": [
                    1909
                ],
                "ia": [
                    "completeworksofm21twai"
                ],
                "ia_collection": [
                    "americana",
                    "internetarchivebooks"
                ],
                "ia_box_id": [
                    "IA111102"
                ],
                "ebook_access": "public",
                "has_fulltext": true,
                "public_scan_b": true
            },
            {
                "key": "/books/OL7037695M",
                "type": "edition",
                "title": "The complete works of Mark Twain green",
                "alternative_title": [
                    "The complete works of Mark Twain green"
                ],
                "publisher": [
                    "Harper & Brothers"
                ],
                "publish_date": [
                    "1907"
                ],
                "publish_year": [
                    1907
                ],
                "ia": [
                    "completeworksofm01twaiiala"
                ],
                "ia_collection": [
                    "cdl",
                    "americana"
                ],
                "ebook_access": "public",
                "has_fulltext": true,
                "public_scan_b": true
            },
            {
                "key": "/books/OL13569660M",
                "type": "edition",
                "title": "The Complete works of Mark Twain.",
                "alternative_title": [
                    "Haeske Collection.",
                    "The Complete works of Mark Twain."
                ],
                "language": [
                    "eng"
                ],
                "publisher": [
                    "Harper & Brothers"
                ],
                "publish_date": [
                    "1922"
                ],
                "publish_year": [
                    1922
                ],
                "ebook_access": "no_ebook",
                "has_fulltext": false,
                "public_scan_b": false
            },
            {
                "key": "/books/OL24197475M",
                "type": "edition",
                "title": "The complete works of Mark Twain.",
                "subtitle": "a tramp abroad.",
                "alternative_title": [
                    "The complete works of Mark Twain.: a tramp abroad."
                ],
                "language": [
                    "und"
                ],
                "publisher": [
                    "Harper"
                ],
                "publish_date": [
                    "1921"
                ],
                "publish_year": [
                    1921
                ],
                "ia": [
                    "completeworksofm00twai"
                ],
                "ia_collection": [
                    "internetarchivebooks",
                    "americana"
                ],
                "ia_box_id": [
                    "IA114818"
                ],
                "ebook_access": "public",
                "has_fulltext": true,
                "public_scan_b": true
            }
        ],
        "first_publish_year": 1875,
        "has_fulltext": true,
        "ia": [
            "completeworksofm10twai",
            "completeworksofm21twai",
            "completeworksofm01twaiiala",
            "completeworksofm00twai"
        ],
        "ia_collection": [
            "americana",
            "cdl",
            "internetarchivebooks"
        ],
        "ia_collection_s": "americana;cdl;internetarchivebooks",
        "key": "/works/OL53924W",
        "language": [
            "eng",
            "und"
        ],
        "last_modified_i": 1702551604,
        "lcc": [
            "PS-1300.00000000.F11"
        ],
        "lcc_sort": "PS-1300.00000000.F11",
        "lending_edition_s": "OL24197430M",
        "lending_identifier_s": "completeworksofm10twai",
        "number_of_pages_median": 26,
        "oclc": [
            "310756436",
            "28639178",
            "310756332",
            "18314842"
        ],
        "public_scan_b": true,
        "publish_date": [
            "1907",
            "1909",
            "1922",
            "1921",
            "1875"
        ],
        "publish_place": [
            "New York"
        ],
        "publish_year": [
            1921,
            1922,
            1907,
            1875,
            1909
        ],
        "publisher": [
            "Harper & Bros.",
            "Harper & Brothers",
            "Harper"
        ],
        "seed": [
            "/books/OL22782947M",
            "/books/OL24197430M",
            "/books/OL24197450M",
            "/books/OL7037695M",
            "/books/OL13569660M",
            "/books/OL24197475M",
            "/works/OL53924W",
            "/authors/OL18319A",
            "/subjects/voyages_around_the_world",
            "/subjects/description_and_travel",
            "/subjects/christian_science",
            "/subjects/person:joan_of_arc_saint_(1412-1431)",
            "/subjects/place:mississippi_river"
        ],
        "title": "The complete works of Mark Twain",
        "type": "work",
        "subject": [
            "Voyages around the world",
            "Description and travel",
            "Christian Science"
        ],
        "subject_facet": [
            "Voyages around the world",
            "Description and travel",
            "Christian Science"
        ],
        "subject_key": [
            "voyages_around_the_world",
            "description_and_travel",
            "christian_science"
        ],
        "place": [
            "Mississippi River"
        ],
        "place_facet": [
            "Mississippi River"
        ],
        "place_key": [
            "mississippi_river"
        ],
        "person": [
            "Joan of Arc, Saint (1412-1431)"
        ],
        "person_facet": [
            "Joan of Arc, Saint (1412-1431)"
        ],
        "person_key": [
            "joan_of_arc_saint_(1412-1431)"
        ],
        "ia_loaded_id": [
            "completeworksofm12twaiiala"
        ],
        "ia_box_id": [
            "IA114818",
            "IA111102"
        ],
        "readinglog_count": 0,
        "want_to_read_count": 0,
        "currently_reading_count": 0,
        "already_read_count": 0
    }
}
"commit": {}}
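
The pprint output above has the general shape of a Solr JSON update request: a list of keys to delete, a document to add, and a commit. As a rough sketch, such a body could be assembled as shown below. The helper `build_update_body` is hypothetical and is not part of openlibrary.solr; the sample document is trimmed to two fields.

```python
import json
from typing import Any, Optional


def build_update_body(
    doc: dict[str, Any],
    deletes: Optional[list[str]] = None,
    commit: bool = True,
) -> str:
    """Assemble a JSON update body with optional delete and commit commands."""
    body: dict[str, Any] = {}
    if deletes:
        # Keys to delete are listed before the add, as in the output above.
        body["delete"] = deletes
    body["add"] = {"doc": doc}
    if commit:
        body["commit"] = {}
    return json.dumps(body, indent=4)


body = build_update_body(
    doc={"key": "/works/OL53924W", "type": "work"},
    deletes=["/works/ia:completeworksofm10twai"],
)
print(body)
```

Unlike the pretty-printed output above, `json.dumps` emits strictly valid JSON, with commas between the top-level commands.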

Screenshot

Stakeholders

@cdrini cdrini force-pushed the refactor/update-work branch from ef89973 to 1725ebc December 14, 2023 11:03
@cdrini cdrini changed the title Reorganize update_work for easier expansion Re-architect openlibrary.solr / update_work for easier expansion Dec 14, 2023
@cdrini cdrini force-pushed the refactor/update-work branch from eb09199 to 5aaed5e December 14, 2023 11:19
@cdrini cdrini marked this pull request as ready for review December 14, 2023 11:26
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0d5acea) 16.68% compared to head (08d93ed) 16.68%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #8618   +/-   ##
=======================================
  Coverage   16.68%   16.68%           
=======================================
  Files          88       88           
  Lines        4680     4680           
  Branches      835      835           
=======================================
  Hits          781      781           
  Misses       3383     3383           
  Partials      516      516           

@cdrini cdrini force-pushed the refactor/update-work branch from 08d93ed to 3baf93d December 14, 2023 13:17
@cdrini cdrini mentioned this pull request Dec 14, 2023
@mekarpeles
Member

LGTM. It's possible things will come up during testing, but @cdrini and I ran through the code architecture and it looks like a great step in the right direction.

@mekarpeles mekarpeles merged commit 8e3551f into internetarchive:master Dec 14, 2023
4 checks passed