Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Open Syllabus Project totals to solr #8395

Merged

Conversation

RayBB
Copy link
Collaborator

@RayBB RayBB commented Oct 7, 2023

Closes #5969

Starts importing Open Syllabus Project totals to solr

Technical

Following the steps here.

Testing

  1. Download and unzip the open_syllabus_files (from Mek)
  2. Update theinput_directory in scripts/open_syllabus_project_parser.py to match unzipped folder
  3. PYTHONPATH=$(PWD) python3 scripts/open_syllabus_project_parser.py - from root
  4. Follow steps in making changes to solr config
  5. Visit http://localhost:8080/search.json?q=the and CMD+F for osp_count

Screenshot

N/A

Next Steps

Mek will use this data to help our educational selection :)

Stakeholders

@mekarpeles

@RayBB RayBB marked this pull request as ready for review October 7, 2023 02:28
@RayBB RayBB requested a review from mekarpeles October 7, 2023 02:34
Copy link
Member

@mekarpeles mekarpeles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first pass, this looks like it's in the right direction! Great work @RayBB

@mekarpeles mekarpeles added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Oct 9, 2023
@RayBB RayBB requested a review from mekarpeles October 10, 2023 22:14
@RayBB RayBB added the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label Oct 10, 2023
@RayBB
Copy link
Collaborator Author

RayBB commented Oct 10, 2023

@mekarpeles great! Let me know what the next steps are. If I should send you or Drini the db file or if you will just produce it yourselves using the script in the folder.

The original issue mentioned linking back to OSP. I don't have access to that data but would be happy to do that in the next PR if it's still desired.

Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiice! Excited to see this going into the solr 😊

openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
scripts/open_syllabus_project_parser.py Outdated Show resolved Hide resolved
openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
openlibrary/utils/open_syllabus_project.py Outdated Show resolved Hide resolved
scripts/open_syllabus_project_parser.py Outdated Show resolved Hide resolved
olid_int = olid.replace("/works/", "").replace("OL", "").replace("W", "")
try:
# Connect to the SQLite database
conn = sqlite3.connect(db_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Ray raises some potential perf concerns:

  • Is this going to slow down our indexing/full reindexing?
  • Is opening a new sqlite connection every time we call this fn a good idea? Or should we keep a long-lived connection open?
  • This is ~1,000,000 rows of data ; Since it's effectively just a map of int -> int, storing the whole thing in mem is also potentially an option and would use ~8mb (4bytes for each int). Is this something we want?

I'm not an expert on sqlite; Maybe we just time this function to see how long it takes? But I think the we should be fine. If we notice a perf issue we can easily turn this function off until we fix it!

@mekarpeles mekarpeles assigned mekarpeles and unassigned cdrini Dec 11, 2023
@RayBB RayBB force-pushed the feature/open_syllabus_project_totals branch from b577602 to 0e871a9 Compare January 27, 2024 00:54
@RayBB RayBB requested a review from cdrini January 27, 2024 01:46
@cdrini cdrini assigned cdrini and unassigned mekarpeles Jan 29, 2024
@cdrini cdrini removed the Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] label Jan 29, 2024
@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Jan 29, 2024
@codecov-commenter
Copy link

codecov-commenter commented Feb 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d0e2d47) 16.64% compared to head (da59c50) 16.57%.
Report is 46 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8395      +/-   ##
==========================================
- Coverage   16.64%   16.57%   -0.08%     
==========================================
  Files          88       89       +1     
  Lines        4692     4712      +20     
  Branches      836      841       +5     
==========================================
  Hits          781      781              
- Misses       3394     3411      +17     
- Partials      517      520       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cdrini cdrini removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Feb 12, 2024
@cdrini cdrini force-pushed the feature/open_syllabus_project_totals branch from fa3c17f to da59c50 Compare February 13, 2024 23:27
The OSP_DUMP_FILE environment variable was only accessible in a different stage of the jenkins pipeline. The name `osp_totals.db` is pretty unique so shouldn't be too hard to find int he future.
@RayBB
Copy link
Collaborator Author

RayBB commented Feb 14, 2024

@cdrini just took a look and all your changes make sense to me. 🚀

@cdrini cdrini merged commit c506c1b into internetarchive:master Feb 14, 2024
4 checks passed
@cdrini cdrini mentioned this pull request Feb 14, 2024
@cdrini
Copy link
Collaborator

cdrini commented Feb 14, 2024

Sweet ! Will be released at the next full reindex: #8234

python scripts/solr_updater.py $OL_CONFIG \
--state-file /solr-updater-data/$STATE_FILE \
--ol-url "$OL_URL" \
--osp-dump "$OSP_DUMP_LOCATION" \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdrini this is failing due to osp_dump being a positional argument. A similar thing is happening in the reindex-solr Makefile target.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah darnit! Good catch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: 1 Do this week, receiving emails, time sensitive, . [managed]
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Load Open Syllabus Project (OSP) IDs & curriculum counts into Open Library Solr
5 participants