Store split modulestore's course indexes in Django/MySQL #27565
Thanks for the pull request, @bradenmacdonald! I've created OSPR-5771 to keep track of it in JIRA. As a core committer in this repo, you can merge this once the pull request is approved per the core committer reviewer requirements and according to the agreement with your edX Champion.
f9debd1 to 0b27c41
@ormsbee @doctoryes FYI
FYI @kdmccormick as well.
Thanks @bradenmacdonald! :-) I'm really excited by the potential to simplify our stack here. Quick thoughts:
Product & Support
Performance (FYI @edx/perf-interest)
Technical Annoyances with Migrating
Pruning
@ormsbee Thanks for these very helpful thoughts. I'll give you a detailed reply once I've investigated this further and have more useful info.
@bradenmacdonald Thank you for your contribution!
This is really exciting! @ormsbee three random thoughts:
|
@kdmccormick: I'm pushing to get modulestore queries out of other LMS apps, but I think the courseware app in LMS is going to use modulestore for quite a while.
Until the hypothetical |
Thoughts pre-review:
|
I have made a slight improvement to the history so that it at least shows which branches are modified as part of each edit: I also did some tests on this:
So when looking at the change history, you can tell a bit about what's going on: anything that says "Updated draft and published branch" is a structural change to a section/subsection. Anything that says "Updated draft branch" is editing a unit or XBlock. And "Updated published branch" is when changes to a unit were published.
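As a rough illustration of the history messages described above, here is a minimal sketch that derives the summary from the branch versions before and after an edit. The function name `describe_branch_changes`, the simplified branch names, and the dict shape are hypothetical and not taken from this PR; only the message pattern follows the comment above.

```python
def describe_branch_changes(old_versions, new_versions):
    """Return a history message naming the branches whose version changed.

    Both arguments map a branch name to a structure version ID
    (a hypothetical, simplified shape).
    """
    changed = [
        branch
        for branch in sorted(new_versions)
        if old_versions.get(branch) != new_versions[branch]
    ]
    if not changed:
        return "No branches updated"
    return f"Updated {' and '.join(changed)} branch"


before = {"draft": "v1", "published": "v1"}
unit_edit = describe_branch_changes(before, {"draft": "v2", "published": "v1"})
structural_edit = describe_branch_changes(before, {"draft": "v2", "published": "v2"})
publish = describe_branch_changes({"draft": "v2", "published": "v1"}, {"draft": "v2", "published": "v2"})
```

Here `unit_edit` touches only the draft branch, `structural_edit` touches both, and `publish` touches only the published branch, mirroring the three cases listed above.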
Yes, hopefully :)
Good point, thanks.
@bradenmacdonald Is this ready for another look by edX?
@natabene Not quite yet, I still have a couple small issues to solve, and I haven't gotten all the tests passing again. I will ping when ready :)
6b1ff76 to f3e9212
@ormsbee @doctoryes I've been slowly pushing this PR forward and now have almost all the tests passing, except for the |
Yeah, Django-independent modulestore/XModule was an idea that never really panned out. I'd be fine with having those tests run Django as well.
Agreed. Django is already entrenched into the modulestore (settings, signals, i18n, etc.), so seems fine to me.
📣 💥 Heads-up: You must either rebase onto master or merge master into your branch to avoid breaking the build. We recently removed diff-quality and introduced lint-amnesty. This means that the automated quality check that has run on your branch doesn't work the same way it will on master. If you have introduced any quality failures, they might pass on the PR but then break the build on master. This branch has been detected to not have commit 2e33565 as an ancestor. Here's how to see for yourself:
If you have any questions, please reach out to the Architecture team (either #edx-shared-architecture on Open edX Slack or #architecture on edX internal).
f3e9212 to e0a2346
Split modulestore persists data in three MongoDB "collections": course_index (list of courses and the current version of each), structure (outline of the courses, and some XBlock fields), and definition (other XBlock fields). While "structure" and "definition" data can get very large, which is one of the reasons MongoDB was chosen for modulestore, the course index data is very small.

By moving course index data to MySQL / a django model, we get these advantages:
* Full history of changes to the course index data is now preserved
* Includes a django admin view to inspect the list of courses and libraries
* It's much easier to "reset" a corrupted course to a known working state, by using the simple-history revert tools from the django admin.
* The remaining MongoDB collections (structure and definition) are essentially just used as key-value stores of large JSON data structures. This paves the way for future changes that allow migrating courses one at a time from MongoDB to S3, and thus eliminating any use of MongoDB by split modulestore, simplifying the stack.
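To make the layout and the "reset to a known working state" idea concrete, here is a minimal in-memory sketch. The class `CourseIndexHistory`, its methods, and the sample keys are hypothetical stand-ins; the PR itself uses a Django model with django-simple-history, which this sketch does not reproduce.

```python
import copy


class CourseIndexHistory:
    """Toy course index that keeps every past state so a course can be reverted."""

    def __init__(self):
        self._current = {}  # course key -> {"versions": {branch: structure_id}}
        self._history = {}  # course key -> list of earlier states, oldest first

    def save(self, course_key, versions):
        # Archive the state being replaced before overwriting it.
        if course_key in self._current:
            self._history.setdefault(course_key, []).append(self._current[course_key])
        self._current[course_key] = {"versions": dict(versions)}

    def get(self, course_key):
        return copy.deepcopy(self._current[course_key])

    def revert(self, course_key, steps=1):
        """Restore the state from `steps` saves ago; the revert is itself recorded."""
        past = self._history[course_key][-steps]
        self.save(course_key, past["versions"])


# The large collections behave like key-value stores: ID -> JSON-like document.
structures = {
    "s1": {"blocks": ["chapter1"]},
    "s2": {"blocks": ["chapter1", "chapter2"]},
}

index = CourseIndexHistory()
index.save("course-v1:Demo+X+2021", {"published": "s1"})
index.save("course-v1:Demo+X+2021", {"published": "s2"})
index.revert("course-v1:Demo+X+2021")  # point the course back at structure s1
current = structures[index.get("course-v1:Demo+X+2021")["versions"]["published"]]
```

Because the index only stores small pointers into the structure store, reverting a course is a cheap update to one small record rather than a rewrite of the large documents.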
ffb9ac5 to 5de35f7
Your PR has finished running tests. There were no failures.
@ormsbee have you had a chance to take a final look at this? Let me know if you need anything.
Planning to merge and push to prod in the morning. I'm just trying to figure out if we can easily pause Studio for a bit so that we're not actively writing new data in the narrow window between when the migration runs and the new code is live on Studio (so that we don't get writes that get "lost" in MongoDB-only during the gap).
library_version = models.CharField(max_length=255, null=False, blank=True)

# Wiki slug for this course
wiki_slug = models.CharField(max_length=255, db_index=True, blank=True)
Comment (no action needed): It pains me that wiki_slug exists at this layer of the code. 😛
# ModuleStoreEnum.UserID.*) are not real user IDs.
edited_by_id = models.IntegerField(null=True)
edited_on = models.DateTimeField()
# last_update is different from edited_on, and is used only to prevent collisions?
Right, it's probably intended to stop some race condition where you have conflicting writes with slow transactions, but I don't know how effective it's been in practice.
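One general way a `last_update` field can guard against conflicting writes is as a compare-and-set token: a write succeeds only if the row still carries the value the writer originally read. The sketch below shows that pattern with an in-memory dict. It is an assumption about the intent, not a description of what split modulestore does, and `update_if_unchanged` is a hypothetical name.

```python
import itertools

_clock = itertools.count(1)  # monotonic stand-in for a timestamp source


def update_if_unchanged(table, key, expected_last_update, new_fields):
    """Apply new_fields only if the row's last_update matches what the caller read."""
    row = table[key]
    if row["last_update"] != expected_last_update:
        return False  # another writer got there first; re-read and retry
    row.update(new_fields)
    row["last_update"] = next(_clock)
    return True


table = {"course-1": {"wiki_slug": "old", "last_update": next(_clock)}}
seen = table["course-1"]["last_update"]  # both writers read the same token
first = update_if_unchanged(table, "course-1", seen, {"wiki_slug": "a"})
second = update_if_unchanged(table, "course-1", seen, {"wiki_slug": "b"})
```

The first write succeeds and advances the token, so the second write, still holding the stale token, is rejected instead of silently overwriting the first.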
parent_location=self.chapter.location,
# ^ It is very important that we use parent_location=self.chapter.location (and not parent=self.chapter), as
# chapter is a class attribute and passing it by value will update its .children=[] which will then leak
# into other tests and cause errors if the children no longer exist.
Great comment! It's a 🤮 situation, but great comment!
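The hazard described in that code comment can be reproduced without any modulestore code. In this sketch, `Block`, `create_item`, and `SharedFixture` are hypothetical stand-ins for the test factories and test class: passing the shared parent object lets the helper mutate its `children`, while passing only the location leaves the shared object untouched.

```python
class Block:
    def __init__(self, location):
        self.location = location
        self.children = []


def create_item(location, parent=None, parent_location=None):
    """Create a child block; only the `parent` form mutates the object passed in."""
    child = Block(location)
    if parent is not None:
        parent.children.append(child.location)  # side effect on the caller's object
    return child


class SharedFixture:
    chapter = Block("chapter-1")  # class attribute: one object shared by every test


def run_with_parent_location():
    create_item("unit-b", parent_location=SharedFixture.chapter.location)


def run_with_parent_object():
    create_item("unit-a", parent=SharedFixture.chapter)


run_with_parent_location()
after_location = list(SharedFixture.chapter.children)  # shared state untouched
run_with_parent_object()
after_object = list(SharedFixture.chapter.children)  # shared state now polluted
```

Any later test that reads `SharedFixture.chapter.children` would see `unit-a` even though that block was created by a different test, which is the leak the comment warns about.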
@ormsbee, @kdmccormick: thought you might like to know that bradenmacdonald merged this pull request.
@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.
EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production.
EdX Release Notice: This PR has been deployed to the production environment.
Description
This change modifies split modulestore so that split's course index data is read from and written to MySQL instead of MongoDB.
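Background: Split modulestore persists data in three MongoDB "collections": course_index (list of courses and the current version of each), structure (outline of the courses, and some XBlock fields), and definition (other XBlock fields). While "structure" and "definition" data can get very large, which is one of the reasons MongoDB was chosen for modulestore, the course index data is very small.
By moving course index data to MySQL / a django model, we get these advantages:
* Full history of changes to the course index data is now preserved (django-simple-history)
* Includes a django admin view to inspect the list of courses and libraries
* It's much easier to "reset" a corrupted course to a known working state, by using the simple-history revert tools from the django admin.
* The remaining MongoDB collections (structure and definition) are essentially just used as key-value stores of large JSON data structures.
Deadline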
None
Performance
It is not yet known whether this approach is faster or slower. Writes are expected to be slower at first because each write goes through to both MySQL and MongoDB (see below).
Migration
This PR is written for immediate cutover: it includes a data migration that copies all course indexes into MySQL, and it starts doing all reads exclusively from MySQL. However, writes are persisted to both MySQL and MongoDB so that this PR can be reverted without any loss of data.
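The cutover strategy above (a one-time copy, reads from the new store only, writes to both) can be sketched as follows. The two dicts stand in for the MySQL table and the MongoDB collection, and `DualWriteIndex` is a hypothetical name, not the class used in this PR.

```python
class DualWriteIndex:
    """Reads come from the new store only; writes go to both stores."""

    def __init__(self, new_store, legacy_store):
        self.new_store = new_store  # stand-in for the MySQL-backed model
        self.legacy_store = legacy_store  # stand-in for the MongoDB collection

    def migrate(self):
        """One-time copy of every existing legacy record into the new store."""
        for key, value in self.legacy_store.items():
            self.new_store.setdefault(key, dict(value))

    def read(self, key):
        return self.new_store[key]

    def write(self, key, value):
        self.new_store[key] = dict(value)
        self.legacy_store[key] = dict(value)  # keeps a rollback to the old code lossless


legacy = {"course-1": {"published": "s1"}}
index = DualWriteIndex(new_store={}, legacy_store=legacy)
index.migrate()
index.write("course-1", {"published": "s2"})
index.write("course-2", {"published": "s9"})
# After a rollback, the old code would read `legacy` directly and still see both writes.
```

The cost of this safety net is the extra write per update, which is why writes are expected to be slower until the MongoDB write-through is removed.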
Testing instructions
Create, edit, and learn from both courses and content libraries using Studio. Ensure draft+preview+publish works.
Try rolling back to a version before this PR (that uses Mongo only) and verify that the new updates you made to the course are still there. (This tests that the double-write to MySQL+MongoDB is working, to make this safer to deploy and easier to roll back.)