Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store split modulestore's course indexes in Django/MySQL #27565

Merged
merged 1 commit into from
Oct 7, 2021

Conversation

bradenmacdonald
Copy link
Contributor

@bradenmacdonald bradenmacdonald commented May 7, 2021

Description

This change modifies spit modulestore so that split's course index data is read from and written to MySQL instead of MongoDB.

Background: Split modulestore persists data in three MongoDB "collections": course_index (list of courses and the current version of each), structure (outline of the courses, and some XBlock fields), and definition (other XBlock fields). While "structure" and "definition" data can get very large, which is one of the reasons MongoDB was chosen for modulestore, the course index data is very small.

By moving course index data to MySQL / a django model, we get these advantages:

  • Full history of changes to the course index data is preserved (using django-simple-history)
  • It's much easier to inspect the list of courses and libraries from the django admin:
    Screen Shot 2021-05-07 at 3 18 59 PM
  • It's much easier to "reset" a corrupted course to a known working state, by using the simple-history revert tools from the django admin.
  • The remaining MongoDB collections (structure and definition) are essentially just used as key-value stores of large JSON data structures.
    • This PR paves the way for a future changes that allows migrating courses one at a time from MongoDB to S3, and thus eliminating any use of MongoDB by split modulestore, simplifying the stack.

Deadline

None

Performance

It is yet unknown whether this approach is faster or slower. Writes are expected to be slower at first due to writing through to both MySQL and MongoDB (see below).

Migration

This PR is written for immediate cutover: it includes a data migration that copies all course indexes into MySQL, and it starts doing all reads exclusively from MySQL. However, writes are persisted to both MySQL and MongoDB so that this PR can be reverted without any loss of data.

Testing instructions

Create, edit, and learn from both courses and content libraries using Studio. Ensure draft+preview+publish works.

Try rolling back to a version before this PR (that uses Mongo only) and verify that the new updates you made to the course are still there. (This tests that the double-write to MySQL+MongoDB is working, to make this safer to deploy and easier to roll back.)

@openedx-webhooks
Copy link

Thanks for the pull request, @bradenmacdonald! I've created OSPR-5771 to keep track of it in JIRA.

As a core committer in this repo, you can merge this once the pull request is approved per the core committer reviewer requirements and according to the agreement with your edX Champion.

@openedx-webhooks openedx-webhooks added core committer open-source-contribution PR author is not from Axim or 2U waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels May 7, 2021
@bradenmacdonald bradenmacdonald force-pushed the braden/course-indexes-mysql branch from f9debd1 to 0b27c41 Compare May 7, 2021 22:39
@bradenmacdonald
Copy link
Contributor Author

@ormsbee @doctoryes FYI

@ormsbee
Copy link
Contributor

ormsbee commented May 8, 2021

FYI @kdmccormick as well.

@ormsbee
Copy link
Contributor

ormsbee commented May 8, 2021

Thanks @bradenmacdonald! :-) I'm really excited by the potential to simplify our stack here.

Quick thoughts:

Product & Support

  • Yay, Django admin! ❤️
  • For the rollback scenario, having a little more metadata about the change might help guide the support team. For instance, the user doing the change is written to the structure document, but perhaps it could be written at this layer as well? Some shorthand summary of the change would be ideal, but probably too much trouble. FYI @mondiaz, since a revert scenario came up quite recently.
  • People have repeatedly asked for an activity log to know when other people are working on the course, since Studio doesn't behave that well with concurrent editing. Having a data model like this might be able to expose that in a lightweight way in the future. (Heads up @marcotuts)
  • Histories can get very long, particularly for structure documents in the draft branch, since every container modification auto-publishes (for backwards compatibility with Old Mongo, which had to do that for technical reasons, sigh...). I guess in the rollback scenario, we can care about a fairly limited subset of those, since you're likely to roll back to yesterday, not five years ago.

Performance

(FYI @edx/perf-interest)

  • I wouldn't be too worried about performance for this first piece. Writes are rare and reads are almost always single row. Throw in a request cache so that repeat requests by the same transaction don't hit MySQL, and it should be smooth sailing.
  • In the longer term, switching over to S3 for the modulestore's KV-store would mean that we would have to start aggressively caching definition documents or the variability of S3 latency will kill us under load. I'm not sure if multi-fetch is practical here, but we should look at anything we can to reduce the number of separate S3 requests we need to make with a cold cache.

Technical Annoyances with Migrating

  • Totally agree on writing to both places in case we need to roll back.
  • This is probably going to break hundreds of tests for query counts. >_<
  • The test data reset behavior of MongoDB and MySQL are different, which could cause you a lot of headaches for test suites that assume content stays the same while wiping out all the other data from one test to the next. This might be worth pushing all the way through and making an in-memory KV-store for structure and definition docs and just removing MongoDB from the test story entirely in Split (and entirely from edx-platform when Old Mongo is killed).
  • We may see other transactional weirdness with having the version updates live in MySQL. I think that Studio still runs with implicit per-view transactions. We might get into funny states where the version updates writes through to both MySQL and MongoDB, but the MySQL change gets rolled back due to an error later on in the view, while the MongoDB one stays at the higher version.

Pruning

  • We prune old structure documents with the structures.py tubular script. It's possible that we'll simply want to stop pruning altogether if things move over to S3 where data storage and backups are cheaper. If we did do pruning though, it would probably go substantially slower via S3.
  • Pruning also means that we will end up with simple_history references to structures that no longer exist anywhere.
  • As long as write-through is maintained, pruning should still function after this first part.

@bradenmacdonald
Copy link
Contributor Author

@ormsbee Thanks for these very helpful thoughts. I'll give you a detailed reply once I've investigated this further and have more useful info.

@natabene
Copy link
Contributor

@bradenmacdonald Thank you for your contribution!

@kdmccormick
Copy link
Member

This is really exciting! @ormsbee three random thoughs:

  • How close do you think we are to making LMS moduletore-independent? Would it be overly optimistic to put the new split_moudlestore_django app in cms (even though it'd also be installed into LMS for now)?
  • Another win that this PR would give us a potential CMS-only firing-off point for Django admin actions like the learning_sequences backfill.
  • In a full LMS/CMS split, I see this filling a role as a stripped-down CourseOverview for CMS, allowing course_overviews to become LMS-only.

@ormsbee
Copy link
Contributor

ormsbee commented May 11, 2021

@kdmccormick: I'm pushing to get modulestore queries out of other LMS apps, but I think the courseware app in LMS is going to use modulestore for quite a while.

@kdmccormick
Copy link
Member

@kdmccormick: I'm pushing to get modulestore queries out of other LMS apps, but I think the courseware app in LMS is going to use modulestore for quite a while.

Until the hypothetical learning_units app exists, right? That usage feels isolated enough that I would hazard keeping new modulestore-related code in cms/. Anyway, I digress. That's not important for this PR; moving directories is easy enough.

@doctoryes
Copy link
Contributor

Thoughts pre-review:

  • As @ormsbee stated above, Studio and the Split modulestore does not handle concurrent editing. Basically, it's the situation described here: http://doctoryes.github.io/mug_talk_modulestore/#167 . This MySQL involvement will be nice in that all active versions will be captured - which won't fix the concurrency problem but could make it easier to detect when a lost course branch occurred via simple-history?
  • When the active versions are eventually removed from MongoDB, the platform will be in a situation where you'd need both MySQL and MongoDB to construct the current draft/active versions of the course. Seems like a necessary interim step - but should be called out when the time comes.
  • This seems like a solid first step!

@bradenmacdonald
Copy link
Contributor Author

bradenmacdonald commented May 12, 2021

@ormsbee

having a little more metadata about the change might help guide the support team. For instance, the user doing the change ...
People have repeatedly asked for an activity log to know when other people are working on the course, since Studio doesn't behave that well with concurrent editing. Having a data model like this might be able to expose that in a lightweight way in the future. (Heads up @marcotuts)
Histories can get very long, particularly for structure documents in the draft branch, since every container modification auto-publishes (for backwards compatibility with Old Mongo, which had to do that for technical reasons, sigh...). I guess in the rollback scenario, we can care about a fairly limited subset of those, since you're likely to roll back to yesterday, not five years ago.

I have made a slight improvement to the history so that it at least shows which branches are modified as part of each edit:

Screen Shot 2021-05-12 at 12 04 27 PM

I also did some tests on this:

  • Adding a new section or subsection each generates only a single new historical record, or two if you customize the title (which you usually would). The draft and published branches get updated simultaneously.
  • Adding a new unit generates one record, or two if you customize the unit title. Only the draft branch is updated until you publish.
  • Adding a new XBlock or making a single edit to an XBlock each generate one record. Only the draft branch is updated until you publish.
  • Publishing a unit generates a single record; only the published branch is updated, as expected.

So when looking at the change history, you can tell a bit about what's going on: anything that says "Updated draft and published branch" is a structural change to a section/subsection. Anything that says "Updated draft branch" is editing a unit or XBlock. And "Updated published branch" is when changes to a unit were published.

@doctoryes

This MySQL involvement will be nice in that all active versions will be captured - which won't fix the concurrency problem but could make it easier to detect when a lost course branch occurred via simple-history?

Yes, hopefully :)

When the active versions are eventually removed from MongoDB, the platform will be in a situation where you'd need both MySQL and MongoDB to construct the current draft/active versions of the course. Seems like a necessary interim step - but should be called out when the time comes.

Good point, thanks.

@natabene natabene removed the waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. label May 24, 2021
@natabene
Copy link
Contributor

@bradenmacdonald Is this ready for another look by edX?

@bradenmacdonald
Copy link
Contributor Author

@natabene Not quite yet, I still have a couple small issues to solve, and I haven't gotten all the tests passing again. I will ping when ready :)

@bradenmacdonald bradenmacdonald force-pushed the braden/course-indexes-mysql branch 4 times, most recently from 6b1ff76 to f3e9212 Compare June 8, 2021 17:01
@bradenmacdonald
Copy link
Contributor Author

@ormsbee @doctoryes I've been slowly pushing this PR forward and now have almost all the tests passing, except for the commonlib-unit/.xmodule.xmodule... tests that test modulestore without loading Django. How do you think I should handle those? Can we change them to run with Django as all the other tests do?

@ormsbee
Copy link
Contributor

ormsbee commented Jun 9, 2021

@ormsbee @doctoryes I've been slowly pushing this PR forward and now have almost all the tests passing, except for the commonlib-unit/.xmodule.xmodule... tests that test modulestore without loading Django. How do you think I should handle those? Can we change them to run with Django as all the other tests do?

Yeah, Django-independent modulestore/XModule was an idea that never really panned out. I'd be fine with having those tests run Django as well.

@doctoryes
Copy link
Contributor

@ormsbee @doctoryes I've been slowly pushing this PR forward and now have almost all the tests passing, except for the commonlib-unit/.xmodule.xmodule... tests that test modulestore without loading Django. How do you think I should handle those? Can we change them to run with Django as all the other tests do?

Agreed. Django is already entrenched into the modulestore (settings, signals, i18n, etc.), so seems fine to me.

@arch-bom-gocd-alerts
Copy link

📣 💥 Heads-up: You must either rebase onto master or merge master into your branch to avoid breaking the build.

We recently removed diff-quality and introduced lint-amnesty. This means that the automated quality check that has run on your branch doesn't work the same way it will on master. If you have introduced any quality failures, they might pass on the PR but then break the build on master.

This branch has been detected to not have commit 2e33565 as an ancestor. Here's how to see for yourself:

git merge-base --is-ancestor 2e335653 braden/course-indexes-mysql && echo "You're all set" || echo "Please rebase onto master or merge master to your branch"

If you have any questions, please reach out to the Architecture team (either #edx-shared-architecture on Open edX Slack or #architecture on edX internal).

@bradenmacdonald bradenmacdonald force-pushed the braden/course-indexes-mysql branch from f3e9212 to e0a2346 Compare June 16, 2021 18:34
Split modulestore persists data in three MongoDB "collections": course_index (list of courses and the current version of each), structure (outline of the courses, and some XBlock fields), and definition (other XBlock fields). While "structure" and "definition" data can get very large, which is one of the reasons MongoDB was chosen for modulestore, the course index data is very small.

By moving course index data to MySQL / a django model, we get these advantages:
* Full history of changes to the course index data is now preserved
* Includes a django admin view to inspect the list of courses and libraries
* It's much easier to "reset" a corrupted course to a known working state, by using the simple-history revert tools from the django admin.
* The remaining MongoDB collections (structure and definition) are essentially just used as key-value stores of large JSON data structures. This paves the way for future changes that allow migrating courses one at a time from MongoDB to S3, and thus eliminating any use of MongoDB by split modulestore, simplifying the stack.
@bradenmacdonald bradenmacdonald force-pushed the braden/course-indexes-mysql branch from ffb9ac5 to 5de35f7 Compare September 29, 2021 22:56
@edx-status-bot
Copy link

Your PR has finished running tests. There were no failures.

@bradenmacdonald
Copy link
Contributor Author

@ormsbee have you had a chance to take a final look at this? Let me know if you need anything.

Copy link
Contributor

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Planning to merge and push to prod in the morning. I'm just trying to figure out if we can easily pause Studio for a bit so that we're not actively writing new data in the narrow window between when the migration runs and the new code is live on Studio (so that we don't get writes that get "lost" in MongoDB-only during the gap).

library_version = models.CharField(max_length=255, null=False, blank=True)

# Wiki slug for this course
wiki_slug = models.CharField(max_length=255, db_index=True, blank=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment (no action needed): It pains me that wiki_slug exists at this layer of the code. 😛

# ModuleStoreEnum.UserID.*) are not real user IDs.
edited_by_id = models.IntegerField(null=True)
edited_on = models.DateTimeField()
# last_update is different from edited_on, and is used only to prevent collisions?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's probably intended to stop some race condition where you have conflicting writes with slow transactions, but I don't know how effective it's been in practice.

parent_location=self.chapter.location,
# ^ It is very important that we use parent_location=self.chapter.location (and not parent=self.chapter), as
# chapter is a class attribute and passing it by value will update its .children=[] which will then leak
# into other tests and cause errors if the children no longer exist.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment! It's a 🤮 situation, but great comment!

@ormsbee ormsbee merged commit 96e5ff8 into openedx:master Oct 7, 2021
@openedx-webhooks
Copy link

@ormsbee, @kdmccormick: thought you might like to know that bradenmacdonald merged this pull request.

@openedx-webhooks
Copy link

@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the production environment.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the production environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants