[SE-3520] Adds waffle flag to enable overriding existing transcripts on import #268

@nizarmah nizarmah commented Nov 8, 2020

This pull request is a follow-up, related to this comment from a previous pull request.

Every time we import a new transcript, the existing and the new transcripts are compared by computing and comparing their content hashes.
This helps us identify whether the transcript uploaded is a duplicate of an already existing transcript, or if it is a different transcript that is being uploaded even though a transcript for the same video already exists.

Previously, if a transcript already existed for a video, the new transcript wouldn't get uploaded. Now, if the new transcript differs from the existing one for that video, it is uploaded and overrides the existing one.
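
As a rough illustration of the comparison described above, the sketch below hashes two transcript payloads with SHA-256 and treats equal digests as duplicates. The helper names `content_hash` and `is_duplicate` are hypothetical and are not taken from edx-val; SHA-256 is assumed from the docstring discussed later in the review.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a transcript's raw bytes (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def is_duplicate(existing: bytes, new: bytes) -> bool:
    """Two transcripts count as duplicates when their content hashes match."""
    return content_hash(existing) == content_hash(new)


existing = b"1\n00:00:00,000 --> 00:00:02,000\nHello\n"
same = b"1\n00:00:00,000 --> 00:00:02,000\nHello\n"
different = b"1\n00:00:00,000 --> 00:00:02,000\nHello, world\n"

print(is_duplicate(existing, same))       # True
print(is_duplicate(existing, different))  # False
```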

JIRA tickets: SE-3520, OSPR-5117

Discussions: GitHub comment on previous pull-request

Sandbox URL:

Installation Instructions:

  • Install using:
pip install --upgrade git+https://github.com/open-craft/edx-val.git@nizar/overriding_existing_non_duplicate_transcripts#egg=edxval
  • Apply migrations to LMS
/edx/bin/edxapp-migrate-lms

Testing instructions:

  • Testing using the Sandbox
    1. Log in to the sandbox with staff/edx.
    2. Go to Django Admin -> Waffle (django-waffle) -> Flag and activate the edxval.override_existing_imported_transcripts flag.
    3. Go to Studio and create a new course.
    4. Import one of the courses exported and attached below.
    5. Open the only unit in the course, edit the video, and download the transcript.
    6. Now import the other course, please.
    7. Open the only unit in the course, edit the video, and download the transcript.
    8. The transcript should be different from the one downloaded in step 5.
  • Testing using Makefile
    1. Run make requirements
    2. Then run make test
Test Courses to Import

Author notes and concerns:

  • The tox tests aren't broken by my changes; they fail for Python 3.8 with Django 3.0. I verified that the same problem currently occurs on master.

Reviewers

Settings

EDXAPP_EXTRA_REQUIREMENTS:
  - name: git+https://github.com/open-craft/edx-val.git@nizar/overriding_existing_non_duplicate_transcripts#egg=edxval

@openedx-webhooks

Thanks for the pull request, @nizarmah! I've created OSPR-5117 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@openedx-webhooks openedx-webhooks added the "needs triage" and "open-source-contribution" (PR author is not from Axim or 2U) labels Nov 8, 2020

@natabene

@nizarmah Thank you for your contribution. Please let me know once it is ready for our review.

@openedx-webhooks openedx-webhooks added the "waiting on author" (PR author needs to resolve review requests, answer questions, fix tests, etc.) label and removed the "needs triage" label Nov 10, 2020

@gabor-boros

👍 🎉

  • I tested what's in the test instructions
  • I read through the code
  • NA I checked for accessibility issues
  • NA Includes documentation
  • NA I made sure any change in configuration variables is reflected in the corresponding client's configuration-secure repository.

@nizarmah
Contributor Author

@natabene this is ready for edX's review 👍 😄

@natabene

@kashifch Could you please review this?

@openedx-webhooks openedx-webhooks added the "awaiting prioritization" label and removed the "waiting on author" (PR author needs to resolve review requests, answer questions, fix tests, etc.) label Nov 16, 2020

Contributor

@DawoudSheraz DawoudSheraz left a comment

Added a few comments. There can be some delay with this PR while some internal evaluation is done around this change. Thank You

edxval/utils.py Outdated
Comment on lines 271 to 280
def is_duplicate_file(uploaded_file, file_hash):
    """
    Checks file hash to know if its a duplicate file

    Arguments:
        uploaded_file (UploadedFile): File which will be used for hash generation
        file_hash (str): SHA256 file hash

    Returns:
        if file is duplicate (boolean)
    """

Contributor

The function design is a bit confusing. Either pass both arguments as hashes or files. I would prefer both args are files and the content hash is then used to check if the files are duplicate or not.

Contributor Author

This should be addressed now. The method takes two files instead, as you requested 👍
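
A minimal sketch of the two-file design requested above, assuming both arguments are binary file-like objects. The body of `is_duplicate_file` and the helper `file_content_hash` are illustrative guesses and are not the code merged in edx-val.

```python
import hashlib
import io


def file_content_hash(file_obj, chunk_size=64 * 1024):
    """Hash a binary file-like object in chunks, then restore its position."""
    start = file_obj.tell()
    file_obj.seek(0)
    digest = hashlib.sha256()
    # Read until an empty bytes object signals end of file.
    for chunk in iter(lambda: file_obj.read(chunk_size), b""):
        digest.update(chunk)
    file_obj.seek(start)
    return digest.hexdigest()


def is_duplicate_file(first_file, second_file):
    """Return True when both files have identical content."""
    return file_content_hash(first_file) == file_content_hash(second_file)


a = io.BytesIO(b"same transcript")
b = io.BytesIO(b"same transcript")
c = io.BytesIO(b"other transcript")
print(is_duplicate_file(a, b))  # True
print(is_duplicate_file(a, c))  # False
```

Taking two files keeps the hashing detail inside the helper, so callers never need to know which digest is used.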

Contributor Author

@DawoudSheraz I just realized that there's no need for any migrations or model changes related to the transcript content hash, since we compute the content hash when importing the transcript.

Accordingly, I added a commit which removes the field and its migrations completely.

If you'd like me to revert that change, please let me know.

Computing the content hash every time isn't very efficient, but it's definitely more straightforward from a code point of view.

edxval/api.py Outdated
Comment on lines 1190 to 1192
is_duplicate_file(
    new_transcript_content_file,
    existing_transcript.transcript_content_hash

Contributor

How will this be backward compatible (given that the default value is '')?

Contributor Author

If the hash is empty, then the file would not be treated as a duplicate, and it would be uploaded.

Would you prefer that I apply the generate_file_content_hash on each existing uploaded file instead? The migration would take longer but it would be backwards compatible.

I'll add a commit for that once I address the other comments; that way, if it isn't needed, we can just revert it.

Contributor Author

I have added backwards compatibility now 👍 sorry for not doing that earlier
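
One possible way to handle records whose stored hash is still the empty default is to derive the hash from the existing file on demand. This is only a sketch of the idea discussed in this thread; the function names and parameters are hypothetical, and the thread does not show how the PR actually implemented backwards compatibility.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(content).hexdigest()


def is_duplicate(new_content: bytes, stored_hash: str, existing_content: bytes) -> bool:
    """Compare against the stored hash, computing it on demand for legacy rows."""
    if not stored_hash:
        # Legacy record saved before hashes existed: derive the hash from the
        # stored file instead of treating '' as "never a duplicate".
        stored_hash = sha256_hex(existing_content)
    return sha256_hex(new_content) == stored_hash


legacy_existing = b"old transcript"
print(is_duplicate(b"old transcript", "", legacy_existing))  # True
print(is_duplicate(b"new transcript", "", legacy_existing))  # False
```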

edxval/models.py Outdated
Comment on lines 550 to 551
video_transcript.save()

Contributor

Why this save here?

Contributor Author

That's a line I changed and it went unnoticed by me, sorry about that.

@@ -1770,7 +1770,7 @@ def test_multiple_external_transcripts_for_language(self):

         # Verify transcript record is created with correct data i.e sub field transcript.
         expected_transcripts = [
-            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SRT, video_id=edx_video_id, language_code='en')
+            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SJSON, video_id=edx_video_id, language_code='en')

Contributor

Why is this changed?

Contributor Author

Because the behavior changed. Here's a more detailed explanation:

First, we need to identify that:

  • sub_transcript_file_name is an srt video transcript
  • ext_transcript_file_name is an sjson video transcript
# Import xml with empty edx video id.
edx_video_id = api.import_from_xml(
    etree.fromstring('<video_asset/>'),
    '',
    self.file_system,
    constants.EXPORT_IMPORT_STATIC_DIR,
    {
        'en': [sub_transcript_file_name, ext_transcript_file_name]
    }
)

If we take a look at the import code shown above, we can see that the same language_code was given two different files/transcripts.

Previously, if a transcript was already uploaded/imported for a certain video, then importing that same video with a new transcript did not overwrite it. So, the initially imported sub_transcript_file_name transcript does not get replaced, and thus the file remains an srt transcript.

Currently, if a transcript was already uploaded/imported for a certain video, then importing the same video with a new transcript will overwrite it if the content is different. Since the sub_transcript_file_name and ext_transcript_file_name transcripts have different contents, importing the ext_transcript_file_name transcript replaces the existing one, and the transcript becomes an sjson transcript.

Accordingly, we had to update the test 👍

I tried to keep it brief, so please let me know if you'd like me to go into more details 🙂
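
The behavior change explained above can be simulated with a small sketch. The function `import_transcripts` and its `override_existing` parameter are hypothetical stand-ins for the old behavior (False) and the new behavior (True); they are not edx-val APIs.

```python
import hashlib


def import_transcripts(files, override_existing):
    """Simulate importing (format, content) transcripts for one video and language."""
    stored = None
    for file_format, content in files:
        if stored is None:
            # Nothing imported yet: the first transcript is always kept.
            stored = (file_format, content)
            continue
        is_different = hashlib.sha256(content).digest() != hashlib.sha256(stored[1]).digest()
        if override_existing and is_different:
            stored = (file_format, content)
    return stored[0] if stored else None


files = [
    ("srt", b"0\n00:00:00,010 --> 00:00:00,100\nHi\n"),
    ("sjson", b'{"text": ["Hi"]}'),
]
print(import_transcripts(files, override_existing=False))  # srt
print(import_transcripts(files, override_existing=True))   # sjson
```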

Contributor

Thanks for the brief. I would say to update the test docstring to cover this change. It will be easier to follow in the future.

@nizarmah nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch 4 times, most recently from 2f09d1c to b23cb0d Compare November 19, 2020 23:59
@nizarmah
Contributor Author

Thanks a lot for your review @DawoudSheraz
I've addressed 3 of the comments you left with commits, and I clarified why the test was changed 👍

> There can be some delay with this PR while some internal evaluation is done around this change.

No worries 🙂 I totally understand

Please don't hold back any requests/changes that might be needed to accept the PR, even if I have to fully rework it

Contributor

@DawoudSheraz DawoudSheraz left a comment

Overall, good to go. Some nit comments to address before merging. Please update the edxval version in setup.py. Again, there might be a small delay in the PR merge and release. Thanks for your wait on the review.

Contributor

@DawoudSheraz DawoudSheraz left a comment

Please rebase and update the branch. Once done, the PR will be good to merge.

setup.py Outdated
@@ -46,7 +46,7 @@ def load_requirements(*requirements_paths):
     return list(requirements)


-VERSION = '1.4.3'
+VERSION = '1.4.4'

Contributor

Please rebase your branch and update the version to 1.4.5. The latest release is 1.4.4

Contributor Author

Yes, sorry about that, I missed your earlier comment about this 😳

@nizarmah nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch from adbfa05 to c0b2d86 Compare November 26, 2020 10:29
@nizarmah
Contributor Author

@DawoudSheraz I updated the version, as you requested.

Thanks a lot for your review! I hope you have a great day!

@DawoudSheraz
Contributor

@nizarmah Hi. A follow-up question. How will this change handle the rule "If video ID is already present in edxval then do not overwrite any video transcripts"? The transcript is associated with Video and not CourseVideo, so the changes will overwrite the transcript for the video in the original course too.

@nizarmah
Contributor Author

@DawoudSheraz Hmm, so this change would always overwrite the video transcript if the video ID is already present and the transcript is different from the existing one.
So yes, as you mentioned, this change will overwrite the transcript for all courses that use that video ID.

The reason behind this is the perspective I had which is: "Video Transcripts shouldn't be course specific, but specific to each Video. So if a Video's Transcript gets updated, it should get updated for every course that uses that Video."

However, my perspective alone isn't enough to settle that, and I'd love to know if you'd like me to make any changes to what I've done, because I definitely don't have enough context to make such a decision.

Hope this answers your question, and I'm looking forward to any tweak I can do in my approach to get this accepted 👍
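
The concern raised in this exchange can be pictured with a simplified model. `Video` and `CourseVideo` are the names used in the thread, but the fields below are hypothetical and far simpler than the real edx-val models; the point is only that a transcript stored on the shared `Video` changes for every course that references it.

```python
from dataclasses import dataclass, field


@dataclass
class Video:
    edx_video_id: str
    # language_code -> transcript content (hypothetical, simplified field)
    transcripts: dict = field(default_factory=dict)


@dataclass
class CourseVideo:
    course_id: str
    video: Video


shared = Video("video-1", {"en": "original transcript"})
original_course = CourseVideo("course-v1:Org+Original+2020", shared)
imported_course = CourseVideo("course-v1:Org+Copy+2020", shared)

# Importing into the second course overwrites the transcript on the shared Video.
imported_course.video.transcripts["en"] = "edited transcript"

print(original_course.video.transcripts["en"])  # edited transcript
```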

@DawoudSheraz
Contributor

@azarembok Hi. Can you provide your thoughts on this PR? Thanks

@nizarmah
Contributor Author

nizarmah commented Dec 8, 2020

Any updates on this @DawoudSheraz ?

@DawoudSheraz
Contributor

@nizarmah Hello. Apologies for the delay. I have asked for another internal review. Once that approval is in, I will be merging and releasing this change. Thanks

@DawoudSheraz
Contributor

@nizarmah Hello. Thanks for your wait on this review. There has been some internal discussion around the changes involved in this PR. Though the change is risky, it is ok to merge behind a Django setting/waffle flag (your choice). The decision is that this change will not be enabled on edX, but the Open edX community is free to enable it on their installations.

@nizarmah
Contributor Author

@DawoudSheraz no problem 😃 Glad things worked out at the end. I really appreciate the decision to enable this behind a django settings/waffle flag ❤️ I'll need some time to do that change, since my plate is pretty full at the moment, but I'll let you know when the change is ready 👍🏼

The change should be ready by January 4, approximately. Sorry for that long delay! 🙁

@DawoudSheraz
Contributor

> @DawoudSheraz no problem 😃 Glad things worked out at the end. I really appreciate the decision to enable this behind a django settings/waffle flag ❤️ I'll need some time to do that change, since my plate is pretty full at the moment, but I'll let you know when the change is ready 👍🏼
>
> The change should be ready by January 4, approximately. Sorry for that long delay! 🙁

No worries, take your time. Let me know when this is ready. Thank You

@nizarmah
Contributor Author

nizarmah commented Dec 26, 2020

@DawoudSheraz hope you're having a nice holiday :) I just wanted to let you know that I am planning to use edx-toggles for creating a waffle flag to enable overriding existing transcripts.

Please let me know if you'd like me to take a different approach or if you'd like me to directly just use django-waffle without edx-toggles. 👍🏼


I'll hold off starting the work on the django setting/waffle flag until I have your confirmation, because the edx-toggles library is deprecating namespaces, cf edx/edx-toggles#80. That will require correcting the import statement, since the toggles.__future__ will be removed sometime soon.

I should be able to prepare the changes for the tests and the import transcript function, though.

@DawoudSheraz
Contributor

@nizarmah Hello. I am ok with edx-toggles or whatever method suits you best. I saw the PR in edx-toggles but it seems it is still in review. If that change is a breaking change, you can constrain the edx-toggles version in edx-val for the time being.

@nizarmah nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch 2 times, most recently from 5353dd4 to 02b39c3 Compare January 1, 2021 14:29
@nizarmah
Contributor Author

nizarmah commented Jan 1, 2021

@DawoudSheraz Happy new year 🙂 I didn't come empty handed haha, the changes are ready for your review 👍🏼

I also updated the sandbox, in case you'd like to test, and made staff superuser, so that you can enable/disable the Waffle Flag edxval.override_existing_imported_transcripts however you'd like.


By the way, the tox tests fail for python 3.8 and django 3.0.

This applies to all branches for the time being, so it wasn't caused by my changes. I might consider submitting a pull request to handle that upgrade during my free time, in case other changes are a higher priority for edX 👍🏼


Sorry that the changes are a lot, it's mainly due to the requirements and the tests.
I changed the tests significantly, making them data driven, that way both cases can be tested in the same method.
Let me know please if you have any objection regarding that.
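
A data-driven test of this kind might look like the sketch below, which covers both flag states in one test method. It uses `unittest`'s `subTest` to stay self-contained; the PR itself may use a different mechanism, and `resulting_format` is a hypothetical stand-in for the import under test.

```python
import unittest


def resulting_format(override_enabled):
    """Stand-in for the import under test: which transcript format survives."""
    return "sjson" if override_enabled else "srt"


class OverrideTranscriptsTest(unittest.TestCase):
    CASES = [
        # (override_enabled, expected_format)
        (False, "srt"),
        (True, "sjson"),
    ]

    def test_import_respects_flag(self):
        for override_enabled, expected_format in self.CASES:
            with self.subTest(override_enabled=override_enabled):
                self.assertEqual(resulting_format(override_enabled), expected_format)


# Run the single test case programmatically so the snippet works as a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OverrideTranscriptsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```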

I also wanted to mention that the "exit cases" for the import_transcript_from_fs might be ugly and separated, but I feel like adding those cases might be nicer than encasing everything in nested if statements.
Let me know if you'd like me to change that please.
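
The "exit cases" style mentioned here is commonly written as guard clauses with early returns. The sketch below shows the general shape only; `should_import_transcript` and its parameters are hypothetical and do not reproduce the actual import_transcript_from_fs logic.

```python
def should_import_transcript(existing_hash, new_hash, override_enabled):
    """Decide whether an incoming transcript should be written."""
    # Exit case 1: nothing exists yet, so always import.
    if existing_hash is None:
        return True
    # Exit case 2: overriding is switched off, keep the existing transcript.
    if not override_enabled:
        return False
    # Exit case 3: identical content, nothing to do.
    if existing_hash == new_hash:
        return False
    # Otherwise the new transcript differs and overriding is allowed.
    return True


print(should_import_transcript(None, "abc", False))   # True
print(should_import_transcript("abc", "def", False))  # False
print(should_import_transcript("abc", "abc", True))   # False
print(should_import_transcript("abc", "def", True))   # True
```

Each condition is checked once at the top level, which avoids the three levels of nesting an equivalent if/else pyramid would need.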

By the way, sorry about the force push when merging master.
I had the changes prepared in 7 neat commits, but part of the merge bled into the code changes, so I had to revert (to the point you had already reviewed) and then squash those 7 commits into one 🙁

@nizarmah nizarmah requested a review from DawoudSheraz January 1, 2021 17:08
Contributor

@DawoudSheraz DawoudSheraz left a comment

The changes are good. Thanks for updating the PR. Two questions/comments to address before merging.

edxval/models.py Outdated

# save the transcript file
if file_data:
    with closing(file_data.open()) as transcript_content:

Contributor

Is the open(seek) call necessary here?

Contributor Author

This has been there since #87 (click here to go to exact line)

So I was hesitant to change something that was already there.

But I changed it, as you suggested, and the tests are running fine. Also, my manual testing went fine... 🤔

Comment on lines +1818 to +1836
There are two different cases here, based on whether **overriding existing transcripts is enabled**.

If overriding existing transcripts is **disabled**:
If a transcript was already uploaded/imported for a certain video, then importing that same video with
a new transcript does not overwrite it.
In this case:
1. Import `sub_transcript_file_name`
1. Import `ext_transcript_file_name`
Since both have the same video id, then `ext_transcript_file_name` would not be imported, so the transcript
would be the `sub_transcript_file_name`.

If overriding existing transcripts is **enabled**:
If a transcript was already uploaded/imported for a certain video, then importing the same video with a new
transcript will overwrite it, if and only if the content is different.
In this case,
1. Import `sub_transcript_file_name`
1. Import `ext_transcript_file_name`
Since both have the same video id, and `ext_transcript_file_name` has different content, it would get
imported, so the transcript would be the `ext_transcript_file_name`.

Contributor

Great explanation 👍🏽

Comment on lines +25 to +38
OVERRIDE_EXISTING_IMPORTED_TRANSCRIPTS = WaffleFlag(
    waffle_name('override_existing_imported_transcripts'),
    module_name=__name__,
)

Contributor

nit: add waffle/toggle documentation as per edx-toggles specs.

@nizarmah
Contributor Author

nizarmah commented Jan 4, 2021

@DawoudSheraz I addressed your latest comments 👍🏼
Let me know if there are any other changes you'd like me to make 😃

By the way, this is one of the significant contributions I've made, so thanks for helping out with it!

Contributor

@DawoudSheraz DawoudSheraz left a comment

Please rebase and squash the commits. Thanks for your contribution.

@nizarmah nizarmah changed the title from "[SE-3520] Adds content hashing to transcript files to replace non-duplicate transcripts on import" to "[SE-3520] Adds waffle flag to enable overriding existing transcripts on import" Jan 4, 2021
@nizarmah nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch from aa7dbad to 923097b Compare January 4, 2021 13:10
@nizarmah
Contributor Author

nizarmah commented Jan 4, 2021

@DawoudSheraz I have rebased and squashed 👍🏼

Thanks a lot for your review and your time!
Have a wonderful day! 😃

@DawoudSheraz DawoudSheraz merged commit 84cedb3 into openedx:master Jan 4, 2021
@openedx-webhooks

@nizarmah 🎉 Your pull request was merged!

Please take a moment to answer a two-question survey so we can improve your experience in the future.

@DawoudSheraz
Contributor

@nizarmah Thanks for your contribution, effort, patience, and cooperation on this PR. I have drafted https://github.com/edx/edx-val/tree/1.4.5 against your changes. Once it is pushed to PyPI, the changes will become part of edx-platform with the next requirements update.

@nizarmah nizarmah deleted the nizar/overriding_existing_non_duplicate_transcripts branch January 5, 2021 03:06
@nizarmah
Contributor Author

https://github.com/edx/edx-val/blob/cd54e796e0e87c5a8a4f6a4243d67b660c1642ba/edxval/config/waffle.py#L25-L38

@robrap I believe the requested documentation changes for edxval.override_existing_imported_transcripts are included in this pull request.

Would you mind please confirming that? Or letting me know what changes would be required?

@nizarmah
Contributor Author

Actually, I notice the problem now... It is:
https://github.com/edx/edx-val/blob/cd54e796e0e87c5a8a4f6a4243d67b660c1642ba/edxval/config/waffle.py#L25

I'll create a pull request to fix that 👍🏼

Labels
merged open-source-contribution PR author is not from Axim or 2U
5 participants