[SE-3520] Adds waffle flag to enable overriding existing transcripts on import #268

nizarmah · 2020-11-08T17:04:30Z

This pull request is a follow-up to this comment on a previous pull request.

Every time we import a new transcript, we compute the content hash of both the existing transcript and the new one and compare the two.
This tells us whether the uploaded transcript is a duplicate of the existing one, or a different transcript being uploaded for a video that already has a transcript.

Previously, if a transcript already existed for a video, the new transcript wasn't uploaded. Now, if the new transcript differs from the existing one for that video, it is uploaded and overrides it.
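As a rough illustration of the behavior described above, here is a minimal sketch of the import decision. It is not the edx-val code: `content_hash` and `resolve_import` are hypothetical helpers, and transcripts are modelled as plain bytes.

```python
import hashlib
from typing import Optional


def content_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a transcript's bytes."""
    return hashlib.sha256(content).hexdigest()


def resolve_import(existing: Optional[bytes], incoming: bytes) -> str:
    """Decide what an import should do with an incoming transcript.

    Hypothetical helper: returns "create", "skip", or "override".
    """
    if existing is None:
        # No transcript yet for this video and language.
        return "create"
    if content_hash(existing) == content_hash(incoming):
        # Same content: the upload is a duplicate, keep what we have.
        return "skip"
    # Different content: replace the existing transcript.
    return "override"
```

Before this change, the third branch would also have kept the existing transcript.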

JIRA tickets: SE-3520, OSPR-5117

Discussions: GitHub comment on previous pull-request

Sandbox URL:

Installation Instructions:

Install using:

pip install --upgrade git+https://github.com/open-craft/edx-val.git@nizar/overriding_existing_non_duplicate_transcripts#egg=edxval

Apply migrations to LMS

/edx/bin/edxapp-migrate-lms

Testing instructions:

Testing using the Sandbox
1. Login to the sandbox with staff/edx.
2. Go to Django Admin -> Waffle (django-waffle) -> Flag and activate the edxval.override_existing_imported_transcripts flag.
3. Go to Studio and create a new course.
4. Import one of the courses exported and attached below.
5. Open the only unit in the course, edit the video, and download the transcript.
6. Now import the other course, please.
7. Open the only unit in the course, edit the video, and download the transcript.
8. The transcript should differ from the one downloaded in step 5.
Testing using Makefile
1. Run make requirements
2. Then run make test

Test Courses to Import

First Course

Second Course

Author notes and concerns:

The tox failures aren't caused by my changes; they occur in the Python 3.8 with Django 3.0 environment. I verified that the same problem is happening right now on master.

Reviewers

@gabor-boros
edX reviewer[s] TBD

Settings

EDXAPP_EXTRA_REQUIREMENTS:
  - name: git+https://github.com/open-craft/edx-val.git@nizar/overriding_existing_non_duplicate_transcripts#egg=edxval

openedx-webhooks · 2020-11-08T17:04:36Z

Thanks for the pull request, @nizarmah! I've created OSPR-5117 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

supporting documentation
Open edX discussion forum threads
timeline information ("this must be merged by XX date", and why that is)
partner information ("this is a course on edx.org")
any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

natabene · 2020-11-10T17:16:57Z

@nizarmah Thank you for your contribution. Please let me know once it is ready for our review.

gabor-boros · 2020-11-11T09:56:42Z

👍 🎉

I tested what's in the test instructions
I read through the code
NA I checked for accessibility issues
NA Includes documentation
NA I made sure any change in configuration variables is reflected in the corresponding client's configuration-secure repository.

nizarmah · 2020-11-11T09:59:40Z

@natabene this is ready for edX's review 👍 😄

natabene · 2020-11-16T17:58:29Z

@kashifch Could you please review this?

DawoudSheraz

Added a few comments. There can be some delay with this PR while some internal evaluation is done around this change. Thank You

DawoudSheraz · 2020-11-19T16:56:25Z

edxval/utils.py

+def is_duplicate_file(uploaded_file, file_hash):
+    """
+    Checks file hash to know if its a duplicate file
+
+    Arguments:
+        uploaded_file (UploadedFile): File which will be used for hash generation
+        file_hash (str): SHA256 file hash
+
+    Returns:
+        if file is duplicate (boolean)
+    """


The function design is a bit confusing. Either pass both arguments as hashes or files. I would prefer both args are files and the content hash is then used to check if the files are duplicate or not.

This should be addressed now. The method takes two files instead, as you requested 👍

@DawoudSheraz I just realized that there won't be any need to change any migrations or model changes related to the transcript content hash, since we're computing the content hash when trying to import the transcript.

Accordingly, I added a commit which removes the field and its migrations completely.

If you'd like me to revert that change, please let me know.

Computing the content hash every time isn't very efficient, but it's definitely more straightforward from a code point of view.
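For reference, a two-file comparison of the kind discussed in this thread could look roughly like the sketch below. The name `is_duplicate_file` comes from the diff, but the signature, the `file_content_hash` helper, and the chunk size are assumptions made for illustration; the merged code may differ.

```python
import hashlib
import io


def file_content_hash(file_obj, chunk_size=8192):
    """Hash a binary file-like object in chunks and leave it rewound."""
    digest = hashlib.sha256()
    file_obj.seek(0)
    # Read until read() returns an empty bytes object (end of file).
    for chunk in iter(lambda: file_obj.read(chunk_size), b""):
        digest.update(chunk)
    file_obj.seek(0)  # rewind so later readers see the full content
    return digest.hexdigest()


def is_duplicate_file(first_file, second_file):
    """Return True when both files have identical content."""
    return file_content_hash(first_file) == file_content_hash(second_file)


# Usage with in-memory files standing in for uploaded transcripts.
existing = io.BytesIO(b"transcript v1")
incoming = io.BytesIO(b"transcript v2")
print(is_duplicate_file(existing, incoming))
```

Hashing both files at import time is what makes a stored hash field, and its migration, unnecessary.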

DawoudSheraz · 2020-11-19T17:16:31Z

edxval/api.py

+        is_duplicate_file(
+            new_transcript_content_file,
+            existing_transcript.transcript_content_hash


How will this be backward compatible (given the default value is '')?

If the hash is empty, then the file would not be treated as a duplicate, and it would be uploaded.

Would you prefer that I apply the generate_file_content_hash on each existing uploaded file instead? The migration would take longer but it would be backwards compatible.

I'll add a commit for that once I target the other comments, that way if it isn't needed, we can just revert it.

I have added backwards compatibility now 👍 sorry for not doing that earlier

DawoudSheraz · 2020-11-19T17:19:08Z

edxval/models.py

+        video_transcript.save()
+


Why this save here?

That's a line I changed and it went unnoticed by me, sorry about that.

DawoudSheraz · 2020-11-19T17:20:26Z

edxval/tests/test_api.py

@@ -1770,7 +1770,7 @@ def test_multiple_external_transcripts_for_language(self):

        # Verify transcript record is created with correct data i.e sub field transcript.
        expected_transcripts = [
-            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SRT, video_id=edx_video_id, language_code='en')
+            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SJSON, video_id=edx_video_id, language_code='en')


Why is this changed?

Because the behavior changed. Here's a more detailed explanation:

First, we need to identify that:

sub_transcript_file_name is an srt video transcript

ext_transcript_file_name is an sjson video transcript

# Import xml with empty edx video id.
edx_video_id = api.import_from_xml(
    etree.fromstring('<video_asset/>'),
    '',
    self.file_system,
    constants.EXPORT_IMPORT_STATIC_DIR,
    {
        'en': [sub_transcript_file_name, ext_transcript_file_name]
    }
)

If we take a look at the import segment of code, shown above, we can see that the same language_code was given two different files/transcripts.

Previously, if a transcript was already uploaded/imported for a certain video, then importing that same video with a new transcript did not overwrite it. So, the initially imported sub_transcript_file_name transcript does not get replaced, and thus the file remains an srt transcript.

Currently, if a transcript was already uploaded/imported for a certain video, then importing the same video with a new transcript will overwrite it if the content is different. Since the sub_transcript_file_name and the ext_transcript_file_name transcripts have different file contents, then when the ext_transcript_file_name transcript was uploaded, the transcript got replaced and became an sjson transcript.

Accordingly, we had to update the test 👍

I tried to keep it brief, so please let me know if you'd like me to go into more details 🙂

Thanks for the brief. I would say to update the test docstring to cover this change. It will be easier to follow in the future.

nizarmah · 2020-11-20T00:13:32Z

Thanks a lot for your review @DawoudSheraz
I've addressed 3 of the comments you left with commits, and I clarified why the test was changed 👍

There can be some delay with this PR while some internal evaluation is done around this change.

No worries 🙂 I totally understand

Please don't hold back any requests/changes that might be needed to accept the PR, even if I have to fully rework it

DawoudSheraz

Overall, good to go. Some nit comments to address before merging. Please update the edxval version in setup.py. Again, there might be a small delay in the PR merge and release. Thanks for your wait on the review.

edxval/utils.py

edxval/tests/test_utils.py

DawoudSheraz · 2020-11-24T13:34:27Z

edxval/tests/test_api.py

@@ -1770,7 +1770,7 @@ def test_multiple_external_transcripts_for_language(self):

        # Verify transcript record is created with correct data i.e sub field transcript.
        expected_transcripts = [
-            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SRT, video_id=edx_video_id, language_code='en')
+            dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SJSON, video_id=edx_video_id, language_code='en')


Thanks for the brief. I would say to update the test docstring to cover this change. It will be easier to follow in the future.

edxval/tests/test_api.py

DawoudSheraz

Please rebase and update the branch. Once done, the PR will be good to merge.

DawoudSheraz · 2020-11-26T10:17:35Z

setup.py

@@ -46,7 +46,7 @@ def load_requirements(*requirements_paths):
    return list(requirements)


-VERSION = '1.4.3'
+VERSION = '1.4.4'


Please rebase your branch and update the version to 1.4.5. The latest release is 1.4.4

Yes, sorry about that, I missed your earlier comment about this 😳

nizarmah · 2020-11-26T10:31:00Z

@DawoudSheraz I updated the version, as you requested.

Thanks a lot for your review! I hope you have a great day!

DawoudSheraz · 2020-11-26T15:32:21Z

@nizarmah Hi. A follow-up question. How will this change handle the rule "If video ID is already present in edxval then do not overwrite any video transcripts"? The transcript is associated with Video and not CourseVideo, so the changes will overwrite the transcript for the video in the original course too.

nizarmah · 2020-11-26T15:49:02Z

@DawoudSheraz Hmm, so this change would always overwrite the video transcript if the video ID is already present and the transcript is different from the existing one.
So yes, as you mentioned, this change will overwrite the transcript for all courses that use that video ID.

The reason behind this is the perspective I had which is: "Video Transcripts shouldn't be course specific, but specific to each Video. So if a Video's Transcript gets updated, it should get updated for every course that uses that Video."
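To make the concern concrete, here is a toy model of the relationship being discussed. `Video` and `CourseVideo` are named in the thread, but the fields and the `visible_transcript` helper below are assumptions for illustration, not the edx-val models.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Video:
    """A video that owns its transcripts, keyed by language code."""
    edx_video_id: str
    transcripts: Dict[str, str] = field(default_factory=dict)


@dataclass
class CourseVideo:
    """Links a course to a video; it holds no transcripts of its own."""
    course_id: str
    video: Video


def visible_transcript(course_video: CourseVideo, language_code: str) -> Optional[str]:
    """Return the transcript a course sees for the given language."""
    return course_video.video.transcripts.get(language_code)


# Two courses share one video, so one overwrite is visible in both.
shared = Video("video-1", {"en": "original transcript"})
original_course = CourseVideo("course-A", shared)
imported_course = CourseVideo("course-B", shared)
shared.transcripts["en"] = "transcript from the imported course"
print(visible_transcript(original_course, "en"))
```

Because the transcript hangs off the video, an import into one course changes what the other course sees.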

However, that's something that my perspective won't be enough on, and I'd love to know if you'd like me to make any changes to what I've done, because I definitely don't have enough context to make such a decision for such a change.

Hope this answers your question, and I'm looking forward to any tweak I can do in my approach to get this accepted 👍

DawoudSheraz · 2020-11-27T05:30:08Z

@azarembok Hi. Can you provide your thoughts on this PR? Thanks

nizarmah · 2020-12-08T10:23:31Z

Any updates on this @DawoudSheraz ?

DawoudSheraz · 2020-12-14T06:22:13Z

@nizarmah Hello. Apologies for the delay. I have asked for another internal review. Once that approval is in, I will be merging and releasing this change. Thanks

DawoudSheraz · 2020-12-21T10:00:43Z

@nizarmah Hello. Thanks for your wait on this review. There has been some internal discussion around the changes involved in this PR. Though the change is risky, it is ok to merge behind a Django setting or waffle flag (your choice). The decision is that this change will not be enabled on edX, but the Open edX community is free to enable it on their installations.

nizarmah · 2020-12-21T12:31:59Z

@DawoudSheraz no problem 😃 Glad things worked out at the end. I really appreciate the decision to enable this behind a django settings/waffle flag ❤️ I'll need some time to do that change, since my plate is pretty full at the moment, but I'll let you know when the change is ready 👍🏼

The change should be ready by January 4, approximately. Sorry for that long delay! 🙁

DawoudSheraz · 2020-12-21T12:35:25Z

No worries, take your time. Let me know when this is ready. Thank You

nizarmah · 2020-12-26T14:57:48Z

@DawoudSheraz hope you're having a nice holiday :) I just wanted to let you know that I am planning to use edx-toggles for creating a waffle flag to enable overriding existing transcripts.

Please let me know if you'd like me to take a different approach or if you'd like me to directly just use django-waffle without edx-toggles. 👍🏼

I'll hold off starting the work on the django setting/waffle flag until I have your confirmation, because the edx-toggles library is deprecating namespaces, cf edx/edx-toggles#80. That will require correcting the import statement, since the toggles.__future__ will be removed sometime soon.

I should be able to prepare the changes for the tests and the import transcript function, though.

DawoudSheraz · 2020-12-28T05:33:30Z

@nizarmah Hello. I am ok with edx-toggles or whatever method suits you best. I saw the PR in edx-toggles but it seems it is still in review. If that change is a breaking change, you can constrain the edx-toggles version in edx-val for the time being.

nizarmah · 2021-01-01T17:07:18Z

@DawoudSheraz Happy new year 🙂 I didn't come empty handed haha, the changes are ready for your review 👍🏼

I also updated the sandbox, in case you'd like to test, and made staff superuser, so that you can enable/disable the Waffle Flag edxval.override_existing_imported_transcripts however you'd like.

By the way, the tox tests fail for python 3.8 and django 3.0.

This applies to all branches for the time being, so it wasn't caused by my changes. I might consider submitting a pull request to handle that upgrade, during my free time, in case some other changes are more of a priority for edX than that one 👍🏼

Sorry that the changes are a lot, it's mainly due to the requirements and the tests.
I changed the tests significantly, making them data driven, that way both cases can be tested in the same method.
Let me know please if you have any objection regarding that.
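A data-driven test covering both flag states might look roughly like this. It uses only the standard library (`unittest` with `subTest`); the real tests may use a different mechanism, and `import_for_language` is a toy stand-in for the import, not the edx-val API.

```python
import unittest

SRT = "srt"
SJSON = "sjson"


def import_for_language(files, override_enabled):
    """Toy import: `files` is a list of (format, content) pairs for one language.

    Returns the format of the transcript that ends up stored.
    """
    current = None
    for file_format, content in files:
        if current is None:
            current = (file_format, content)  # first transcript is always stored
        elif override_enabled and content != current[1]:
            current = (file_format, content)  # override only when content differs
    return current[0] if current else None


class ImportTranscriptsTest(unittest.TestCase):
    # Each case: (override flag state, expected stored format).
    CASES = [
        (False, SRT),
        (True, SJSON),
    ]

    def test_multiple_transcripts_for_language(self):
        files = [(SRT, "sub content"), (SJSON, "ext content")]
        for override_enabled, expected_format in self.CASES:
            with self.subTest(override_enabled=override_enabled):
                self.assertEqual(
                    import_for_language(files, override_enabled),
                    expected_format,
                )
```

Saved as a module, this can be run with `python -m unittest`; each flag state is reported as its own sub-test.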

I also wanted to mention that the "exit cases" for import_transcript_from_fs might look ugly and scattered, but I feel that adding those cases is nicer than encasing everything in nested if statements.
Let me know if you'd like me to change that please.
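For readers unfamiliar with the pattern, the "exit cases" style looks roughly like the sketch below. This is not the body of import_transcript_from_fs; `import_decision`, its parameters, and its return values are hypothetical.

```python
from typing import Optional


def import_decision(file_exists: bool, existing: Optional[bytes],
                    incoming: bytes, override_enabled: bool) -> str:
    """Return the outcome of an import, using one early return per exit case."""
    # Exit case: the transcript file is not in the file system.
    if not file_exists:
        return "missing-file"
    # Exit case: nothing stored yet, so the transcript is simply created.
    if existing is None:
        return "created"
    # Exit case: overriding is disabled, so the stored transcript wins.
    if not override_enabled:
        return "kept-existing"
    # Exit case: the incoming transcript is a duplicate.
    if existing == incoming:
        return "kept-existing"
    # Only the override path remains.
    return "overridden"
```

Each condition is checked once at the top level, so the override path at the end is not buried under several levels of indentation.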

By the way, sorry about the force push when merging master.
I had the changes prepared in 7 neat commits, but part of the merge bled into the code changes, so I had to revert (back to the point you had already reviewed), and then squash those 7 commits into one 🙁

DawoudSheraz

The changes are good. Thanks for updating the PR. Two questions/comments to address before merging.

DawoudSheraz · 2021-01-04T07:52:29Z

edxval/models.py

+
+        # save the transcript file
+        if file_data:
+            with closing(file_data.open()) as transcript_content:


Is the open(seek) call necessary here?

This has been there since #87 (click here to go to exact line)

So I was hesitant to change something that was already there.

But I changed it, as you suggested, and the tests are running fine. Also, my manual testing went fine... 🤔
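Whether a rewind matters at this point depends on where the read pointer is when the content is saved. The sketch below shows the general pitfall with a plain in-memory stream; it is ordinary Python rather than Django's File API, and `hash_then_read` is a hypothetical helper.

```python
import hashlib
import io
from contextlib import closing


def hash_then_read(file_obj, rewind):
    """Hash the stream, optionally rewind it, and return what a later read sees."""
    hashlib.sha256(file_obj.read()).hexdigest()  # reading consumes the stream
    if rewind:
        file_obj.seek(0)  # move the pointer back to the start
    return file_obj.read()


# closing() guarantees the stream is closed when the block exits.
with closing(io.BytesIO(b"transcript body")) as stream:
    without_rewind = hash_then_read(stream, rewind=False)

with closing(io.BytesIO(b"transcript body")) as stream:
    with_rewind = hash_then_read(stream, rewind=True)

print(len(without_rewind), len(with_rewind))
```

If nothing has read the file before it is saved, the pointer is already at the start, which would be consistent with the tests passing after the call was removed.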

DawoudSheraz · 2021-01-04T07:55:22Z

edxval/tests/test_api.py

+        There are two different cases here, based on whether **overriding existing transcripts is enabled**.
+
+        If overriding existing transcripts is **disabled**:
+            If a transcript was already uploaded/imported for a certain video, then importing that same video with
+            a new transcript does not overwrite it.
+            In this case:
+                1. Import `sub_transcript_file_name`
+                1. Import `ext_transcript_file_name`
+            Since both have the same video id, then `ext_transcript_file_name` would not be imported, so the transcript
+            would be the `sub_transcript_file_name`.
+
+        If overriding existing transcripts is **enabled**:
+            If a transcript was already uploaded/imported for a certain video, then importing the same video with a new
+            transcript will overwrite it, if and only if the content is different.
+            In this case,
+                1. Import `sub_transcript_file_name`
+                1. Import `ext_transcript_file_name`
+            Since both have the same video id, and `ext_transcript_file_name` has different content, it would get
+            imported, so the transcript would be the `ext_transcript_file_name`.


Great explanation 👍🏽

DawoudSheraz · 2021-01-04T07:58:51Z

edxval/config/waffle.py

+OVERRIDE_EXISTING_IMPORTED_TRANSCRIPTS = WaffleFlag(
+    waffle_name('override_existing_imported_transcripts'),
+    module_name=__name__,
+)


nit: add waffle/toggle documentation as per edx-toggles specs.
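As a sketch of what such documentation might look like: edx-toggles documents toggles with structured comments placed directly above the definition. The annotation field names below are written from memory of the edx-toggles conventions and should be checked against its current documentation; the description text and the creation date are placeholders. The `WaffleFlag` class and `waffle_name` body here are stand-ins so the snippet runs without edx-toggles installed.

```python
class WaffleFlag:
    """Hypothetical stand-in for the edx-toggles WaffleFlag class."""

    def __init__(self, name, module_name=None):
        self.name = name
        self.module_name = module_name


WAFFLE_NAMESPACE = "edxval"  # assumption: namespace taken from the flag name


def waffle_name(toggle_name):
    """Prefix a toggle name with the namespace (assumed implementation)."""
    return "{}.{}".format(WAFFLE_NAMESPACE, toggle_name)


# .. toggle_name: edxval.override_existing_imported_transcripts
# .. toggle_implementation: WaffleFlag
# .. toggle_default: False
# .. toggle_description: Enables overriding existing transcripts on import
#   when the imported transcript's content differs. (placeholder wording)
# .. toggle_use_cases: open_edx
# .. toggle_creation_date: 2021-01-04
OVERRIDE_EXISTING_IMPORTED_TRANSCRIPTS = WaffleFlag(
    waffle_name('override_existing_imported_transcripts'),
    module_name=__name__,
)
```

The annotations are plain comments, so they have no runtime effect; tooling reads them to generate toggle documentation.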

nizarmah · 2021-01-04T10:18:21Z

@DawoudSheraz I addressed your latest comments 👍🏼
Let me know if there are any other changes you'd like me to make 😃

By the way, this is one of the significant contributions I've made, so thanks for helping out with it!

DawoudSheraz

Please rebase and squash the commits. Thanks for your contribution.

nizarmah · 2021-01-04T13:14:35Z

@DawoudSheraz I have rebased and squashed 👍🏼

Thanks a lot for your review and your time!
Have a wonderful day! 😃

openedx-webhooks · 2021-01-04T13:25:48Z

@nizarmah 🎉 Your pull request was merged!

Please take a moment to answer a two-question survey so we can improve your experience in the future.

DawoudSheraz · 2021-01-04T13:28:49Z

@nizarmah Thanks for your contribution, effort, patience, and cooperation on this PR. I have drafted https://github.com/edx/edx-val/tree/1.4.5 against your changes. Once it is pushed to PyPI, the changes will become part of edx-platform with the next requirements update.

nizarmah · 2021-01-26T16:21:16Z

https://github.com/edx/edx-val/blob/cd54e796e0e87c5a8a4f6a4243d67b660c1642ba/edxval/config/waffle.py#L25-L38

@robrap I believe the requested documentation changes for edxval.override_existing_imported_transcripts are included in this pull request.

Would you mind please confirming that? Or letting me know what changes would be required?

nizarmah · 2021-01-26T23:16:04Z

Actually, I notice the problem now... It is:
https://github.com/edx/edx-val/blob/cd54e796e0e87c5a8a4f6a4243d67b660c1642ba/edxval/config/waffle.py#L25

I'll create a pull request to fix that 👍🏼

openedx-webhooks added the needs triage and open-source-contribution labels Nov 8, 2020

openedx-webhooks added the waiting on author label and removed the needs triage label Nov 10, 2020

gabor-boros approved these changes Nov 11, 2020

openedx-webhooks added the awaiting prioritization label and removed the waiting on author label Nov 16, 2020

DawoudSheraz reviewed Nov 19, 2020

nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch 4 times, most recently from 2f09d1c to b23cb0d Compare November 19, 2020 23:59

nizarmah requested a review from DawoudSheraz November 21, 2020 20:40

DawoudSheraz approved these changes Nov 24, 2020

DawoudSheraz approved these changes Nov 26, 2020

nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch from adbfa05 to c0b2d86 Compare November 26, 2020 10:29

DawoudSheraz requested a review from azarembok November 27, 2020 05:29

openedx-webhooks added engineering review and removed awaiting prioritization labels Nov 30, 2020

openedx-webhooks added changes requested and removed engineering review labels Dec 22, 2020

nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch 2 times, most recently from 5353dd4 to 02b39c3 Compare January 1, 2021 14:29

nizarmah requested a review from DawoudSheraz January 1, 2021 17:08

DawoudSheraz reviewed Jan 4, 2021

DawoudSheraz approved these changes Jan 4, 2021

Adds waffle flag to enable overriding existing transcripts on import

923097b

nizarmah changed the title ~~[SE-3520] Adds content hashing to transcript files to replace non-duplicate transcripts on import~~ [SE-3520] Adds waffle flag to enable overriding existing transcripts on import Jan 4, 2021

nizarmah force-pushed the nizar/overriding_existing_non_duplicate_transcripts branch from aa7dbad to 923097b Compare January 4, 2021 13:10

DawoudSheraz merged commit 84cedb3 into openedx:master Jan 4, 2021

openedx-webhooks added merged and removed changes requested labels Jan 4, 2021

nizarmah deleted the nizar/overriding_existing_non_duplicate_transcripts branch January 5, 2021 03:06

nizarmah mentioned this pull request Jan 26, 2021

[TSD] fixes toggle annotation name #279

Merged
