-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SE-3520] Adds waffle flag to enable overriding existing transcripts on import #268
[SE-3520] Adds waffle flag to enable overriding existing transcripts on import #268
Conversation
Thanks for the pull request, @nizarmah! I've created OSPR-5117 to keep track of it in JIRA, where we prioritize reviews. Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. |
@nizarmah Thank you for your contribution. Please let me know once it is ready for our review. |
👍 🎉
|
@natabene this is ready for edX's review 👍 😄 |
@kashifch Could you please review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments. There can be some delay with this PR while some internal evaluation is done around this change. Thank You
edxval/utils.py
Outdated
def is_duplicate_file(uploaded_file, file_hash): | ||
""" | ||
Checks file hash to know if its a duplicate file | ||
|
||
Arguments: | ||
uploaded_file (UploadedFile): File which will be used for hash generation | ||
file_hash (str): SHA256 file hash | ||
|
||
Returns: | ||
if file is duplicate (boolean) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function design is a bit confusing. Either pass both arguments as hashes or files. I would prefer both args are files and the content hash is then used to check if the files are duplicate or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be addressed now. The method takes two files instead, as you requested 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DawoudSheraz I just realized that there won't be any need to change any migrations or model changes related to the transcript content hash, since we're computing the content hash when trying to import the transcript.
Accordingly, I added a commit which removes the field and its migrations completely.
If you'd like me to revert that change, please let me know.
Computing the content hash every time isn't so efficient, but it's definitely more straight forward from a code point of view.
edxval/api.py
Outdated
is_duplicate_file( | ||
new_transcript_content_file, | ||
existing_transcript.transcript_content_hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will this be backward compatible(given default value is '')?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the hash is empty, then the file would not be treated as a duplicate, and it would be uploaded.
Would you prefer that I apply the generate_file_content_hash
on each existing uploaded file instead? The migration would take longer but it would be backwards compatible.
I'll add a commit for that once I target the other comments, that way if it isn't needed, we can just revert it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added backwards compatibility now 👍 sorry for not doing that earlier
edxval/models.py
Outdated
video_transcript.save() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this save here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a line I changed and it went unnoticed by me, sorry about that.
edxval/tests/test_api.py
Outdated
@@ -1770,7 +1770,7 @@ def test_multiple_external_transcripts_for_language(self): | |||
|
|||
# Verify transcript record is created with correct data i.e sub field transcript. | |||
expected_transcripts = [ | |||
dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SRT, video_id=edx_video_id, language_code='en') | |||
dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SJSON, video_id=edx_video_id, language_code='en') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the behavior changed. Here's a more detailed explanation:
First, we need to identify that:
sub_transcript_file_name
is an srt video transcriptext_transcript_file_name
is an sjson video transcript
# Import xml with empty edx video id.
edx_video_id = api.import_from_xml(
etree.fromstring('<video_asset/>'),
'',
self.file_system,
constants.EXPORT_IMPORT_STATIC_DIR,
{
'en': [sub_transcript_file_name, ext_transcript_file_name]
}
)
If we take a look at the import segment of code, shown above, we can see that for the same language_code was given two different files/transcripts.
Previously, if a transcript was already uploaded/imported for a certain video, then importing that same video with a new transcript did not overwrite it. So, the initially imported sub_transcript_file_name
transcript does not get replaced, and thus the file remains an srt transcript.
Currently, if a transcript was already uploaded/imported for a certain video, then importing the same video with a new transcript will overwrite it if the content is different. Since the sub_transcript_file_name
and the ext_transcript_file_name
transcripts have different file contents, then when the ext_transcript_file_name
transcript was uploaded, the transcript got replaced and became an sjson transcript.
Accordingly, we had to update the test 👍
I tried to keep it brief, so please let me know if you'd like me to go into more details 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the brief. I would say to update the test docstring to cover this change. It will be easier to follow in the future.
2f09d1c
to
b23cb0d
Compare
Thanks a lot for your review @DawoudSheraz
No worries 🙂 I totally understand Please don't hold back any requests/changes that might be needed to accept the PR, even if I have to fully rework it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, good to go. Some nits comment to address before merging. Please update edxval version in setup.py. Again, there might be a small delay in the PR merge and release. Thanks for your wait on the review.
edxval/tests/test_api.py
Outdated
@@ -1770,7 +1770,7 @@ def test_multiple_external_transcripts_for_language(self): | |||
|
|||
# Verify transcript record is created with correct data i.e sub field transcript. | |||
expected_transcripts = [ | |||
dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SRT, video_id=edx_video_id, language_code='en') | |||
dict(constants.VIDEO_TRANSCRIPT_CUSTOM_SJSON, video_id=edx_video_id, language_code='en') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the brief. I would say to update the test docstring to cover this change. It will be easier to follow in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase and update the branch. Once done, the PR will be good to merge.
setup.py
Outdated
@@ -46,7 +46,7 @@ def load_requirements(*requirements_paths): | |||
return list(requirements) | |||
|
|||
|
|||
VERSION = '1.4.3' | |||
VERSION = '1.4.4' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase your branch and update the version to 1.4.5. The latest release is 1.4.4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry about that, I missed your earlier comment about this 😳
adbfa05
to
c0b2d86
Compare
@DawoudSheraz I updated the version, as you requested. Thanks a lot for your review! I hope you have a great day! |
@nizarmah Hi. A follow-up question. How will this change handle |
@DawoudSheraz Hmm, so this change would always overwrite the video transcript if the video ID is already present and the transcript is different from the existing one. The reason behind this is the perspective I had which is: "Video Transcripts shouldn't be course specific, but specific to each Video. So if a Video's Transcript gets updated, it should get updated for every course that uses that Video." However, that's something that my perspective won't be enough on, and I'd love to know if you'd like me to make any changes to what I've done, because I definitely don't have enough context to make such a decision for such a change. Hope this answers your question, and I'm looking forward to any tweak I can do in my approach to get this accepted 👍 |
@azarembok Hi. Can you provide your thoughts on this PR? Thanks |
Any updates on this @DawoudSheraz ? |
@nizarmah Hello. Apologies for the delay. I have asked for another internal review. Once that approval is in, I will be merging and releasing this change. Thanks |
@nizarmah Hello. Thanks for your wait on this review. There has been some internal discussion around the changes involved in this PR. Though the change is risky, it is ok to merge behind a django settings/waffle flag(your choice). The decision is that this change will not be enabled on edX but openedX community is free to enable it on their installations. |
@DawoudSheraz no problem 😃 Glad things worked out at the end. I really appreciate the decision to enable this behind a django settings/waffle flag ❤️ I'll need some time to do that change, since my plate is pretty full at the moment, but I'll let you know when the change is ready 👍🏼 The change should be ready by January 4, approximately. Sorry for that long delay! 🙁 |
No worries, take your time. Let me know when this is ready. Thank You |
@DawoudSheraz hope you're having a nice holiday :) I just wanted to let you know that I am planning to use Please let me know if you'd like me to take a different approach or if you'd like me to directly just use I'll hold off starting the work on the django setting/waffle flag until I have your confirmation, because the edx-toggles library is deprecating namespaces, cf edx/edx-toggles#80. That will require correcting the import statement, since the I should be able to prepare the changes for the tests and the import transcript function, though. |
@nizarmah Hello. I am ok with edx-toggles or whatever method suits you best. I saw the PR in edx-toggles but it seems it is still in review. If that change is a breaking change, you can constraint edx-toggle version in edx-val for the time being. |
5353dd4
to
02b39c3
Compare
@DawoudSheraz Happy new year 🙂 I didn't come empty handed haha, the changes are ready for your review 👍🏼 I also updated the sandbox, in case you'd like to test, and made By the way, the tox tests fail for python 3.8 and django 3.0. This applies to all branches for the time being, so it wasn't caused by my changes. I might consider submitting a pull request to handle that upgrade, during my freetime, in case some other changes are more of a priority for edX than that one 👍🏼 Sorry that the changes are a lot, it's mainly due to the requirements and the tests. I also wanted to mention that the "exit cases" for the By the way, sorry about the force push when merging master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes are good. Thanks for updating the PR. Two questions/commits to address before merging.
edxval/models.py
Outdated
|
||
# save the transcript file | ||
if file_data: | ||
with closing(file_data.open()) as transcript_content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the open(seek) call necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been there since #87 (click here to go to exact line)
So I was hesitant to change something that was already there.
But I changed it, as you suggested, and the tests are running fine. Also, my manual testing went fine... 🤔
There are two different cases here, based on whether **overriding existing transcripts is enabled**. | ||
|
||
If overriding existing transcripts is **disabled**: | ||
If a transcript was already uploaded/imported for a certain video, then importing that same video with | ||
a new transcript does not overwrite it. | ||
In this case: | ||
1. Import `sub_transcript_file_name` | ||
1. Import `ext_transcript_file_name` | ||
Since both have the same video id, then `ext_transcript_file_name` would not be imported, so the transcript | ||
would be the `sub_transcript_file_name`. | ||
|
||
If overriding existing transcripts is **enabled**: | ||
If a transcript was already uploaded/imported for a certain video, then importing the same video with a new | ||
transcript will overwrite it, if and only if the content is different. | ||
In this case, | ||
1. Import `sub_transcript_file_name` | ||
1. Import `ext_transcript_file_name` | ||
Since both have the same video id, and `ext_transcript_file_name` has different content, it would get | ||
imported, so the transcript would be the `ext_transcript_file_name`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great explanation 👍🏽
OVERRIDE_EXISTING_IMPORTED_TRANSCRIPTS = WaffleFlag( | ||
waffle_name('override_existing_imported_transcripts'), | ||
module_name=__name__, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add waffle/toggle documentation as per edx-toggles specs.
@DawoudSheraz I addressed your latest comments 👍🏼 By the way, this is one of the significant contributions I've made, so thanks for helping out with it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase and squash the commits. Thanks for your contribution.
aa7dbad
to
923097b
Compare
@DawoudSheraz I have rebased and squashed 👍🏼 Thanks a lot for your review and your time! |
@nizarmah 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future. |
@nizarmah Thanks for your contribution, effort, patience, and cooperation on this PR. I have drafted https://github.com/edx/edx-val/tree/1.4.5 against your changes. Once it is pushed to PyPI, the changes will become part of edx-platform with the next requirements update. |
@robrap I believe the requested documentation changes for Would you mind please confirming that? Or letting me know what changes would be required? |
Actually, I notice the problem now... It is: I'll create a pull request to fix that 👍🏼 |
This pull-request is a follow up pull request, related to this comment from a previous pull-request.
Every time we import a new transcript, the existing and the new transcripts are compared by computing their content hash and comparing it.
This helps us identify whether the transcript uploaded is a duplicate of an already existing transcript, or if it is a different transcript that is being uploaded even though a transcript for the same video already exists.
Previously, if a video transcript for a specific video already exists, the new transcript wouldn't get uploaded. Now, if the transcript is different than the already existing one for a certain video, it does get uploaded and overrides it.
JIRA tickets: SE-3520, OSPR-5117
Discussions: GitHub comment on previous pull-request
Sandbox URL:
Installation Instructions:
Testing instructions:
staff
/edx
.edxval.override_existing_imported_transcripts
flag.make requirements
make test
Author notes and concerns:
master
.Reviewers
Settings