-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subtitle Downloading #1117
Subtitle Downloading #1117
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1117 +/- ##
==========================================
+ Coverage 68.65% 68.98% +0.33%
==========================================
Files 25 25
Lines 2249 2270 +21
Branches 441 441
==========================================
+ Hits 1544 1566 +22
+ Misses 522 521 -1
Partials 183 183
Continue to review full report at Codecov.
|
@bakshiutkarsha What is the status of this PR?! There is not WIP or draft... but nobody is assigned to review it?! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bakshiutkarsha Please secure that the the development process is respected properly. Missing the review assignement like all other kinds of thing going forgotten in the process (not putting draft, not reviewieng, forgotting to merge, forgotting to rebase properly, etc...) just slows down the whole process and creates overhead (often on my side).
Concerning the testing, the instructions were:
Here is how your test should looks like IMO: (1) you have a small wikicode for a video with subtitle (2) you send it to https://en.wikipedia.org/api/rest_v1/#/Transforms/post_transform_wikitext_to_html (3) you get the result rewrite it and verify that the track URL is proper.
So far I can see, (3) is there but not (1) and (2) which makes the whole test relatively weak. What will happen if for some reason the HTML generated by Mediawiki changes?! The answer is that it might go through unoticed!
@kelson42 I will make sure in future, these things won't affect the development timeline and the process is respected properly. |
src/util/saveArticles.ts
Outdated
let mediaDependencies: Array<{ url: string, path: string }> = []; | ||
let doc = domino.createDocument(html); | ||
const tmRet = await treatMedias(doc, mw, dump, articleId); | ||
const tmRet = await treatMedias(doc, mw, dump, articleId, downloader, zimCreator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure it's a good idea to carry downloader
and zimCreator
3 levels down into treatMedias
. Normally, we get a set of links back to upper level then handle them by iterating mediaDependencies here. This way we don't deal with zim file in treat*()
functions which follow OOP spirit. I belive the function called processArticleHtml
should know nothing about the entity like zim, so I feel there's better way to handle this ;)
Could we handle subtitles outside this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understood this correctly, I agree with point that downloader and zimcreator got dragged 3 level down, but trackEle is tightly coupled with videoEle, so IMO it makes sense to have them in a same place. But le me know if you suggest differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it does make sense to keep status quo because that would require so much changes to make the order here in processArticleHtml()
. Let's table this until we get a resources to catch up a technical debt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reopen this because I'm not happy about that either. We need to architecture this in a smarter way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the change of dragging downloader
and zimcreator
to 3 level down and handled this in different way as suggested.
9ea0be2
to
f8297c4
Compare
src/util/saveArticles.ts
Outdated
let mediaDependencies: Array<{ url: string, path: string }> = []; | ||
let doc = domino.createDocument(html); | ||
const tmRet = await treatMedias(doc, mw, dump, articleId); | ||
const tmRet = await treatMedias(doc, mw, dump, articleId, downloader, zimCreator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reopen this because I'm not happy about that either. We need to architecture this in a smarter way.
BTW, here's the fix for the bug that we're probably facing on v14 |
Nice, happy if the bug is not on our side... and happy again to see this will be part of next v14 release (my guess). |
7d71384
to
e40387f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minorities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've no more feedback here at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI must pass
27545a5
to
0e391a5
Compare
test/unit/s3.test.ts
Outdated
@@ -30,7 +30,7 @@ _test('S3 checks', async(t) => { | |||
const imageExist = await s3.downloadIfPossible('bm.wikipedia.org/static/images/project-logos/bmwiki-2x.png', 'https://bm.wikipedia.org/static/images/project-logos/bmwiki-2x.png'); | |||
t.assert(!!imageExist, 'Image exists in S3'); | |||
// Checking the data related to image matches | |||
t.equals(imageExist.headers.Metadata.etag, '"aeff-54a391a807034"', 'Etag matches'); | |||
t.equals(imageExist.headers.Metadata.etag, '"a740-5a6b0464619c2"', 'Etag matches'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really weak... and consequently has been broken only a few months after been implemented. Automated test should not have the this ETAG hardcoded. If the whole lifecycle has been tested properly you should not need to do that. Should be pretty easy to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this case, because lifecycle is getting tested.
@bakshiutkarsha It needs a few changes and the CI shoudl pass. |
a025e0a
to
5f26b54
Compare
test('treat multiple subtitles in one video', async(t) => { | ||
const { downloader, mw, dump } = await setupScrapeClasses({ format: '' }); | ||
|
||
// Wikicode is taken from article "User:Charliechlorine/sandbox" which has multiple(4) subtitles in this video |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My last comment is confusing:
- 'treat one subtitle' should test exactly which URL
- 'treat multiple subtitles in one video' should do the same
- I would add an global test, testing the whole HTML testHtmlRewritingE2e
@bakshiutkarsha Can you please rebase on |
5f26b54
to
ddb15cb
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should have one test for treatSubtitle()
, one test for treatVideo()
and one E2E test for a video wiki text using testHtmlRewritingE2e()
.
Yout current usage of testHtmlRewritingE2e()
for subtitles is useless because you test it with the value of your own code... so this is probably always going to work. A unit test should test against hardcoded values (so here HTML)
The HTML I am getting from the API is not exactly same as the online version of it. For eg. API gives me HTML starting from |
should resolve #877