
Add MegaCartoons extractor #30952

Open · wants to merge 8 commits into base: master

Conversation

@Daenges commented May 16, 2022

Please follow the guide below

  • You will be asked some questions; please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like this: [x])
  • Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

MegaCartoons hosts many cartoon series, such as SpongeBob or Ben10. As far as I can tell, the site has no copyright problems. This extractor extracts the video URL of single episodes.

@dirkf (Contributor) left a comment

Thanks for your work!

Just a few points to consider.

    'title': 'Help Wanted',
    'ext': 'mp4',
    'thumbnail': r're:^https?://.*\.jpg$',
    'description': 'Help Wanted: Encouraged by his best friend, Patrick Starfish, '

Preferably for such a long field use 'md5:...'
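
For reference, the 'md5:...' form is compared by the test harness against the MD5 hex digest of the extracted field, so the expected value can be generated once from the full text. A minimal sketch (the full description string is elided here):

    import hashlib

    description = 'Help Wanted: Encouraged by his best friend, Patrick Starfish, ...'  # full expected text
    print('md5:' + hashlib.md5(description.encode('utf-8')).hexdigest())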

    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

    # Every video has a short summary -> save it as description
    video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage)

  • If there may be a newline in the 'videodescription', the s (dotall) flag is needed.
  • The regex as proposed will match everything between the first <p> and the last </p> (in the test video there's only one <p> element, but that's not robust).
  • A named group isn't really required.
Suggested change
    video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage)

    article = self._search_regex(
        r'(?s)<article\b[^>]*?\bclass\s*=\s*[^>]*?\bpost\b[^>]*>(.+?)</article\b',
        webpage, 'post', default='')
    video_description = (
        self._html_search_regex(r'(?s)<p>\s*([^<]+)\s*</p>', article, 'videodescription', fatal=False)
        or self._og_search_description(webpage))

The suggestion (untested) tries to get the <article> with class post. Then the first <p> element in the article's innerHTML is selected, with the description being its stripped text. An alternative could be to find all the <p> elements in the page and return the one that matches the start of the ld+json or og:description text.
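
For illustration, that second alternative might look something like this (an untested sketch; it assumes re is imported at module top, and the 40-character prefix comparison is an arbitrary heuristic):

    og_desc = self._og_search_description(webpage, default='') or ''
    # Pick the first <p> whose text lines up with the start of og:description
    video_description = next(
        (p.strip() for p in re.findall(r'(?s)<p>\s*([^<]+)\s*</p>', webpage)
         if og_desc and p.strip().startswith(og_desc[:40])),
        None) or og_desc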

youtube_dl/extractor/megacartoons.py (review thread resolved)
    return {
        'id': video_id,
        'title': title,
        'format': video_type,

This will normally be set automatically. You could apply mimetype2ext() from utils.py to the extracted video_type to get mp4 (in the test video), but that's just what yt-dl should set as format anyway; and if you rely on ld+json, video_type won't be available.
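
For illustration, a sketch of what that conversion would look like in the extractor (not needed if the field is dropped):

    from ..utils import mimetype2ext

    ext = mimetype2ext(video_type)  # e.g. 'video/mp4' -> 'mp4'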

Comment on lines 39 to 41
    video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

If this is retained, use url_or_none() from utils.py to condition the values:

Suggested change
    video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

    video_url = url_or_none(url_json.get('sources')[0].get('src')) or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_or_none(url_json.get('splash')) or self._og_search_thumbnail(webpage)  # Get the thumbnail

(and from ..utils import url_or_none at the top).
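
For context, url_or_none() passes a plausible URL through unchanged and returns None for anything else, so the `or` fallback takes over on bad data:

    from youtube_dl.utils import url_or_none

    url_or_none('https://example.com/video.mp4')  # -> 'https://example.com/video.mp4'
    url_or_none('not a url')                      # -> None
    url_or_none(None)                             # -> None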

@Daenges (Author) commented May 21, 2022

@dirkf Thank you for your feedback. I implemented every suggestion aside from self._search_json_ld(webpage, video_id, default={}). I do not quite get how this method works, and as there is no direct documentation, it is pretty difficult to find out.
When I try that line, the info variable is an empty dict.

@dirkf (Contributor) commented May 21, 2022

You may need to add expected_type='VideoObject' to the params.

@Daenges (Author) commented May 21, 2022

You may need to add expected_type='VideoObject' to the params.

With this:

    def _real_extract(self, url):
        # ID is equal to the episode name
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)

        info = self._search_json_ld(webpage, video_id, expected_type='VideoObject', default={})

        raise Exception(json.dumps(info))  # debug only; needs "import json" at the top of the file

I am still getting this:

[MegaCartoons] help-wanted: Downloading webpage
E
======================================================================
ERROR: test_MegaCartoons (__main__.TestDownload):
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/daenges/Git/youtube-dl/test/test_download.py", line 158, in test_template
    res_dict = ydl.extract_info(
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 808, in extract_info
    return self.__extract_info(url, ie, download, extra_info, process)
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/megacartoons.py", line 31, in _real_extract
    raise Exception(json.dumps(info))
Exception: {}

----------------------------------------------------------------------
Ran 1 test in 1.155s

FAILED (errors=1)

Commits added:
  • Verify description through md5
  • Implement robust detection of description
  • Remove format attribute to allow auto detection
  • Allow conditioning of URLs
@Daenges (Author) commented May 22, 2022

@dirkf do you have any further suggestions to get self._search_json_ld() to work?

@dirkf (Contributor) commented May 22, 2022

The ld+json structure uses @graph, which isn't supported yet. The @context from the top level is meant to be used as a default in the child nodes, but json_ld() currently just skips nodes with no @context.

Although a few tweaks to the parser fix this, I'd rather back-port the fancier yt-dlp code, which looks like it should already handle this and some other advanced structures, in due course.

Meanwhile let's just specialise _search_json_ld() in this extractor for now, like the attached.

megacartoons.py.txt
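
For readers without the attachment, a rough, untested sketch of the kind of specialisation meant here (the regex and the flattening details are assumptions, not the attached code): flatten @graph and copy the top-level @context into each node before handing off to the generic parser.

    def _search_json_ld(self, html, video_id, expected_type=None, **kwargs):
        # Assumption: a single ld+json block whose top level holds
        # @context plus an @graph list; json_ld() skips nodes without
        # @context, so propagate it into each child before parsing.
        json_ld = self._parse_json(
            self._search_regex(
                r'(?s)<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<json>.+?)</script>',
                html, 'JSON-LD', group='json', default='{}'),
            video_id, fatal=False) or {}
        context = json_ld.get('@context')
        graph = json_ld.get('@graph') or [json_ld]
        for node in graph:
            if context and isinstance(node, dict):
                node.setdefault('@context', context)
        return self._json_ld(
            graph, video_id, fatal=False, expected_type=expected_type) or {}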

@Daenges (Author) commented May 23, 2022

@dirkf Sooo... I copy-pasted your code and the test passes. If you have no further suggestions, this PR is ready for merge.

Did not contribute that much in the end. ._.
But tbh it is rather difficult without real documentation. :/

@mohit83k
It's a very interesting pull request. Hope you can merge it soon.

@dirkf marked this pull request as draft July 22, 2022 14:08
@dirkf marked this pull request as ready for review July 22, 2022 14:09
@dirkf (Contributor) left a comment


Let's see if I can force the CI test.

Four review threads on youtube_dl/extractor/megacartoons.py (two now outdated) were resolved.
@dirkf self-requested a review July 22, 2022 14:50
@CHJ85 commented Jul 5, 2023

You forgot to add support for numbers and uppercase letters, preventing a handful of episodes from working:
    'https?://(?:www.)?megacartoons.net/(?P<id>[a-zA-Z0-9-]+)/'

@dirkf (Contributor) commented Jul 5, 2023

Can you suggest test URLs?

@CHJ85 commented Jul 5, 2023

@dirkf This url, for one: https://www.megacartoons.net/1000-years-of-courage/
or https://www.megacartoons.net/911-2/
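
For concreteness, the broadened pattern with the dots escaped and the id group restored would be something like:

    _VALID_URL = r'https?://(?:www\.)?megacartoons\.net/(?P<id>[a-zA-Z0-9-]+)/'
    # now matches e.g.:
    #   https://www.megacartoons.net/1000-years-of-courage/
    #   https://www.megacartoons.net/911-2/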
