
Add MegaCartoons extractor #30952

Open · wants to merge 8 commits into base: master

Conversation

@Daenges commented May 16, 2022

Please follow the guide below

  • You will be asked some questions; please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like this: [x])
  • Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

MegaCartoons hosts many cartoon series, such as SpongeBob or Ben10. As far as I can tell, the site has no copyright problems. This extractor extracts the video URL of single episodes.

@dirkf (Contributor) left a comment

Thanks for your work!

Just a few points to consider.

    'title': 'Help Wanted',
    'ext': 'mp4',
    'thumbnail': r're:^https?://.*\.jpg$',
    'description': 'Help Wanted: Encouraged by his best friend, Patrick Starfish, '

Preferably for such a long field use 'md5:...'
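
For reference, the 'md5:...' form is compared by the test harness against the MD5 hex digest of the extracted field, so the expected value can be generated once from the full text. A minimal sketch (the full description string is elided here):

    import hashlib

    description = 'Help Wanted: Encouraged by his best friend, Patrick Starfish, ...'  # full expected text
    print('md5:' + hashlib.md5(description.encode('utf-8')).hexdigest())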

    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

    # Every video has a short summary -> save it as description
    video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage)

  • If there may be a newline in the 'videodescription', the s (dotall) flag is needed.
  • The regex as proposed will match everything between the first <p> and the last </p> (in the test video there's only one <p> element, but that's not robust).
  • A named group isn't really required.
Suggested change
    video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage)

    article = self._search_regex(
        r'(?s)<article\b[^>]*?\bclass\s*=\s*[^>]*?\bpost\b[^>]*>(.+?)</article\b',
        webpage, 'post', default='')
    video_description = (
        self._html_search_regex(r'(?s)<p>\s*([^<]+)\s*</p>', article, 'videodescription', fatal=False)
        or self._og_search_description(webpage))

The suggestion (untested) tries to get the <article> with class post. Then the first <p> element in the article's innerHTML is selected, with the description being its stripped text. An alternative could be to find all the <p> elements in the page and return the one that matches the start of the ld+json or og:description text.
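
For illustration, that second alternative might look something like this (an untested sketch; it assumes re is imported at module top, and the 40-character prefix comparison is an arbitrary heuristic):

    og_desc = self._og_search_description(webpage, default='') or ''
    # Pick the first <p> whose text lines up with the start of og:description
    video_description = next(
        (p.strip() for p in re.findall(r'(?s)<p>\s*([^<]+)\s*</p>', webpage)
         if og_desc and p.strip().startswith(og_desc[:40])),
        None) or og_desc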

youtube_dl/extractor/megacartoons.py (review thread resolved)
    return {
        'id': video_id,
        'title': title,
        'format': video_type,

This will normally be set automatically. You could apply mimetype2ext() from utils.py to the extracted video_type to get mp4 (in the test video), but that's just what yt-dl should set as format anyway; and if you rely on ld+json, video_type won't be available.
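
For illustration, a sketch of what that conversion would look like in the extractor (not needed if the field is dropped):

    from ..utils import mimetype2ext

    ext = mimetype2ext(video_type)  # e.g. 'video/mp4' -> 'mp4'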

Comment on lines 39 to 41
    video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

If this is retained, use url_or_none() from utils.py to condition the values:

Suggested change
    video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage)  # Get the thumbnail

    video_url = url_or_none(url_json.get('sources')[0].get('src')) or self._og_search_video_url(webpage)  # Get the video url
    video_type = url_json.get('sources')[0].get('type')  # Get the video type -> 'video/mp4'
    video_thumbnail = url_or_none(url_json.get('splash')) or self._og_search_thumbnail(webpage)  # Get the thumbnail

(and from ..utils import url_or_none at the top).
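
For context, url_or_none() passes a plausible URL through unchanged and returns None for anything else, so the `or` fallback takes over on bad data:

    from youtube_dl.utils import url_or_none

    url_or_none('https://example.com/video.mp4')  # -> 'https://example.com/video.mp4'
    url_or_none('not a url')                      # -> None
    url_or_none(None)                             # -> None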

@Daenges (Author) commented May 21, 2022

@dirkf Thank you for your feedback. I implemented every suggestion aside from self._search_json_ld(webpage, video_id, default={}). I do not quite get how this method works, and as there is no direct documentation, it is pretty difficult to find out.
When I try that line, the info variable is an empty dict.

@dirkf (Contributor) commented May 21, 2022

You may need to add expected_type='VideoObject' to the params.

@Daenges (Author) commented May 21, 2022

You may need to add expected_type='VideoObject' to the params.

With this:

    def _real_extract(self, url):
        # ID is equal to the episode name
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)

        info = self._search_json_ld(webpage, video_id, expected_type='VideoObject', default={})

        raise Exception(json.dumps(info))  # debug only; needs "import json" at the top of the file

I am still getting this:

[MegaCartoons] help-wanted: Downloading webpage
E
======================================================================
ERROR: test_MegaCartoons (__main__.TestDownload):
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/daenges/Git/youtube-dl/test/test_download.py", line 158, in test_template
    res_dict = ydl.extract_info(
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 808, in extract_info
    return self.__extract_info(url, ie, download, extra_info, process)
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/megacartoons.py", line 31, in _real_extract
    raise Exception(json.dumps(info))
Exception: {}

----------------------------------------------------------------------
Ran 1 test in 1.155s

FAILED (errors=1)

Commits added:
  • Verify description through md5
  • Implement robust detection of description
  • Remove format attribute to allow auto detection
  • Allow conditioning of URLs
@Daenges (Author) commented May 22, 2022

@dirkf do you have any further suggestions to get self._search_json_ld() to work?

@dirkf (Contributor) commented May 22, 2022

The ld+json structure uses @graph, which isn't supported yet. The @context from the top level is meant to be used as a default in the child nodes, but json_ld() currently just skips nodes with no @context.

Although a few tweaks to the parser fix this, I'd rather back-port the fancier yt-dlp code, which looks like it should already handle this and some other advanced structures, in due course.

Meanwhile let's just specialise _search_json_ld() in this extractor for now, like the attached.

megacartoons.py.txt
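
For readers without the attachment, a rough, untested sketch of the kind of specialisation meant here (the regex and the flattening details are assumptions, not the attached code): flatten @graph and copy the top-level @context into each node before handing off to the generic parser.

    def _search_json_ld(self, html, video_id, expected_type=None, **kwargs):
        # Assumption: a single ld+json block whose top level holds
        # @context plus an @graph list; json_ld() skips nodes without
        # @context, so propagate it into each child before parsing.
        json_ld = self._parse_json(
            self._search_regex(
                r'(?s)<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<json>.+?)</script>',
                html, 'JSON-LD', group='json', default='{}'),
            video_id, fatal=False) or {}
        context = json_ld.get('@context')
        graph = json_ld.get('@graph') or [json_ld]
        for node in graph:
            if context and isinstance(node, dict):
                node.setdefault('@context', context)
        return self._json_ld(
            graph, video_id, fatal=False, expected_type=expected_type) or {}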

@Daenges (Author) commented May 23, 2022

@dirkf Sooo... I copy-pasted your code and the test passes. If you have no further suggestions, this PR is ready for merge.

Did not contribute that much in the end. ._.
But tbh it is rather difficult without real documentation. :/

@mohit83k
It's a very interesting pull request. Hope you can merge it soon.

@dirkf marked this pull request as draft July 22, 2022 14:08
@dirkf marked this pull request as ready for review July 22, 2022 14:09
@dirkf (Contributor) left a comment


Let's see if I can force the CI test.

Four review threads on youtube_dl/extractor/megacartoons.py (two now outdated) were resolved.
@dirkf self-requested a review July 22, 2022 14:50
@CHJ85 commented Jul 5, 2023

You forgot to add support for numbers and uppercase letters, preventing a handful of episodes from working:
    'https?://(?:www.)?megacartoons.net/(?P<id>[a-zA-Z0-9-]+)/'

@dirkf (Contributor) commented Jul 5, 2023

Can you suggest test URLs?

@CHJ85 commented Jul 5, 2023

@dirkf This url, for one: https://www.megacartoons.net/1000-years-of-courage/
or https://www.megacartoons.net/911-2/
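
For concreteness, the broadened pattern with the dots escaped and the id group restored would be something like:

    _VALID_URL = r'https?://(?:www\.)?megacartoons\.net/(?P<id>[a-zA-Z0-9-]+)/'
    # now matches e.g.:
    #   https://www.megacartoons.net/1000-years-of-courage/
    #   https://www.megacartoons.net/911-2/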
