Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cnnturk] Add new extractor #32671

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

[cnnturk] Add new extractor #32671

wants to merge 2 commits into from

Conversation

erendn
Copy link

@erendn erendn commented Dec 22, 2023

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

CNN Turk is one of the most popular TV channels in Turkey. Their website contains recordings of their livestreams dating as early as 2007. I wanted to add youtube-dl support for this website.

Copy link
Contributor

@dirkf dirkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for your work! And apologies that it's been in the stack so long.

I don't think this site has a corresponding Support Request issue, so please create one, or, maybe easier, paste the completed template that you would have submitted as a comment here.

I've made some suggestions and also applied one to make the tests more compact. The main point of that was to trigger the CI tests, which seemed to have been turned off for the PR. As to the actual change, apart from layout, it's tricky to test description fields, which may be quite long. If you use the MD5 of the value, the test value becomes short but you can't tell how the value has changed if the test starts failing; so I prefer a regular expression test.

Comment on lines +8 to +20
_VALID_URL = r'''(?x)
https?://
(?:www\.)?cnnturk\.com/
(?:
tv-cnn-turk/programlar/|
video/|
turkiye/|
dunya/|
ekonomi/
)
(?:[^/]+/)*
(?P<id>[^/?#&]+)
'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferable layout:

Suggested change
_VALID_URL = r'''(?x)
https?://
(?:www\.)?cnnturk\.com/
(?:
tv-cnn-turk/programlar/|
video/|
turkiye/|
dunya/|
ekonomi/
)
(?:[^/]+/)*
(?P<id>[^/?#&]+)
'''
_VALID_URL = r'''(?x)
https?://(?:www\.)?cnnturk\.com/
(?:
tv-cnn-turk/programlar|
video|turkiye|dunya|ekonomi
)/
(?:[^/]+/)*(?P<id>[^/?#&]+)
'''

video_url = video_info['MediaFiles'][0]['Path']
if not video_url.startswith("http"):
video_url = 'https://cnnvod.duhnet.tv/' + video_url
extension = 'mp4' if video_url.endswith('mp4') else 'm3u8'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use utils.determine_ext():

Suggested change
extension = 'mp4' if video_url.endswith('mp4') else 'm3u8'
extension = determine_ext(video_url)

Comment on lines +82 to +83
if not video_url.startswith("http"):
video_url = 'https://cnnvod.duhnet.tv/' + video_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use utils.urljoin():

Suggested change
if not video_url.startswith("http"):
video_url = 'https://cnnvod.duhnet.tv/' + video_url
video_url = urljoin('https://cnnvod.duhnet.tv/', video_url)

Comment on lines +75 to +79
# Video info is a JSON object inside a script tag
video_info = self._parse_json(
self._search_regex(
r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2],
video_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe they'll send the JS in a different order. I think a pattern like
r'script[^>]*> [\w\s=]\(\s*function\s*\([\w\s]+\)\s*\{[^})]*\}\s*\)\s*\(\s*(\{.+?\})\s*\)[;\s]*</script' would be safer.

Once #32725 is merged, you can use _search_json(), which tries to get the JSON even if the search pattern matches extra text after the wanted JSON.

r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2],
video_id)

video_url = video_info['MediaFiles'][0]['Path']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a media_files too, with the same list when I checked. Using utils.traverse_obj() you can get the URL from either, in case one or the other is missing:

Suggested change
video_url = video_info['MediaFiles'][0]['Path']
video_url = traverse_obj(video_info, (
('MediaFiles', 'media_files), 0, 'Path'), get_all=False)

Comment on lines +85 to +89
formats = [{
'url': video_url,
'ext': extension,
'language': 'tr',
}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can extract the formats from a m3u8 manifest with this pattern:

Suggested change
formats = [{
'url': video_url,
'ext': extension,
'language': 'tr',
}]
if extension = 'm3u8':
formats = self._extract_m3u8_formats(
video_url, video_id, ext='mp4',
entry_protocol='m3u8_native', # this is probably OK but needs to be tested
fatal=False)
else:
formats = [{
'url': video_url,
'ext': extension,
}]
for f in formats:
f.setdefault('language', 'tr')

}]

return {
'id': video_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be display_id and video_id should be the UUID that is value of video_data['_Id'] and is also the value of the id and data-id attributes in a <div> with class player-container. Don 't you think?

Comment on lines +96 to +97
'release_date': video_info['published_date'],
'upload_date': video_info['created_date'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are optional so the code shouldn't crash if the respective field is missing. Also,
there is a time field too that can be used. Define an inline function using utils.parse_iso8601() (ISO 8601 dates are like yyyy-mm-ddThh:mm:ss, roughly):

        
        def get_datetime(v, when):
            dt = v.get(when + '_date', '').strip()
            if not dt.isdigit() or len(dt) < 8:  # year 10000 bug!
                return
            dt = '-'.join((dt[:4], dt[4:6], dt[6:]))
            return parse_iso8601(
                'T'.join((dt, v.get(when + '_time', '0:00:00'))))

Then:

Suggested change
'release_date': video_info['published_date'],
'upload_date': video_info['created_date'],
'release_date': get_datetime(video_info, 'published')
'upload_date': get_datetime(video_info, 'created_date')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants