[extractor/common] add generic support for akamai http format extraction
remitamine committed Nov 22, 2020
1 parent c4cabf0 commit 193422e
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions youtube_dl/extractor/common.py
@@ -2596,6 +2596,7 @@ def _media_formats(src, cur_media_type, type_info={}):

    def _extract_akamai_formats(self, manifest_url, video_id, hosts={}):
        formats = []

        hdcore_sign = 'hdcore=3.7.0'
        f4m_url = re.sub(r'(https?://[^/]+)/i/', r'\1/z/', manifest_url).replace('/master.m3u8', '/manifest.f4m')
        hds_host = hosts.get('hds')
@@ -2608,13 +2609,39 @@ def _extract_akamai_formats(self, manifest_url, video_id, hosts={}):
        for entry in f4m_formats:
            entry.update({'extra_param_to_segment_url': hdcore_sign})
        formats.extend(f4m_formats)

        m3u8_url = re.sub(r'(https?://[^/]+)/z/', r'\1/i/', manifest_url).replace('/manifest.f4m', '/master.m3u8')
        hls_host = hosts.get('hls')
        if hls_host:
            m3u8_url = re.sub(r'(https?://)[^/]+', r'\1' + hls_host, m3u8_url)
        formats.extend(self._extract_m3u8_formats(
            m3u8_url, video_id, 'mp4', 'm3u8_native',
            m3u8_id='hls', fatal=False))

        http_host = hosts.get('http')
        if http_host and 'hdnea=' not in manifest_url:
            REPL_REGEX = r'https://[^/]+/i/([^,]+),([^/]+),([^/]+).csmil/.+'
            qualities = re.match(REPL_REGEX, m3u8_url).group(2).split(',')
            qualities_length = len(qualities)
            if len(formats) in (qualities_length + 1, qualities_length * 2 + 1):
                i = 0
                http_formats = []
                for f in formats:
                    if f['protocol'] == 'm3u8_native' and f['vcodec'] != 'none':
                        for protocol in ('http', 'https'):
                            http_f = f.copy()
                            del http_f['manifest_url']
                            http_url = re.sub(
                                REPL_REGEX, protocol + r'://%s/\1%s\3' % (http_host, qualities[i]), f['url'])
                            http_f.update({
                                'format_id': http_f['format_id'].replace('hls-', protocol + '-'),
                                'url': http_url,
                                'protocol': protocol,
                            })
                            http_formats.append(http_f)
                        i += 1
                formats.extend(http_formats)

        return formats

    def _extract_wowza_formats(self, url, video_id, m3u8_entry_protocol='m3u8_native', skip_protocols=[]):

8 comments on commit 193422e


@nixxo nixxo commented on 193422e Dec 2, 2020


@remitamine

There isn't enough information (examples of the manifest URLs and the direct mp4 URLs) to test and work out what the problem is and what needs to change (either here or in your extractor).
If you can't share that information, I would suggest looking at the SkyIt extractor, where it calls this method (from my tests the code works fine), and comparing its execution with your code and the format URLs that you're testing on.


@nixxo nixxo commented on 193422e Dec 2, 2020


I saw the usage of _extract_akamai_formats in the SkyIt extractor, and that gave me the idea to use it in the extractor I've written, where I manually generate the direct mp4 URLs.

For example, a Sky manifest that works is this:
https://videodemand-vh.akamaihd.net/i/encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_,web_low,web_med,web_high,web_hd,.mp4.csmil/index_0_av.m3u8?null=0

The manifest that creates problems in the extractor I'm writing, on the other hand, is:
https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0

More detail here: blackjack4494/yt-dlc#273
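
For readers following along, here is a minimal sketch of what REPL_REGEX from the commit captures for the two manifests quoted above. The helper name split_qualities is made up for illustration and is not part of youtube-dl; only the pattern and the URLs come from this page.

```python
import re

# Pattern copied from the commit above.
REPL_REGEX = r'https://[^/]+/i/([^,]+),([^/]+),([^/]+).csmil/.+'

SKY_URL = (
    'https://videodemand-vh.akamaihd.net/i/encoded/2020/11/22/'
    '1606032590423_uomo-ucciso-da-uno-squale-in-australia_'
    ',web_low,web_med,web_high,web_hd,.mp4.csmil/index_0_av.m3u8?null=0')
GEDI_URL = (
    'https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
    '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
    '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0')


def split_qualities(url):
    """Hypothetical helper: return the quality labels captured by group 2."""
    return re.match(REPL_REGEX, url).group(2).split(',')


sky_qualities = split_qualities(SKY_URL)    # word-like labels
gedi_qualities = split_qualities(GEDI_URL)  # purely numeric labels
print(sky_qualities)
print(gedi_qualities)
```

The Sky manifest yields word-like labels, while the second manifest yields labels made only of digits, which is the difference that matters for the replacement string discussed further down.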


@nixxo nixxo commented on 193422e Dec 2, 2020


OK, figured out what the problem is: the \1%s part works only if the value substituted for %s does not start with a digit.
If the qualities array is made of bitrate values, it breaks the re.sub because the replacement no longer refers to group 1; it becomes a nonexistent group reference like \11500.

So in the example in the SkyIt extractor the qualities are words like low, med, high, etc., but in most cases the qualities are bitrate values.

A not-so-elegant solution might be to do the steps one at a time, like in the example I gave, but I'm still a Python noob, so maybe there's a more elegant/easy solution.


@nixxo nixxo commented on 193422e Dec 2, 2020


@remitamine solution found: replace \1 with \g<1>

https://stackoverflow.com/questions/5984633/python-re-sub-group-number-after-number

should I do a PR or you do the fix yourself?
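
A small sketch of the difference between the two replacement forms, using the second manifest URL from earlier in this thread. The host example.invalid and the helper build are placeholders invented for this illustration. How Python resolves the ambiguous \1 followed by digits is left open here (it may raise re.error or be read as a different escape), so the sketch only checks that the result is not the intended URL.

```python
import re

# Pattern copied from the commit above.
REPL_REGEX = r'https://[^/]+/i/([^,]+),([^/]+),([^/]+).csmil/.+'
url = ('https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
       '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
       '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0')
http_host = 'example.invalid'  # placeholder host, not a real Akamai host
quality = '1200'               # one of the numeric labels in the URL

# The direct URL we want: group 1 + quality + group 3.
expected = ('https://example.invalid/repubblicatv/file/2020/09/22/731397/'
            '731397-video-rrtv-1200-s200922_iacoboni_salvini.mp4')


def build(template):
    """Hypothetical helper: apply the template, reporting re.error as text."""
    try:
        return re.sub(REPL_REGEX, template, url)
    except re.error as exc:
        return 're.error: %s' % exc


# \1 directly followed by digits: the group number is ambiguous.
ambiguous = build(r'https://%s/\1%s\3' % (http_host, quality))
# \g<1> ends the group number explicitly, so the digits stay literal.
explicit = build(r'https://%s/\g<1>%s\3' % (http_host, quality))

print(ambiguous)
print(explicit)
```

Only the \g<1> form produces the intended direct URL for numeric labels; for word-like labels such as web_low both forms behave the same.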


@remitamine remitamine commented on 193422e Dec 2, 2020


The solution I was about to make is:

diff --git a/youtube_dl/extractor/common.py b/youtube_dl/extractor/common.py
index 16aff885c..13e996c5b 100644
--- a/youtube_dl/extractor/common.py
+++ b/youtube_dl/extractor/common.py
@@ -2620,10 +2620,10 @@ class InfoExtractor(object):
 
         http_host = hosts.get('http')
         if http_host and 'hdnea=' not in manifest_url:
-            REPL_REGEX = r'https://[^/]+/i/([^,]+),([^/]+),([^/]+).csmil/.+'
+            REPL_REGEX = r'https://[^/]+/[iz]/([^,]+),([^/]+),([^/]+).csmil/.+'
             qualities = re.match(REPL_REGEX, m3u8_url).group(2).split(',')
             qualities_length = len(qualities)
-            if len(formats) in (qualities_length + 1, qualities_length * 2 + 1):
+            if len(formats) in (qualities_length, qualities_length + 1, qualities_length * 2, qualities_length * 2 + 1):
                 i = 0
                 http_formats = []
                 for f in formats:
@@ -2632,7 +2632,8 @@ class InfoExtractor(object):
                             http_f = f.copy()
                             del http_f['manifest_url']
                             http_url = re.sub(
-                                REPL_REGEX, protocol + r'://%s/\1%s\3' % (http_host, qualities[i]), f['url'])
+                                REPL_REGEX, protocol + r'://%s/\1%%s\3' % http_host, f['url'])
+                            http_url = http_url % qualities[i]
                             http_f.update({
                                 'format_id': http_f['format_id'].replace('hls-', protocol + '-'),
                                 'url': http_url,

But I guess the solution you found on StackOverflow may be better if it works on all supported versions (sometimes Python 2.6 causes problems with features that only exist in newer versions).
The part changing REPL_REGEX is not relevant; it was just meant to make f4m URLs work as well. Actually it's not needed: I forgot that I'm using the m3u8 URL.
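
The two-step idea in the diff above can be sketched in isolation as follows: escape the placeholder as %%s so it survives the first formatting pass and the re.sub, then fill in the quality afterwards. The host example.invalid is a placeholder invented for this illustration. One assumption worth stating: the final % formatting would misbehave if the rewritten URL contained other percent signs (for example percent-encoded characters), which this example URL does not.

```python
import re

# Pattern copied from the commit above.
REPL_REGEX = r'https://[^/]+/i/([^,]+),([^/]+),([^/]+).csmil/.+'
url = ('https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
       '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
       '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0')
http_host = 'example.invalid'  # placeholder host, not a real Akamai host
protocol = 'https'

qualities = re.match(REPL_REGEX, url).group(2).split(',')

# Step 1: only the host is formatted in; '%%s' collapses to a literal '%s',
# so the template contains '\1%s\3' and \1 is followed by '%', not a digit.
template = protocol + r'://%s/\1%%s\3' % http_host
http_url = re.sub(REPL_REGEX, template, url)

# Step 2: the quality is inserted after the regex substitution is done.
http_url = http_url % qualities[3]  # qualities[3] is '1200'
print(http_url)
```

Compared with \g<1>, this variant avoids the ambiguity by never letting a digit follow the group reference, at the cost of an extra formatting pass.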

@remitamine

I guess it's fine to use \g<1>, as it's described in multiple places in the Python 2.6 re documentation.

@remitamine

> should I do a PR or you do the fix yourself?

I'm going to add the fix right now.
