_extract_akamai_formats weird problem #273

Closed
5 of 6 tasks
nixxo opened this issue Dec 2, 2020 · 4 comments

Comments

@nixxo
Contributor

nixxo commented Dec 2, 2020

Checklist

  • I'm reporting a broken site support issue
  • I've verified that I'm running youtube-dlc version 2020.10.31
  • I've checked that all provided URLs are alive and playable in a browser
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar bug reports including closed ones
  • I've read bugs section in FAQ

Verbose log

Traceback (most recent call last):
  File "C:\Users\Utente\scoop\apps\python\current\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Utente\scoop\apps\python\current\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__main__.py", line 19, in <module>
    youtube_dlc.main()
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__init__.py", line 488, in main
    _real_main(argv)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__init__.py", line 478, in _real_main
    retcode = ydl.download(all_urls)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 2130, in download
    res = self.extract_info(
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 841, in extract_info
    return self.__extract_info(url, ie, download, extra_info, process, info_dict)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 849, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 870, in __extract_info
    ie_result = ie.extract(url)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\gedi.py", line 38, in _real_extract
    formats = self._extract_akamai_formats(
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\common.py", line 2645, in _extract_akamai_formats
    http_url = re.sub(
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 327, in _subx
    template = _compile_repl(template, pattern)
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 318, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "C:\Users\Utente\scoop\apps\python\current\lib\sre_parse.py", line 1036, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
  File "C:\Users\Utente\scoop\apps\python\current\lib\sre_parse.py", line 980, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
re.error: invalid group reference 11 at position 29

Description

I'm experimenting with an extractor and trying to use _extract_akamai_formats from common.py.
It takes the HLS manifest URL and rebuilds the direct HTTP URL of the MP4 file.

But some m3u8 manifests cause a problem in the re.sub call that I can't understand.

The line that triggers the problem is this one:

http_url = re.sub(REPL_REGEX, protocol + r'://%s/\1%s\3' % (http_host, qualities[i]), f['url'])

But if I reproduce the same line step by step, the code runs without problems.

reg = re.search(REPL_REGEX, f['url'])
g1 = reg.group(1)
g3 = reg.group(3)
http_url = protocol + '://%s/%s%s%s' % (http_host, g1, qualities[i], g3)

Reading the traceback, it seems to me that the problem is in the regex library. Can somebody explain it to me?

@pukkandan
Contributor

Please give an example with the URL and the values of the variables http_host and qualities[i].

Without any additional info, my guess is that one of the variables contains a \ somewhere, which the regex engine is interpreting as a group reference.
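
To illustrate the guess above: `re.sub` processes backslash sequences in a replacement string, so a `\` inside an interpolated value would be read as an escape or group reference. A minimal sketch, assuming you want a value inserted verbatim, is to pass a function as the replacement, because the string a function returns is not parsed as a template. `replace_literal` is a hypothetical helper name, not something from youtube-dlc.

```python
import re


def replace_literal(pattern, replacement, text):
    """Hypothetical helper: insert `replacement` verbatim.

    The string returned by a replacement function is used as-is,
    so backslashes and digits in it are never treated as group references.
    """
    return re.sub(pattern, lambda match: replacement, text)


value = r'a\1b'  # contains a backslash, like the suspected variables

# As a template string, \1 is expanded to the text of group 1.
print(re.sub(r'(x)', value, 'x'))

# Through a function, the value is inserted unchanged.
print(replace_literal(r'(x)', value, 'x'))
```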

@nixxo
Contributor Author

nixxo commented Dec 2, 2020

Sorry for the lack of info... here's more:

The manifest URL that works is

https://videodemand-vh.akamaihd.net/i/encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_,web_low,web_med,web_high,web_hd,.mp4.csmil/index_0_av.m3u8?null=0
and applying REPL_REGEX to it extracts the tuple

#0 tuple(3)
    [0] => str(72) "encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_"
    [1] => str(31) "web_low,web_med,web_high,web_hd"
    [2] => str(4) ".mp4"

which generates the direct MP4 URL

http://videoplatform.sky.it/encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_web_low.mp4

The manifest that causes the problem, instead, is:
https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0
which yields the tuple

#0 tuple(3)
    [0] => str(54) "repubblicatv/file/2020/09/22/731397/731397-video-rrtv-"
    [1] => str(36) "650,200,400,1200,1800,2500,3500,4500"
    [2] => str(29) "-s200922_iacoboni_salvini.mp4"

and the resulting MP4 URL is wrong:
http://media.gedidigital.it/J00-s200922_iacoboni_salvini.mp4

But, like I said, if I do the same operation one step at a time, it works. Only the condensed single-call form causes the problem.

@nixxo
Contributor Author

nixxo commented Dec 2, 2020

OK, figured out the problem.

The replacement is
r'://%s/\1%s\3' % (http_host, qualities[i])
but if qualities[i] is a number, that is a problem: it gets attached to the \1, so the template contains \1 followed by more digits (for example \11200), and the regex engine no longer reads it as a reference to group 1.
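
The collision described above can be reproduced in isolation. The sketch below builds the replacement template the same way as the failing line and applies it to the problematic manifest URL. The pattern is an assumption chosen to reproduce the three groups listed earlier in this thread (the real REPL_REGEX from common.py is not quoted here), and `try_template` is a hypothetical helper.

```python
import re

# Assumed pattern: reproduces the three groups shown earlier in this thread.
# The actual REPL_REGEX in common.py is not quoted in the issue.
REPL_REGEX = r'https?://[^/]+/i/([^,]+),([^/]+),([^/]+)\.csmil/.+'

manifest_url = (
    'https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
    '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
    '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0'
)
http_host = 'media.gedidigital.it'


def try_template(quality):
    """Hypothetical helper: build the template as in the issue and apply it."""
    template = 'http' + r'://%s/\1%s\3' % (http_host, quality)
    try:
        return template, re.sub(REPL_REGEX, template, manifest_url)
    except re.error as exc:
        # Report the failure instead of aborting, so every case is visible.
        return template, 're.error: %s' % exc


# One non-numeric quality and two numeric ones taken from the manifest.
for quality in ('web_low', '1200', '1800'):
    template, result = try_template(quality)
    print(template)
    print('  ->', result)
```

With a non-numeric quality such as `web_low`, `\1` is followed by a letter and is read as group 1. With `1200` the template contains `\11200`, and Python appears to treat `\112` as an octal character escape (the letter `J`), which would explain the `J00-s200922_iacoboni_salvini.mp4` URL reported above. With `1800` the sequence is read as a reference to group 11, which does not exist, in line with the `invalid group reference 11` error in the traceback.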

@nixxo
Contributor Author

nixxo commented Dec 2, 2020

OK, solution found: ytdl-org/youtube-dl@193422e#commitcomment-44741426

Instead of \1, use \g<1>.
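
A minimal sketch of that fix, under the same assumptions as the rest of this thread: the pattern below is assumed (it reproduces the three groups shown earlier, the real REPL_REGEX is not quoted here), and `akamai_http_url` is a hypothetical helper, not the actual method in common.py.

```python
import re

# Assumed pattern: reproduces the three groups shown earlier in this thread.
REPL_REGEX = r'https?://[^/]+/i/([^,]+),([^/]+),([^/]+)\.csmil/.+'


def akamai_http_url(manifest_url, http_host, quality, protocol='http'):
    """Hypothetical helper: rebuild the direct MP4 URL for one quality.

    \\g<1> is the explicit form of \\1, so digits that follow it
    (a numeric quality such as 1200) are not absorbed into the reference.
    """
    template = protocol + r'://%s/\g<1>%s\3' % (http_host, quality)
    return re.sub(REPL_REGEX, template, manifest_url)


manifest_url = (
    'https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
    '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
    '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0'
)

for quality in ('650', '1200', '1800'):
    print(akamai_http_url(manifest_url, 'media.gedidigital.it', quality))
```

The trailing `\3` is left as it is because nothing follows it in the template. If text starting with a digit were ever appended after it, `\g<3>` would be the safer spelling for the same reason.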

@nixxo nixxo closed this as completed Dec 2, 2020
siikamiika pushed a commit to siikamiika/yt-dlc that referenced this issue Jun 19, 2021