[kemonoparty] improve hash extraction #3531
Why not just use
diff --git a/gallery_dl/extractor/kemonoparty.py b/gallery_dl/extractor/kemonoparty.py
index 63e30841..f5b0e8bd 100644
--- a/gallery_dl/extractor/kemonoparty.py
+++ b/gallery_dl/extractor/kemonoparty.py
@@ -41,7 +41,10 @@ class KemonopartyExtractor(Extractor):
         self._find_inline = re.compile(
             r'src="(?:https?://(?:kemono|coomer)\.party)?(/inline/[^"]+'
             r'|/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{64}\.[^"]+)').findall
-        find_hash = re.compile("/[0-9a-f]{2}/[0-9a-f]{2}/([0-9a-f]{64})").match
+        find_hash = re.compile(r"/(?:"
+                               r"[0-9a-f]{2}/[0-9a-f]{2}/([0-9a-f]{64})|"
+                               r"attachments/\w+/\w+/\w+/([0-9a-f]{32}))"
+                               ).match
         generators = self._build_file_generators(self.config("files"))
         duplicates = self.config("duplicates")
         comments = self.config("comments")
@@ -88,7 +91,7 @@ class KemonopartyExtractor(Extractor):
                 match = find_hash(url)
                 if match:
-                    file["hash"] = hash = match.group(1)
+                    file["hash"] = hash = match.group(1) or match.group(2)
                     if hash in hashes and not duplicates:
                         self.log.debug("Skipping %s (duplicate)", url)
                         continue
or even
diff --git a/gallery_dl/extractor/kemonoparty.py b/gallery_dl/extractor/kemonoparty.py
index 63e30841..b6105f21 100644
--- a/gallery_dl/extractor/kemonoparty.py
+++ b/gallery_dl/extractor/kemonoparty.py
@@ -41,7 +41,9 @@ class KemonopartyExtractor(Extractor):
         self._find_inline = re.compile(
             r'src="(?:https?://(?:kemono|coomer)\.party)?(/inline/[^"]+'
             r'|/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{64}\.[^"]+)').findall
-        find_hash = re.compile("/[0-9a-f]{2}/[0-9a-f]{2}/([0-9a-f]{64})").match
+        find_hash = re.compile(
+            r"/(?:[0-9a-f]{2}/[0-9a-f]{2}|attachments/\w+/\w+/\w+)"
+            r"/([0-9a-f]{32,})").match
         generators = self._build_file_generators(self.config("files"))
         duplicates = self.config("duplicates")
         comments = self.config("comments")
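For anyone who wants to try the single-group pattern from the second diff outside the extractor, here is a minimal standalone sketch. The paths and the `extract_hash` helper are made up for illustration and are not part of gallery-dl. Only the regular expression is taken from the diff.

```python
import re

# Pattern copied from the second diff: one capture group covers both
# the /xx/yy/<hash> layout and the attachments/... layout.
find_hash = re.compile(
    r"/(?:[0-9a-f]{2}/[0-9a-f]{2}|attachments/\w+/\w+/\w+)"
    r"/([0-9a-f]{32,})").match

# Hypothetical paths, invented for this example only.
sha256 = "ab" * 32  # 64 hex characters
md5 = "cd" * 16     # 32 hex characters
new_style = "/ab/ab/" + sha256 + ".png"
old_style = "/attachments/111/222/333/" + md5 + ".jpg"
no_hash = "/inline/picture.png"


def extract_hash(path):
    """Return the hex digest found in 'path', or "" if there is none."""
    match = find_hash(path)
    return match.group(1) if match else ""


print(len(extract_hash(new_style)))  # length of the SHA256-style digest
print(len(extract_hash(old_style)))  # length of the MD5-style digest
print(repr(extract_hash(no_hash)))   # empty string: nothing matched
```

Two things worth keeping in mind with this variant: `{32,}` accepts any run of 32 or more hex characters, not only exactly 32 or 64, and `.match` anchors at the start of the string, so the input has to be a path beginning with `/` rather than a full URL.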
Yeah, that's a lot simpler. Can't believe I forgot to use alternation first 🤦‍♂️
a3d75a0 to f2ac5c6
- extract MD5 hash from URLs
- extract MD5 and SHA256 hash from Discord URLs (kemono.party only)
- minor optimization (do not call 'hashes.add' when 'duplicates' is true)
- update tests accordingly

Co-authored-by: Mike Fährmann <mike_faehrmann@web.de>
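The "minor optimization" item could look roughly like the sketch below. This is not the committed code: `filter_duplicates` and the list-of-dicts input are hypothetical stand-ins for the extractor's loop, and the only point is that the `hashes` set is left untouched when `duplicates` is true.

```python
def filter_duplicates(files, duplicates=False):
    """Drop files whose hash was already seen, unless 'duplicates' is true.

    'files' is assumed to be a list of dicts with a "hash" key,
    where "" means that no hash could be extracted.
    """
    hashes = set()
    kept = []
    for file in files:
        digest = file["hash"]
        # Only track hashes when duplicates are actually filtered;
        # with duplicates enabled, the set is never touched.
        if digest and not duplicates:
            if digest in hashes:
                continue
            hashes.add(digest)
        kept.append(file)
    return kept


files = [{"hash": "a"}, {"hash": "a"}, {"hash": ""}, {"hash": ""}, {"hash": "b"}]
print(len(filter_duplicates(files)))                   # duplicates filtered
print(len(filter_duplicates(files, duplicates=True)))  # everything kept
```

Files without a hash are always kept in this sketch, since there is nothing to compare them by.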
f2ac5c6 to 20d6194
I think it's just a coincidence that the
That seems to be the case. The majority of old attachment URLs from
with the test URL being the odd one out and having an MD5 hash in it by coincidence. The same seems to be true for other subscribestar artists as well.
I discovered that URL by pure chance while I was working on #3532. What are the odds of that? 🤔
I checked my files: only 80 files of 190k have Most likely, these authors upload an artwork to a booru site first, then download it from that site via right-click and upload it to another site (patreon, fanbox, subscribestar). Moreover, the Additionally,
Yeah, that's the biggest issue. I didn't know these old-style URLs existed.
This partially reverts commit 20d6194.