[2chen] add '.club' support #3406

enduser420 · 2022-12-14T13:42:24Z

No description provided.

enduser420 · 2022-12-14T13:47:49Z

should I also remove the dl parameter from the urls?

$ gdl -g https://2chen.club/tv/1
https://2chen.club/assets/images/src/1f4ff1f76672b796fd2d2e119595228b8267847f.webm?dl=teehee.webm
https://2chen.club/assets/images/src/cdc03ac39210e3127328bb4c7afb9c3643908bc8.jpg?dl=mpv-shot0008.jpg       
https://2chen.club/assets/images/src/c13d3c88a492af14ad9c2b4976e3848a21330be6.webm?dl=1597803090690.webm    
https://2chen.club/assets/images/src/2a96b761e5e9370583461159902e946a8c0995ee.mp4?dl=P7ehIvVrPB3uNa0u.mp4   
https://2chen.club/assets/images/src/8718b9ca3b425c2c27e1d72b9b9ac3962c23f329.mp4?dl=6865240778499788034.mp4

mikf · 2022-12-15T11:01:40Z

I wouldn't bother as long as these URLs just "work", but you can if you want to. I don't really mind.

If you want an absolute decision from me: No, leave them as they are.
But, again, do what you feel is right.

enduser420 · 2022-12-15T12:05:54Z

lets just leave it as it is then

chenmoe · 2022-12-15T13:19:40Z

gallery-dl/gallery_dl/extractor/2chen.py

Line 46 in a074f6f

post["url"] = self.root + post["url"]

this is no longer needed

should I also remove the dl parameter from the urls?

Yes, it's better to use the url with no query params at all. Maybe post["url"] = post["url"].split("?")[0]

enduser420 · 2022-12-15T14:20:09Z

maybe its time for some refactor

diff --git a/gallery_dl/extractor/2chen.py b/gallery_dl/extractor/2chen.py
index 2cf8b305..3dcb00bd 100644
--- a/gallery_dl/extractor/2chen.py
+++ b/gallery_dl/extractor/2chen.py
@@ -17,7 +17,7 @@ class _2chenThreadExtractor(Extractor):
     directory_fmt = ("{category}", "{board}", "{thread} {title}")
     filename_fmt = "{time} {filename}.{extension}"
     archive_fmt = "{board}_{thread}_{hash}_{time}"
-    pattern = r"(?:https?://)?2chen\.(?:moe|club)/([^/?#]+)/(\d+)"
+    pattern = r"((?:https?://)?2chen\.(?:moe|club))/([^/?#]+)/(\d+)"
     test = (
         ("https://2chen.moe/tv/496715", {
             "count": ">= 179",
@@ -31,47 +31,37 @@ class _2chenThreadExtractor(Extractor):
 
     def __init__(self, match):
         Extractor.__init__(self, match)
-        self.root = text.root_from_url(match.group(0))
-        self.board, self.thread = match.groups()
+        root, self.board, self.thread = match.groups()
+        self.root = text.ensure_http_scheme(root)
 
     def items(self):
         url = "{}/{}/{}".format(self.root, self.board, self.thread)
         page = self.request(url, encoding="utf-8", notfound="thread").text
-        data = self.metadata(page)
-        yield Message.Directory, data
-        for post in self.posts(page):
-            if not post["url"]:
-                continue
-            post.update(data)
-            post["url"] = self.root + post["url"]
-            post["time"] = text.parse_int(post["date"].timestamp())
-            yield Message.Url, post["url"], text.nameext_from_url(
-                post["filename"], post)
-
-    def metadata(self, page):
         board, pos = text.extract(page, 'class="board">/', '/<')
-        title = text.extract(page, "<h3>", "</h3>", pos)[0]
-        return {
+        title, _ = text.extract(page, "<h3>", "</h3>", pos)
+        data = {
             "board" : board,
             "thread": self.thread,
-            "title" : text.unescape(title),
+            "title" : text.unescape(title)[:50],
         }
 
-    def posts(self, page):
-        """Return iterable with relevant posts"""
-        return map(self.parse, text.extract_iter(
-            page, 'class="glass media', '</article>'))
+        yield Message.Directory, data
+        for post in text.extract_iter(
+                page, 'class="glass media', '</article>'):
+            post = self.parse(post)
+            post.update(data)
+            post["time"] = text.parse_int(post["date"].timestamp())
+            yield Message.Url, post["url"], text.nameext_from_url(
+                post["filename"], post)
 
     def parse(self, post):
         extr = text.extract_from(post)
         return {
             "name"    : text.unescape(extr("<span>", "</span>")),
             "date"    : text.parse_datetime(
-                extr("<time", "<").partition(">")[2],
-                "%d %b %Y (%a) %H:%M:%S"
-            ),
+                extr("<time>", "</time>"), "%d %b %Y (%a) %H:%M:%S"),
             "no"      : extr('href="#p', '"'),
-            "url"     : extr('</a><a href="', '"'),
+            "url"     : extr('</a><a href="', '?dl='),
             "filename": text.unescape(extr('download="', '"')),
             "hash"    : extr('data-hash="', '"'),
         }
@@ -81,7 +71,7 @@ class _2chenBoardExtractor(Extractor):
     """Extractor for 2chen boards"""
     category = "2chen"
     subcategory = "board"
-    pattern = r"(?:https?://)?2chen\.(?:moe|club)/([^/?#]+)(?:/catalog|/?$)"
+    pattern = r"((?:https?://)?2chen\.(?:moe|club))/([^/?#]+)(?:/catalog|/?$)"
     test = (
         ("https://2chen.moe/co/", {
             "pattern": _2chenThreadExtractor.pattern
@@ -93,8 +83,8 @@ class _2chenBoardExtractor(Extractor):
 
     def __init__(self, match):
         Extractor.__init__(self, match)
-        self.root = text.root_from_url(match.group(0))
-        self.board = match.group(1)
+        root, self.board = match.groups()
+        self.root = text.ensure_http_scheme(root)
 
     def items(self):
         url = "{}/{}/catalog".format(self.root, self.board)

mikf · 2022-12-15T16:48:52Z

Yes, it's better to use the url with no query params at all

But why? A browser would send those query parameters as well, so what other reason except saving a few extra bytes per request would there be?

maybe its time for some refactor

The only thing I would potentially change is inherit from a common base class so category and the root_from_url part can be shared. I actually don't really like most of the changes you proposed, but that should into a different PR anyway.

+            "url"     : extr('</a><a href="', '?dl='),

What if the URLs change and ?dl= gets removed? I'd rather extract URLs in full and process them afterwards.

+    pattern = r"((?:https?://)?2chen\.(?:moe|club))/([^/?#]+)(?:/catalog|/?$)"
...
+        root, self.board = match.groups()

There's no need capture the entire URL a second time just to save one line of code.

I'm just going to merge this as is for now. Anything else should go into a separate PR.

enduser420 · 2022-12-15T17:43:39Z

don't really like most of the changes

I remove those functions to be same as the other chan extractors

[2chen] add '.club' support

a074f6f

mikf merged commit a2be06d into mikf:master Dec 15, 2022

enduser420 deleted the extractor/2chen branch December 15, 2022 17:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2chen] add '.club' support #3406

[2chen] add '.club' support #3406

enduser420 commented Dec 14, 2022

enduser420 commented Dec 14, 2022

mikf commented Dec 15, 2022

enduser420 commented Dec 15, 2022

chenmoe commented Dec 15, 2022

enduser420 commented Dec 15, 2022

mikf commented Dec 15, 2022 •

edited

Loading

enduser420 commented Dec 15, 2022

[2chen] add '.club' support #3406

[2chen] add '.club' support #3406

Conversation

enduser420 commented Dec 14, 2022

enduser420 commented Dec 14, 2022

mikf commented Dec 15, 2022

enduser420 commented Dec 15, 2022

chenmoe commented Dec 15, 2022

enduser420 commented Dec 15, 2022

mikf commented Dec 15, 2022 • edited Loading

enduser420 commented Dec 15, 2022

mikf commented Dec 15, 2022 •

edited

Loading