[wikimedia] add 'wiki' extractor #6050

Merged: 5 commits merged on Aug 25, 2024


ClosedPort22
Contributor

This PR adds the ability to download all media files hosted on a MediaWiki instance.
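Enumerating every file on a wiki means following the MediaWiki API's continuation tokens across many requests. The following is a minimal sketch of that loop, not the code from this PR: `iter_query_pages` and `fake_fetch` are hypothetical names, the network call is replaced by a stub, and the use of `generator=allimages` with `gailimit` is an assumption about how such a listing could be requested.

```python
from typing import Callable, Dict, Iterator


def iter_query_pages(fetch: Callable[[Dict[str, str]], dict],
                     params: Dict[str, str]) -> Iterator[dict]:
    """Yield page objects from an 'action=query' request,
    following 'continue' tokens until the result set is exhausted."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    while True:
        data = fetch(params)
        pages = data.get("query", {}).get("pages", {})
        # formatversion=1 returns a dict keyed by page ID, formatversion=2 a list
        if isinstance(pages, dict):
            pages = list(pages.values())
        yield from pages
        cont = data.get("continue")
        if not cont:
            return
        # merge the continuation values into the next request
        params.update(cont)


def fake_fetch(params: Dict[str, str]) -> dict:
    """Hypothetical stand-in for an HTTP GET to api.php (no network access)."""
    if "gaicontinue" not in params:
        return {
            "continue": {"gaicontinue": "B.png", "continue": "gaicontinue||"},
            "query": {"pages": {"1": {"title": "File:A.png"}}},
        }
    return {"query": {"pages": {"2": {"title": "File:B.png"}}}}


# Assumed request parameters; the PR itself may build them differently.
params = {
    "action": "query",
    "format": "json",
    "generator": "allimages",
    "gailimit": "50",
    "prop": "imageinfo",
    "iiprop": "url|sha1",
}
titles = [page["title"] for page in iter_query_pages(fake_fetch, params)]
```

Passing the fetch function in as an argument keeps the pagination logic testable without touching a real wiki.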

            "generator": "images",
            "titles"   : path,
        }
        self.per_page = self.config("limit", 10)
Contributor Author

Should I set the default to a larger value, or perhaps even eliminate the option altogether? In theory, setting this to 500 could cause some issues if the wiki is hosted on low-end hardware or has poor connectivity.

Owner

Maybe set it to 50 by default and limit it to 200 so others don't go overboard with it?

Contributor Author

I did some tests on Fandom and a few smaller wikis and they all tolerated 500 pretty well, so I guess I was just overthinking it.
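For reference, the "default of 50, capped at 200" suggestion from this thread could be sketched as below. This is only an illustration under stated assumptions: `resolve_per_page` is a hypothetical helper, the numbers come from the Owner's comment, and the merged code may well use different values given the later test results with 500.

```python
def resolve_per_page(configured, default=50, maximum=200):
    """Return the page size to request.

    Falls back to `default` when the option is unset and clamps any
    configured value to the range [1, maximum]. Both numbers are taken
    from the review suggestion and are assumptions, not the merged values.
    """
    if configured is None:
        return default
    return max(1, min(int(configured), maximum))
```

With these assumed bounds, a user-supplied 500 would be reduced to 200, while small values such as 10 pass through unchanged.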

    archive_fmt = "{sha1}"
    request_interval = (1.0, 2.0)

    def __init__(self, match):
        BaseExtractor.__init__(self, match)
        path = match.group(match.lastindex)

        if self.category == "wikimedia":
            self.category = self.root.split(".")[-2]
        elif self.category in ("fandom", "wikigg"):
            self.category = "{}-{}".format(
Contributor Author

Should this be moved to before BaseExtractor.__init__ is called?

Owner

self.category only has a value after BaseExtractor.__init__ has been called, so this is not really an option, I think.

Contributor Author

Got it. It's a bit of a shame that the Fandom wikis can't be controlled individually, but I guess you could just make separate config files to achieve the same result.
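The ordering point in this thread can be illustrated with a toy example. `ToyBase` and `ToyWiki` are hypothetical stand-ins, not gallery-dl's actual classes; the only assumption carried over from the discussion is that the base initializer is what assigns `category` and `root`.

```python
class ToyBase:
    """Hypothetical stand-in for a base extractor (not gallery-dl's BaseExtractor)."""

    def __init__(self, category, root):
        # In this toy model, the attributes exist only after __init__ has run.
        self.category = category
        self.root = root


class ToyWiki(ToyBase):
    def __init__(self, category, root):
        ToyBase.__init__(self, category, root)
        # Safe here: self.category and self.root were set by the base initializer.
        if self.category == "wikimedia":
            # "https://en.wikipedia.org" -> ["https://en", "wikipedia", "org"]
            self.category = self.root.split(".")[-2]


wiki = ToyWiki("wikimedia", "https://en.wikipedia.org")

# An instance whose __init__ has not run yet has no category to inspect,
# which is why the check cannot move above the base initializer call.
uninitialized = ToyWiki.__new__(ToyWiki)
has_category_early = hasattr(uninitialized, "category")
```

In this sketch `wiki.category` ends up as the second-to-last dot-separated component of the root URL, and the uninitialized instance has no `category` attribute at all.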

@mikf mikf merged commit 4b286e8 into mikf:master Aug 25, 2024
9 checks passed
@ClosedPort22 ClosedPort22 deleted the feature/mediawiki-allpages branch August 27, 2024 14:39
@ClosedPort22 ClosedPort22 restored the feature/mediawiki-allpages branch August 27, 2024 14:39
@ClosedPort22 ClosedPort22 deleted the feature/mediawiki-allpages branch August 27, 2024 14:39