
[wikimedia] Add Wikipedia/Wikimedia extractor #2340

Merged: 3 commits into mikf:master on Jan 16, 2024

Conversation

Ailothaen
Contributor

This is related to issue #1443.

I wrote an extractor for Wikipedia and Wikimedia.

This is a first version that supports:

  • extracting images from an article, examples:
https://en.wikipedia.org/wiki/Athena
https://zh.wikipedia.org/wiki/太阳
https://simple.wikipedia.org/wiki/Hydrogen
  • extracting images from a category, examples:
https://commons.wikimedia.org/wiki/Category:Network_maps_of_the_Paris_Metro
https://commons.wikimedia.org/wiki/Category:Tyto_alba_in_flight_(captive)

Metadata is extracted from each item and passed on to the downloader.
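
To give an idea of how the extraction works, here is a rough sketch of the MediaWiki API calls involved. The helper name and parameter selection below are only illustrative; the actual code in the PR is organized differently.

```python
# Illustration of the MediaWiki Action API calls behind this kind of extractor.
# Function names and parameter choices are illustrative, not the PR's code.
import requests

API = "https://en.wikipedia.org/w/api.php"   # or any other Wiki* project

def article_images(title):
    """Yield one imageinfo dict per file embedded in an article."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "images",        # files used on the page
        "titles": title,
        "gimlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url|size|sha1|mime|timestamp|extmetadata",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for info in page.get("imageinfo", ()):
                yield info            # contains "url", "sha1", "extmetadata", ...
        if "continue" not in data:    # follow API continuation, if any
            break
        params.update(data["continue"])

# Categories work the same way with generator=categorymembers,
# gcmtitle="Category:..." and gcmtype="file".
for info in article_images("Athena"):
    print(info["url"])
```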

Since I do not have a complete view of the codebase and process, do not hesitate to point out anything in this code that could be done better. I am thinking in particular of:

  • rate limiting (I manually put a time.sleep(1) between requests to avoid stressing the Wikimedia API too much; see the sketch after this list)
  • making filenames/directories safe for creation, depending on the OS...
  • tests (I am hesitant to make them more precise, as Wikipedia/Wikimedia resources may change unpredictably over time)
  • line length: if you want to enforce that strictly, advice on how to handle the long URLs would be welcome
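
For illustration, the throttling mentioned above is roughly the following (simplified, with made-up names; not the PR's actual code):

```python
import time
import requests

REQUEST_INTERVAL = 1.0      # seconds between API calls, to be gentle on the API
_last_request = 0.0

def api_get(url, **params):
    """GET a JSON endpoint, sleeping so consecutive calls are spaced out."""
    global _last_request
    wait = REQUEST_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    response = requests.get(url, params=params)
    _last_request = time.monotonic()
    response.raise_for_status()
    return response.json()
```

If gallery-dl's extractor-level sleep-request option applies here, this hardcoded delay may turn out to be unnecessary.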

PS: After this PR is merged, I intend to write a short guide in the wiki explaining the first steps of creating an extractor, since that should help people get started.

@God-damnit-all
Contributor

God-damnit-all commented Feb 27, 2022

Could you also add the ability to extract the wiki code from all the pages on a wiki, using the metadata post-processor with event: post? This would be great for small wikis that I'm interested in backing up.

This is easiest to do using the REST API: https://page.url/w/rest.php/v1/page/pagenamegoeshere
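
For example, something along these lines should pull the wikitext of a page (page.url and the page name above are placeholders):

```python
# Illustration of the REST endpoint above; root and title are placeholders.
import requests

def fetch_wikitext(root, title):
    """Return the raw wikitext of a page via the MediaWiki REST API."""
    response = requests.get(f"{root}/w/rest.php/v1/page/{title}")
    response.raise_for_status()
    return response.json()["source"]   # "source" holds the page's wikitext

print(fetch_wikitext("https://en.wikipedia.org", "Athena")[:200])
```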

@Ailothaen
Contributor Author

Ailothaen commented Feb 28, 2022

@ImportTaste: I am not really familiar with post-processors, so I am not sure I understand what you are suggesting here. The wikicode of the pages can certainly be retrieved, but I wonder how this would be integrated into the process. It would be great if you could explain the intended behavior in more detail!

Also, just in case you (or anyone else reading this) are interested in backing up Wikipedia itself, whole dumps are provided here: https://wiki.kiwix.org/wiki/Content/fr
I consider that a better approach than backing up Wikipedia "manually", since manual scraping costs them bandwidth and computing resources.

PS: I will probably push one or more commits to this PR, because I noticed that archive_fmt can be replaced with something better (I am thinking of the SHA-1 of the file, since the API provides it).
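
(For context: gallery-dl extractors declare an archive_fmt format string that determines the key stored in the download archive. The idea is roughly the sketch below, with names simplified for illustration; it is not the PR's literal code.)

```python
# Simplified illustration only; the class name and attributes here do not
# reproduce the PR's actual code.
from gallery_dl.extractor.common import Extractor

class WikimediaArticleExtractor(Extractor):   # hypothetical name
    category = "wikimedia"
    subcategory = "article"
    # the imageinfo API reports a SHA-1 per file, so keying the archive on it
    # recognizes duplicates even when a file is renamed or re-uploaded
    archive_fmt = "{sha1}"
```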

@Ailothaen
Contributor Author

I just pushed a commit that improves the archive identifiers, as mentioned in the PS of my message above; a distinction between categories was also added.

@Ailothaen
Contributor Author

@mikf Sorry for the ping, but do you have any feedback on this pull request? I have already made the modifications I mentioned in my second message.

Ailothaen and others added 3 commits January 16, 2024 02:32
- rewrite using BaseExtractor
- support most Wiki* domains
- update docs/supportedsites
- add tests
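
For readers unfamiliar with the BaseExtractor mechanism mentioned in these commit messages: it lets one extractor module serve several sites from a table of instances. A rough sketch of the shape (the class body and instance entries are illustrative, not the merged wikimedia module):

```python
# General shape only; these entries do not reproduce the merged module.
from gallery_dl.extractor.common import BaseExtractor

class WikimediaExtractor(BaseExtractor):
    basecategory = "wikimedia"

# update() registers the per-site instances and returns a combined URL pattern
# from which the concrete article/category extractors build their regexes.
BASE_PATTERN = WikimediaExtractor.update({
    "wikimediacommons": {
        "root": "https://commons.wikimedia.org",
        "pattern": r"commons\.wikimedia\.org",
    },
})
```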
@mikf mikf merged commit 34a7afd into mikf:master Jan 16, 2024
9 checks passed