
[wikimedia] Add Wikipedia/Wikimedia extractor #2340

Merged: 3 commits into mikf:master on Jan 16, 2024

Conversation

Ailothaen
Contributor

This is related to issue #1443.

I wrote an extractor for Wikipedia and Wikimedia.

This is a first version that supports:

  • extracting images from an article, examples:
https://en.wikipedia.org/wiki/Athena
https://zh.wikipedia.org/wiki/太阳
https://simple.wikipedia.org/wiki/Hydrogen
  • extracting images from a category, examples:
https://commons.wikimedia.org/wiki/Category:Network_maps_of_the_Paris_Metro
https://commons.wikimedia.org/wiki/Category:Tyto_alba_in_flight_(captive)

Metadata is extracted from each item and passed on to the downloader.
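
To give an idea of how the extraction works, here is a rough sketch of the MediaWiki API calls involved. The helper name and parameter selection below are only illustrative; the actual code in the PR is organized differently.

```python
# Illustration of the MediaWiki Action API calls behind this kind of extractor.
# Function names and parameter choices are illustrative, not the PR's code.
import requests

API = "https://en.wikipedia.org/w/api.php"   # or any other Wiki* project

def article_images(title):
    """Yield one imageinfo dict per file embedded in an article."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "images",        # files used on the page
        "titles": title,
        "gimlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url|size|sha1|mime|timestamp|extmetadata",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for info in page.get("imageinfo", ()):
                yield info            # contains "url", "sha1", "extmetadata", ...
        if "continue" not in data:    # follow API continuation, if any
            break
        params.update(data["continue"])

# Categories work the same way with generator=categorymembers,
# gcmtitle="Category:..." and gcmtype="file".
for info in article_images("Athena"):
    print(info["url"])
```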

Since I do not have a complete view of the codebase and process, do not hesitate to point out anything in this code that could be done better. I am thinking in particular of:

  • rate limiting (I manually put a time.sleep(1) between requests to avoid stressing the Wikimedia API too much; see the sketch after this list)
  • making filenames/directories safe for creation, depending on the OS...
  • tests (I am hesitant to make them more precise, as Wikipedia/Wikimedia resources may change unpredictably over time)
  • line length: if you want to enforce that strictly, advice on how to handle the long URLs would be welcome
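
For illustration, the throttling mentioned above is roughly the following (simplified, with made-up names; not the PR's actual code):

```python
import time
import requests

REQUEST_INTERVAL = 1.0      # seconds between API calls, to be gentle on the API
_last_request = 0.0

def api_get(url, **params):
    """GET a JSON endpoint, sleeping so consecutive calls are spaced out."""
    global _last_request
    wait = REQUEST_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    response = requests.get(url, params=params)
    _last_request = time.monotonic()
    response.raise_for_status()
    return response.json()
```

If gallery-dl's extractor-level sleep-request option applies here, this hardcoded delay may turn out to be unnecessary.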

PS: After this PR is merged, I intend to write a short guide in the wiki explaining the first steps of creating an extractor, since that should help people get started.

@God-damnit-all
Contributor

God-damnit-all commented Feb 27, 2022

Could you also add the ability to extract the wiki code from all the pages on a wiki, using the metadata post-processor with event: post? This would be great for small wikis that I'm interested in backing up.

This is easiest to do using the REST API: https://page.url/w/rest.php/v1/page/pagenamegoeshere
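
For example, something along these lines should pull the wikitext of a page (page.url and the page name above are placeholders):

```python
# Illustration of the REST endpoint above; root and title are placeholders.
import requests

def fetch_wikitext(root, title):
    """Return the raw wikitext of a page via the MediaWiki REST API."""
    response = requests.get(f"{root}/w/rest.php/v1/page/{title}")
    response.raise_for_status()
    return response.json()["source"]   # "source" holds the page's wikitext

print(fetch_wikitext("https://en.wikipedia.org", "Athena")[:200])
```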

@Ailothaen
Contributor Author

Ailothaen commented Feb 28, 2022

@ImportTaste: I am not really familiar with post-processors, so I am not sure I understand what you are suggesting here. The wikicode of the pages can certainly be retrieved, but I wonder how this would be integrated into the process. It would be great if you could explain the intended behavior in more detail!

Also, just in case you (or anyone else reading this) are interested in backing up Wikipedia itself, whole dumps are provided here: https://wiki.kiwix.org/wiki/Content/fr
I consider that a better approach than backing up Wikipedia "manually", since manual scraping costs them bandwidth and computing resources.

PS: I will probably push one or more commits to this PR, because I noticed that archive_fmt can be replaced with something better (I am thinking of the SHA-1 of the file, since the API provides it).
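
(For context: gallery-dl extractors declare an archive_fmt format string that determines the key stored in the download archive. The idea is roughly the sketch below, with names simplified for illustration; it is not the PR's literal code.)

```python
# Simplified illustration only; the class name and attributes here do not
# reproduce the PR's actual code.
from gallery_dl.extractor.common import Extractor

class WikimediaArticleExtractor(Extractor):   # hypothetical name
    category = "wikimedia"
    subcategory = "article"
    # the imageinfo API reports a SHA-1 per file, so keying the archive on it
    # recognizes duplicates even when a file is renamed or re-uploaded
    archive_fmt = "{sha1}"
```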

@Ailothaen
Contributor Author

I just pushed a commit that improves the archive identifiers, as mentioned in the PS of my message above; a distinction between categories was also added.

@Ailothaen
Contributor Author

@mikf Sorry for the ping, but do you have any feedback on this pull request? I have already made the modifications I mentioned in my second message.

Ailothaen and others added 3 commits January 16, 2024 02:32
- rewrite using BaseExtractor
- support most Wiki* domains
- update docs/supportedsites
- add tests
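
For readers unfamiliar with the BaseExtractor mechanism mentioned in these commit messages: it lets one extractor module serve several sites from a table of instances. A rough sketch of the shape (the class body and instance entries are illustrative, not the merged wikimedia module):

```python
# General shape only; these entries do not reproduce the merged module.
from gallery_dl.extractor.common import BaseExtractor

class WikimediaExtractor(BaseExtractor):
    basecategory = "wikimedia"

# update() registers the per-site instances and returns a combined URL pattern
# from which the concrete article/category extractors build their regexes.
BASE_PATTERN = WikimediaExtractor.update({
    "wikimediacommons": {
        "root": "https://commons.wikimedia.org",
        "pattern": r"commons\.wikimedia\.org",
    },
})
```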
@mikf mikf merged commit 34a7afd into mikf:master Jan 16, 2024
9 checks passed