genre_scraper.py only scraping 4000 images max #18

toemm · 2018-11-05T15:27:56Z

The scraper works fantastically but is unable to get more than 3000-4000 images from wikiart. I tried adjusting num_pages (up to 4000 pages) but it won't scrape more than 4k pictures.

Maybe it is because on the webpage it is also only showing max 3600 pictures? As can be seen here: https://www.wikiart.org/en/paintings-by-genre/portrait?select=featured#!#filterName:featured,viewType:masonry

Is there any fix for this? I'd like to train the network on more than 4k pictures.

robbiebarrat · 2018-11-05T22:43:41Z

try it now - i just updated the scraper

toemm · 2018-11-05T23:26:19Z

Thanks for the quick update, but it still only attempts to load 3915 pictures. I tried different num_pages values, but to no avail.

toemm · 2018-11-10T23:21:31Z

I've tried everything and couldn't fix this. :(
You updated the code but only changed one import, and I don't think that does anything. The site only shows 3600 pictures per style/genre.

robbiebarrat · 2018-11-11T10:00:25Z

I'll look into this more over the weekend - really sorry it doesn't work, and thanks for bringing it to my attention - leaving this thread open until i fix it...

josh-marsh · 2018-11-26T09:45:54Z

I am having the same problem. If this is not resolvable, would it be possible for you to upload the complete set of images that I assume you still have stored somewhere to a google drive folder? It would be incredibly appreciated. Cheers

robbiebarrat · 2018-11-26T10:03:44Z

@josh-marsh i'm still looking into it - i think it might be a question of too many threads working at once... i think it will be resolvable.
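
One way to test the "too many threads" hypothesis would be to cap how many downloads run at once. The following is only a minimal sketch, not the scraper's actual code: `download_all`, `fetch` and `max_workers=4` are hypothetical names and values chosen for illustration, and `fetch` stands in for whatever function downloads a single image.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL using at most max_workers threads.

    `fetch` is a hypothetical stand-in for the per-image download function.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order and re-raises any
        # exception from a worker when its result is consumed.
        return list(pool.map(fetch, urls))
```

If the image count goes up with a smaller pool, the thread count is a likely culprit; if it stays near the same number, the limit is more likely on the site's side.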

josh-marsh · 2018-11-27T11:33:45Z

Cheers mate. I would try to fix it myself, but web scraping is not something that I have experience with. Keep us updated!

enochkan · 2019-03-21T18:28:25Z

any updates?

jyu-theartofml · 2020-07-12T18:36:49Z

Not sure if this is a related issue, but I had a problem scraping images whose names contain accented characters, because the URLs were no longer plain ASCII. I fixed it by adding urllib.parse.quote under def downloader as follows:

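# Percent-encode accented characters; ':' and '/' stay unescaped (requires "import urllib.parse").
file = urllib.parse.quote(file, safe=':/')
filepath = file.split('/')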
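
To show what that call does on its own, here is a small self-contained sketch. The URL is a made-up example (not a real wikiart link) and `quote_image_url` is a hypothetical helper name. The accented character is replaced by its percent-encoded UTF-8 bytes while the scheme separator and path slashes are left alone.

```python
import urllib.parse


def quote_image_url(url: str) -> str:
    """Percent-encode non-ASCII characters, leaving ':' and '/' untouched."""
    return urllib.parse.quote(url, safe=':/')


# Hypothetical URL containing an accented character.
url = "https://example.com/images/édouard-manet/olympia.jpg"
encoded = quote_image_url(url)
filepath = encoded.split('/')

print(encoded)       # https://example.com/images/%C3%A9douard-manet/olympia.jpg
print(filepath[-1])  # olympia.jpg
```

Note that '%' is not in the safe set, so a URL that is already percent-encoded would be encoded a second time ('%' becomes '%25'). This approach assumes the scraped URLs are still unescaped.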

toemm mentioned this issue Nov 5, 2018

genre_scraper.py only scraping a maximum of ~4000 pictures ml4a/ml4a#39

Closed
