QoL changes #19
base: master
Conversation
External file that has arguments for importing a list of URLs, as well as downloading a list in the queue/cache. After downloading an image it sets the download flag to true, so if the script breaks in the middle it will just pick up from where it left off. Some measures here are redundant and can be consolidated further. A measure is in place to keep the zoom level within JPEG limits. A small sleep is included so we (hopefully) don't hit 429 errors.
Revises the filename for sorting by artist and in chronological order. While the author's name is usually in the URL along with the art name, it's impossible to separate the two by code; hence the reliance on metadata tags to grab the author (and date of painting). We would need a backup plan for when it can't find the author in the data, though the file should still be named fine. Cases where the author's name is not in the URL, however, may cause the image name to be truncated incorrectly. The embedded metadata can be viewed by dropping the final image into exiftool(-k) and so forth.
Forgot to mention - autodownload.py can probably be merged into the main tile_fetch.py (which would also solve some duplication).
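(For reference, a minimal sketch of the queue/flag pattern the description lays out. The CSV column names ('url', 'downloaded') and the download_image stub are assumptions, not necessarily what the PR uses.)

```python
import csv
import time

def download_image(url):
    """Placeholder for the actual tile-fetching call."""
    print('downloading', url)

def process_queue(path):
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        if row['downloaded'] == 'True':
            continue  # already fetched: resume from where we left off
        download_image(row['url'])
        row['downloaded'] = 'True'
        # Persist the flag after every download so a crash loses no progress.
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['url', 'downloaded'])
            writer.writeheader()
            writer.writerows(rows)
        time.sleep(1)  # small delay to (hopefully) avoid HTTP 429 responses
```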
Thanks for the PR! I highlighted a few possible enhancements.
autodownload.py
Outdated
import asyncio
import tile_fetch
import sys
import pandas as pd
pandas is a huge library. It would be great if we could avoid having to add it to our dependencies
My friend sketched up the original code (albeit very different from now) using pandas, presumably for file handling. I'll see if I can find an alternative.
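(If pandas is only used to read the URL list, the standard library's csv module covers that. A sketch, assuming a 'url' column header in the file:)

```python
import csv

# Roughly equivalent to pd.read_csv('urls.csv')['url'].tolist()
with open('urls.csv', newline='', encoding='utf-8') as f:
    urls = [row['url'] for row in csv.DictReader(f)]
```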
requirements.txt
Outdated
Pillow~=7.1.2
aiohttp~=3.6.2
pyexiv2~=2.2.0
cssselect
unidecode
pandas
html
It would be good if we could avoid multiplying the dependencies.
Sorry - mind clarifying? I'm not familiar with dependencies, so I'm not sure what I did wrong here.
Can we remove unidecode and pandas, and not make cssselect and html mandatory? We can try to import them and disable the corresponding features when they are not available.
Any idea how I would make cssselect conditional? I thought it was needed for scraping the metadata. As far as I know, every Arts & Culture page has metadata that can be scraped (if not for embedding, at least for final name formatting).
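(For reference, a minimal sketch of the optional-import pattern being suggested; the scraping body is elided and stands in for the existing cssselect-based code.)

```python
try:
    import cssselect  # noqa: F401 - only needed for metadata scraping
except ImportError:
    cssselect = None

def maybe_scrape(page):
    """Run the metadata scraping only when its dependency is available."""
    if cssselect is None:
        print('cssselect not installed; skipping metadata scraping')
        return None
    # ... the existing cssselect-based scraping would go here ...
    return {}
```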
tile_fetch.py
Outdated
@@ -51,7 +61,8 @@ def __init__(self, url):
         self.token = token or b''
         url_path = urllib.parse.unquote_plus(urllib.parse.urlparse(url).path)
         self.image_slug, image_id = url_path.split('/')[-2:]
-        self.image_name = '%s - %s' % (string.capwords(self.image_slug.replace("-"," ")), image_id)
+        self.image_name = unidecode.unidecode(string.capwords(self.image_slug.replace("-"," ")))
Why do we need to remove non-ASCII characters here?
I believe I ran into some cases where the filename contained non-ASCII characters, and thus the download wouldn't finalize on Windows. Replacing non-ASCII characters makes it system-compliant.
What characters exactly? All major operating systems accept Unicode characters in filenames. On the other hand, there are many ASCII characters that are rejected on Windows. Can we just strip invalid characters from the name?
It was a Spanish accented name, one of those accented "e"s. I don't have the URL on hand. I would avoid stripping those letters from the name - they're part of the actual name. Replacing each with a like letter such as "e" seems to be a better solution.
Accented letters can be part of filenames on all operating systems, with all but the most ancient filesystems.
That is weird, because I think this link was running into issues without this piece of code to handle the characters, under the new naming scheme.
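(The two approaches discussed above, side by side. Note that unidecode's output can still contain characters Windows rejects, such as '?':)

```python
import re
import unidecode  # only needed for option B

name = "Café – L'Étoile?"
# Option A: strip only the characters Windows actually rejects; keeps 'é'.
safe = re.sub(r'[<>:"/\\|?*]', '', name)        # "Café – L'Étoile"
# Option B (this PR): transliterate everything to ASCII look-alikes.
ascii_name = unidecode.unidecode(name)          # "Cafe - L'Etoile?"
```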
tile_fetch.py
Outdated
## Ensuring image resolution fits in JPEG - two pass
if info.tile_info[z].size[0] > 65535 or info.tile_info[z].size[1] > 65535:
    print(
        'Zoom level {r} too high for JPEG output, using next zoom level {next_z} instead'.format(
            r=z,
            next_z=z-1)
    )
    z = z-1

if info.tile_info[z].size[0] > 65535 or info.tile_info[z].size[1] > 65535:
    print(
        'Zoom level {r} *still* too high for JPEG output, using next zoom level {next_z} instead'.format(
            r=z,
            next_z=z-1)
    )
    z = z-1
If the user has specifically requested a zoom level, we should respect it (and save as PNG if the image is too large). But when we are choosing the zoom level automatically, we can run this instead of always choosing z = len(info.tile_info) - 1.
And this should be done in a loop, not with 2 successive checks.
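(A sketch of that loop, using the names from the diff above; 65535 is the JPEG dimension limit:)

```python
JPEG_MAX = 65535  # JPEG stores image dimensions as unsigned 16-bit values

def clamp_zoom(tile_info, z):
    """Walk down zoom levels until the size fits in a JPEG (or z hits 0)."""
    while z > 0 and max(tile_info[z].size) > JPEG_MAX:
        print('Zoom level %d too high for JPEG output, using %d instead' % (z, z - 1))
        z -= 1
    return z
```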
We can save it as a PNG if the zoom is too large, but we would then have to skip all the code appending metadata to the file - PNG doesn't support it.
Correct - I'll make it into a loop.
Yes, the code that appends metadata must be in a conditional.
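(Roughly this shape; embed_metadata is a hypothetical stand-in for the PR's actual tag-writing code:)

```python
def embed_metadata(path):
    """Placeholder for the PR's XMP tag-writing code."""
    print('embedding tags into', path)

final_image_filename = 'example.jpg'  # becomes 'example.png' when oversized
if final_image_filename.lower().endswith(('.jpg', '.jpeg')):
    embed_metadata(final_image_filename)
# else: PNG output, so the XMP-embedding step is skipped entirely
```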
Both of these should be done in the latest commit... but please let me know if I missed something.
tile_fetch.py
Outdated
# Taking out the author's name from the image name - the author's name is appended later
modified_image_name = info.image_name[0:len(info.image_name)-len(author)-1]

final_image_filename = (author + ' - ' + date + ' - ' + modified_image_name + ' - ' + info.image_id + '.jpg')
Save as PNG if the image is too large.
Should be done in the latest commit.
tile_fetch.py
Outdated
xmp_file_obj = TaggedImage(final_image_filename)

# writes key:value one at a time, which is heavier on writes,
# but far more robust.
In what situation can tag writes fail? Can't it be known ahead of time?
My friend added this piece of catch-code, but I believe some foreign characters (Japanese, perhaps? Something unsupported on the host system) can make it fail. We came across one such case, but I forget the details behind its failing. Most times there are fewer than 10 tags in the page description, so I didn't feel it was a big deal.
Can you try to find the cases that make this fail?
Edit: OK, I just saw your comment below, thanks.
As per here, the limit seems to be 2 GB. Maybe the image, before being saved, is larger than that in memory?
Can you try to find the cases that make this fail?
Edit: OK, I just saw your comment below, thanks.
In addition, I think this link prompted my friend to write the key:values line by line. Below is the error message I originally sent him. I think writing each value separately is better, so that the whole thing doesn't fail if an XMP tag can't be embedded, for whatever reason.
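(A sketch of that per-key write-with-fallback using pyexiv2's modify_xmp; the tag keys and path here are examples, and whether this matches the PR's exact calls is an assumption:)

```python
from pyexiv2 import Image

tags = {'Xmp.dc.creator': 'Frida Kahlo', 'Xmp.dc.date': '1940'}  # example tags
img = Image('example.jpg')               # hypothetical path
for key, value in tags.items():
    try:
        img.modify_xmp({key: value})     # one write per tag: slower, but a bad
    except RuntimeError as err:          # tag no longer aborts the whole batch
        print('Skipping tag %s: %s' % (key, err))
img.close()
```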
One example of tags failing:
This scenario fails to write any XMP tags, rather than a single one. I am not yet sure why. Running the same on zoom 3 allows the tags to be embedded.
What is the exact RuntimeError being thrown?
RuntimeError('Memory allocation failed'), seen after adding some code that surfaces the error. I'm trying to (hopefully) solve it here.
Small update: I've made a couple of the changes (not all the suggested ones yet, but slowly) and am testing the newer version now before pushing to the branch. So far it works well, but there's an error later on when saving the file (still trying to figure out why). Pandas is still a requirement for now, though I know we should move off of it if possible.
Looks like it's limited to 1 GB for writing XMP tags. I tried looking into Python XMP Toolkit as an alternative, but it seems that's not well documented or tested on Windows.
These new changes should solve some of the issues; I'll see if I can test tonight. Not sure why the build failed.
Because unidecode was removed from requirements.txt but is still being unconditionally imported in tile_fetch.py.
I see. I don't know how to test whether there are non-ASCII characters in the name (and therefore whether we need unidecode to translate them into an ASCII equivalent). Are you okay with me putting it back into the requirements?
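(For what it's worth, that check needs no extra dependency; str.isascii() exists since Python 3.7:)

```python
def needs_transliteration(name: str) -> bool:
    """True if the name contains any non-ASCII character."""
    return not name.isascii()

print(needs_transliteration('Cafe'))   # False
print(needs_transliteration('Café'))   # True
```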
This reverts commit 13b86bf.
As I said, I don't think there is any major platform that disallows non-ASCII filenames in its default filesystem. Unicode characters in filenames are okay.
Could you try downloading with the following link? Any zoom should be fine. I took out the unidecode pieces from the code attached, and get this error.
This looks like an instance of the following bug in pyexiv2: LeoHsiao1/pyexiv2#21
I see. Is there a solution that doesn't involve unidecode? Since (almost) all GA&C links have metadata, and unidecode doesn't seem like a costly requirement (compared to, say, pandas, as you mentioned), I'm not averse to keeping it in the requirements.
Don't you think more people want the filenames to be correct than want embedded meta-information?
I guess that's where we have different opinions (which is okay). I feel it is very important to know where the file came from (the source URL, which is included in the metadata), as well as the author's name separate from the image name. The source URL can technically be derived from the ID, which works right now appended to any other URL, but I wouldn't rely on that continuing to work. The artist's name is included in the final image name, but not separately from it. The date of the painting is important to me as well, so it's no surprise I prefer to have that.
I think I'm not understanding the workaround well, since using ImageData doesn't seem to work either...
Instead of taking them directly from the metadata, this pulls the author and creation date from the page source. Cleaner and more reliable.
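(Conceptually something like this; the selector and the toy page are purely hypothetical, since the real Arts & Culture markup has to be inspected:)

```python
import lxml.html

# Toy page standing in for the real fetched HTML.
page_source = '<html><span itemprop="author">Frida Kahlo</span></html>'
doc = lxml.html.fromstring(page_source)
# Hypothetical selector - the actual GA&C attributes/classes will differ.
nodes = doc.cssselect('span[itemprop="author"]')
author = nodes[0].text_content().strip() if nodes else 'Unknown'
```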
Note - still fails when "!" is contained in the name. Otherwise creates and updates folders with each artist's name.
Loops through artist pages and grabs links. Has Chrome plugin dependencies - not plug-and-play.
Unfortunately, the dependencies are still sloppy... my non-comp-sci background showing. I haven't had a chance to replace pandas - I know you preferred an alternative. For now this works (though not ideally) for batch downloading, and includes code to "skip" (i.e. ignore) any errors, since videos and some other GAC pages will throw an error and stop the code. Included are some scripts both to gather links to begin with and to organize the outputs into folders for each artist (outputs organized as %artist% - %date% - %name% - %id%).
So these changes should set zoom levels to conform to JPEG limits, allow importing links from a batch CSV containing a URL list, create (and maintain) a cache/queue that flags downloaded files so no re-downloading occurs, and add metadata tags from the website's art description to the file. They also use that same set of tags to rework the output name for better sorting of downloads.
Let's discuss. Thanks!