QoL changes #19
base: master
Conversation
External file that has arguments for importing a list of URLs, as well as downloading a list in the queue/cache. After downloading an image it sets the download flag to true, so if the script breaks in the middle it will just pick up from where it left off. Some measures here are redundant and can be consolidated further. A measure is in place to keep the zoom level within JPEG limits. A small sleep is included so we (hopefully) don't hit 429 errors.
Revises the filename for sorting by artist and in chronological order. While the author's name is usually in the URL along with the art name, it's impossible to separate the two by code; hence the reliance on metadata tags to grab the author (and date of painting). We would need a backup plan for when it can't find the author in the data, though the file should still be named fine. Cases where the author's name is not in the URL, however, may cause the image name to be truncated incorrectly. The embedded metadata can be viewed by dropping the final image into exiftool(-k) and so forth.
Forgot to mention - autodownload.py can probably be merged into the main tile_fetch.py (which would also solve some duplication).
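(For reference, a minimal sketch of the queue/flag pattern the description lays out. The CSV column names ('url', 'downloaded') and the download_image stub are assumptions, not necessarily what the PR uses.)

```python
import csv
import time

def download_image(url):
    """Placeholder for the actual tile-fetching call."""
    print('downloading', url)

def process_queue(path):
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        if row['downloaded'] == 'True':
            continue  # already fetched: resume from where we left off
        download_image(row['url'])
        row['downloaded'] = 'True'
        # Persist the flag after every download so a crash loses no progress.
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['url', 'downloaded'])
            writer.writeheader()
            writer.writerows(rows)
        time.sleep(1)  # small delay to (hopefully) avoid HTTP 429 responses
```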
Thanks for the PR! I highlighted a few possible enhancements.
autodownload.py
Outdated
import asyncio
import tile_fetch
import sys
import pandas as pd
pandas is a huge library. It would be great if we could avoid having to add it to our dependencies
My friend sketched up the original code (albeit very different from now) using pandas, presumably for file handling. I'll see if I can find an alternative.
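(If pandas is only used to read the URL list, the standard library's csv module covers that. A sketch, assuming a 'url' column header in the file:)

```python
import csv

# Roughly equivalent to pd.read_csv('urls.csv')['url'].tolist()
with open('urls.csv', newline='', encoding='utf-8') as f:
    urls = [row['url'] for row in csv.DictReader(f)]
```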
requirements.txt
Outdated
Pillow~=7.1.2
aiohttp~=3.6.2
pyexiv2~=2.2.0
cssselect
unidecode
pandas
html
It would be good if we could avoid multiplying the dependencies.
Sorry - mind clarifying? I'm not familiar with dependencies, so I'm not sure what I did wrong here.
Can we remove unidecode and pandas, and not make cssselect and html mandatory? We can try to import them and disable the corresponding features when they are not available.
Any idea how I would make cssselect conditional? I thought it was needed for scraping the metadata. As far as I know, every Arts & Culture page has metadata that can be scraped (if not for embedding, at least for final name formatting).
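(For reference, a minimal sketch of the optional-import pattern being suggested; the scraping body is elided and stands in for the existing cssselect-based code.)

```python
try:
    import cssselect  # noqa: F401 - only needed for metadata scraping
except ImportError:
    cssselect = None

def maybe_scrape(page):
    """Run the metadata scraping only when its dependency is available."""
    if cssselect is None:
        print('cssselect not installed; skipping metadata scraping')
        return None
    # ... the existing cssselect-based scraping would go here ...
    return {}
```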
tile_fetch.py
Outdated
@@ -51,7 +61,8 @@ def __init__(self, url):
         self.token = token or b''
         url_path = urllib.parse.unquote_plus(urllib.parse.urlparse(url).path)
         self.image_slug, image_id = url_path.split('/')[-2:]
-        self.image_name = '%s - %s' % (string.capwords(self.image_slug.replace("-"," ")), image_id)
+        self.image_name = unidecode.unidecode(string.capwords(self.image_slug.replace("-"," ")))
Why do we need to remove non-ASCII characters here?
I believe I ran into some cases where the filename contained non-ASCII characters, and thus the download wouldn't finalize on Windows. Replacing non-ASCII characters makes it system-compliant.
What characters exactly? All major operating systems accept Unicode characters in filenames. On the other hand, there are many ASCII characters that are rejected on Windows. Can we just strip invalid characters from the name?
It was a Spanish accented name, one of those accented "e"s. I don't have the URL on hand. I would avoid stripping those letters from the name - they're part of the actual name. Replacing each with a like letter such as "e" seems to be a better solution.
Accented letters can be part of filenames on all operating systems, with all but the most ancient filesystems.
That is weird, because I think this link was running into issues without this piece of code to handle the characters, under the new naming scheme.
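(The two approaches discussed above, side by side. Note that unidecode's output can still contain characters Windows rejects, such as '?':)

```python
import re
import unidecode  # only needed for option B

name = "Café – L'Étoile?"
# Option A: strip only the characters Windows actually rejects; keeps 'é'.
safe = re.sub(r'[<>:"/\\|?*]', '', name)        # "Café – L'Étoile"
# Option B (this PR): transliterate everything to ASCII look-alikes.
ascii_name = unidecode.unidecode(name)          # "Cafe - L'Etoile?"
```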
tile_fetch.py
Outdated
## Ensuring image resolution fits in JPEG - two pass
if info.tile_info[z].size[0] > 65535 or info.tile_info[z].size[1] > 65535:
    print(
        'Zoom level {r} too high for JPEG output, using next zoom level {next_z} instead'.format(
            r=z,
            next_z=z-1)
    )
    z = z-1

if info.tile_info[z].size[0] > 65535 or info.tile_info[z].size[1] > 65535:
    print(
        'Zoom level {r} *still* too high for JPEG output, using next zoom level {next_z} instead'.format(
            r=z,
            next_z=z-1)
    )
    z = z-1
If the user has specifically requested a zoom level, we should respect it (and save as PNG if the image is too large). But when we are choosing the zoom level automatically, we can run this instead of always choosing z = len(info.tile_info) - 1.
And this should be done in a loop, not with 2 successive checks.
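(A sketch of that loop, using the names from the diff above; 65535 is the JPEG dimension limit:)

```python
JPEG_MAX = 65535  # JPEG stores image dimensions as unsigned 16-bit values

def clamp_zoom(tile_info, z):
    """Walk down zoom levels until the size fits in a JPEG (or z hits 0)."""
    while z > 0 and max(tile_info[z].size) > JPEG_MAX:
        print('Zoom level %d too high for JPEG output, using %d instead' % (z, z - 1))
        z -= 1
    return z
```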
We can save it as a PNG if the zoom is too large, but we would then have to skip all the code appending metadata to the file - PNG doesn't support it.
Correct - I'll make it into a loop.
Yes, the code that appends metadata must be in a conditional.
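(Roughly this shape; embed_metadata is a hypothetical stand-in for the PR's actual tag-writing code:)

```python
def embed_metadata(path):
    """Placeholder for the PR's XMP tag-writing code."""
    print('embedding tags into', path)

final_image_filename = 'example.jpg'  # becomes 'example.png' when oversized
if final_image_filename.lower().endswith(('.jpg', '.jpeg')):
    embed_metadata(final_image_filename)
# else: PNG output, so the XMP-embedding step is skipped entirely
```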
Both of these should be done in the latest commit... but please let me know if I missed something.
tile_fetch.py
Outdated
# Taking out the author's name from the image name - the author's name is appended later
modified_image_name = info.image_name[0:len(info.image_name)-len(author)-1]

final_image_filename = (author + ' - ' + date + ' - ' + modified_image_name + ' - ' + info.image_id + '.jpg')
Save as PNG if the image is too large.
Should be done in the latest commit.
tile_fetch.py
Outdated
xmp_file_obj = TaggedImage(final_image_filename)

# writes key:value one at a time, which is heavier on writes,
# but far more robust.
In what situation can tag writes fail? Can't it be known ahead of time?
My friend added this piece of catch-code, but I believe some foreign characters (Japanese, perhaps? Something unsupported on the host system) can make it fail. We came across one such case, but I forget the details behind its failing. Most times there are fewer than 10 tags in the page description, so I didn't feel it was a big deal.
Can you try to find the cases that make this fail?
Edit: OK, I just saw your comment below, thanks.
As per here, the limit seems to be 2 GB. Maybe the image, before being saved, is larger than that in memory?
Can you try to find the cases that make this fail?
Edit: OK, I just saw your comment below, thanks.
In addition, I think this link prompted my friend to write the key:values line by line. Below is the error message I originally sent him. I think writing each value separately is better, so that the whole thing doesn't fail if an XMP tag can't be embedded, for whatever reason.
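(A sketch of that per-key write-with-fallback using pyexiv2's modify_xmp; the tag keys and path here are examples, and whether this matches the PR's exact calls is an assumption:)

```python
from pyexiv2 import Image

tags = {'Xmp.dc.creator': 'Frida Kahlo', 'Xmp.dc.date': '1940'}  # example tags
img = Image('example.jpg')               # hypothetical path
for key, value in tags.items():
    try:
        img.modify_xmp({key: value})     # one write per tag: slower, but a bad
    except RuntimeError as err:          # tag no longer aborts the whole batch
        print('Skipping tag %s: %s' % (key, err))
img.close()
```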
One example of tags failing:
This scenario fails to write any XMP tags, rather than a single one. I am not yet sure why. Running the same on zoom 3 allows the tags to be embedded.
What is the exact RuntimeError being thrown?
RuntimeError('Memory allocation failed'), seen after adding some code that surfaces the error. I'm trying to (hopefully) solve it here.
Small update: I've made a couple of the changes (not all the suggested ones yet, but slowly) and am testing the newer version now before pushing to the branch. So far it works well, but there's an error later on when saving the file (still trying to figure out why). Pandas is still a requirement for now, though I know we should move off of it if possible.
Looks like it's limited to 1 GB for writing XMP tags. I tried looking into Python XMP Toolkit as an alternative, but it seems that's not well documented or tested on Windows.
These new changes should solve some of the issues; I'll see if I can test tonight. Not sure why the build failed.
Because unidecode was removed from requirements.txt but is still being unconditionally imported in tile_fetch.py.
I see. I don't know how to test whether there are non-ASCII characters in the name (and therefore whether we need unidecode to translate them into an ASCII equivalent). Are you okay with me putting it back into the requirements?
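(For what it's worth, that check needs no extra dependency; str.isascii() exists since Python 3.7:)

```python
def needs_transliteration(name: str) -> bool:
    """True if the name contains any non-ASCII character."""
    return not name.isascii()

print(needs_transliteration('Cafe'))   # False
print(needs_transliteration('Café'))   # True
```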
This reverts commit 13b86bf.
As I said, I don't think there is any major platform that disallows non-ASCII filenames in its default filesystem. Unicode characters in filenames are okay.
Could you try downloading with the following link? Any zoom should be fine. I took out the unidecode pieces from the code attached, and get this error.
This looks like an instance of the following bug in pyexiv2: LeoHsiao1/pyexiv2#21
I see. Is there a solution that doesn't involve unidecode? Since (almost) all GA&C links have metadata, and unidecode doesn't seem like a costly requirement (compared to, say, pandas, as you mentioned), I'm not averse to keeping it in the requirements.
Don't you think more people want the filenames to be correct than want embedded meta-information?
I guess that's where we have different opinions (which is okay). I feel it is very important to know where the file came from (the source URL, which is included in the metadata), as well as the author's name separate from the image name. The source URL can technically be derived from the ID, which works right now appended to any other URL, but I wouldn't rely on that continuing to work. The artist's name is included in the final image name, but not separately from it. The date of the painting is important to me as well, so it's no surprise I prefer to have that.
I think I'm not understanding the workaround well, since using ImageData doesn't seem to work either...
Instead of taking them directly from the metadata, this pulls the author and creation date from the page source. Cleaner and more reliable.
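(Conceptually something like this; the selector and the toy page are purely hypothetical, since the real Arts & Culture markup has to be inspected:)

```python
import lxml.html

# Toy page standing in for the real fetched HTML.
page_source = '<html><span itemprop="author">Frida Kahlo</span></html>'
doc = lxml.html.fromstring(page_source)
# Hypothetical selector - the actual GA&C attributes/classes will differ.
nodes = doc.cssselect('span[itemprop="author"]')
author = nodes[0].text_content().strip() if nodes else 'Unknown'
```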
Note - still fails when "!" is contained in the name. Otherwise creates and updates folders with each artist's name.
Loops through artist pages and grabs links. Has Chrome plugin dependencies - not plug-and-play.
Unfortunately, the dependencies are still sloppy... my non-comp-sci background showing. I haven't had a chance to replace pandas - I know you preferred an alternative. For now this works (though not ideally) for batch downloading, and includes code to "skip" (i.e. ignore) any errors, since videos and some other GAC pages will throw an error and stop the code. Included are some scripts both to gather links to begin with and to organize the outputs into folders for each artist (outputs organized as %artist% - %date% - %name% - %id%).
So these changes should set zoom levels to conform to JPEG limits, allow importing links from a batch CSV containing a URL list, create (and maintain) a cache/queue that flags downloaded files so no re-downloading occurs, and add metadata tags from the website's art description to the file. They also use that same set of tags to rework the output name for better sorting of downloads.
Let's discuss. Thanks!