
Hash images and save resources #566

Closed
Maingron opened this issue Oct 15, 2023 · 7 comments

Comments

@Maingron

What about hashing images before / after compressing and comparing them against a local table of already compressed images, maybe even just from the same run? This should be a cheap way to occasionally save some resources. It would also mean that if we compressed the same image with a heavier algorithm beforehand, a later run with a lighter algorithm could still get the same result, just by copying and verifying the older compressed image.
Said table could be stored in the OS's temp folder. If we have a match, we should also verify it's actually the same image and not just a hash collision.
In my opinion, Oxipng should also add a flag to disable this cache.
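
To make the idea concrete, here is a minimal sketch of that lookup, using only the standard library. Every name here is invented for illustration and this is not Oxipng's actual code: hash the input, look it up in a table of previously compressed images, and verify the match byte-for-byte so a hash collision can never hand back the wrong file.

```rust
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// One entry per previously optimised image (hypothetical layout).
struct CacheEntry {
    source: PathBuf,    // where the original input lived when it was cached
    optimised: PathBuf, // where the compressed result was written
}

/// hash -> entry; in practice this would be persisted in the OS temp folder.
type HashTable = HashMap<u64, CacheEntry>;

fn hash_bytes(data: &[u8]) -> u64 {
    // DefaultHasher is not cryptographic; a real cache would likely pick a
    // stronger hash, but collisions are harmless here because we verify below.
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Returns the path of a reusable optimised file, or None on any mismatch.
fn lookup(table: &HashTable, input: &Path) -> io::Result<Option<PathBuf>> {
    let data = fs::read(input)?;
    let Some(entry) = table.get(&hash_bytes(&data)) else {
        return Ok(None);
    };
    // Guard against hash collisions and against the old source having changed:
    // only reuse the result if the cached source still exists and is identical.
    match fs::read(&entry.source) {
        Ok(old) if old == data => Ok(Some(entry.optimised.clone())),
        _ => Ok(None), // stale or colliding entry; fall back to a normal run
    }
}
```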

I'm partially referring to #549

While here already: Thank you for this great tool!

@Maingron
Author

Maybe I should mention why I chose some of my words / clarify a bit:

  • Obviously, the hash table would be best kept in the temp folder. Once the user wants to clean up, it's gone too. Maybe we could also add a flag that allows a custom path for the hash table.
  • Copy the older image from its path: we can't save a copy of every compressed image in the temp folder, because that would take way too much space. Instead we have to check whether the older one still exists in its location and compare everything. If there's any mismatch, delete the entry from the hash table and just proceed as we usually would.
  • Copy the older image from its location: I guess it would be wise to only extract the actual image data itself and "compile" our own image from that. This way we can ignore all metadata and the like. This is probably also the part we should actually hash; I think that would give a higher hit rate, and we can handle our own metadata from there, which shouldn't be too heavy. (I'm talking about unnecessary or unrelated metadata and file attributes. Maybe rotation could be a thing too. See the sketch after this list.)
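
As a rough sketch of that last point, hashing only the decoded pixel data (plus dimensions and pixel format) means metadata chunks and file attributes can never affect the cache key. This assumes the `png` crate for decoding; the function name is made up and `DefaultHasher` is only a stand-in for whatever hash the cache would really use:

```rust
// Requires the `png` crate (e.g. png = "0.17") in Cargo.toml.
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::path::Path;

fn pixel_hash(path: &Path) -> Result<u64, Box<dyn std::error::Error>> {
    let decoder = png::Decoder::new(File::open(path)?);
    let mut reader = decoder.read_info()?;
    let mut buf = vec![0; reader.output_buffer_size()];
    let info = reader.next_frame(&mut buf)?; // decodes the image data only
    let pixels = &buf[..info.buffer_size()];

    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    // Two files only count as "the same image" if the raw pixels *and* the way
    // they are laid out match; ancillary chunks (tEXt, tIME, ...) are ignored.
    (info.width, info.height).hash(&mut hasher);
    (info.color_type as u8, info.bit_depth as u8).hash(&mut hasher);
    pixels.hash(&mut hasher);
    Ok(hasher.finish())
}
```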

Regarding flags, these are the ones I've thought of so far (a rough sketch of how they could look follows below):

  • Disable the hash table entirely
  • Custom path for the hash table: this could also be relevant for privacy reasons, for example if you have customers and absolutely want to make sure nothing gets mixed up
  • Force same algorithm: with this flag, we only reuse an existing result if it was produced by the same algorithm we're currently using. Maybe the user doesn't want a free and better result for some reason 🤷

Maybe there could also be a "TTL", defined by either the Oxipng version or the entry's age. Maybe the user doesn't want to keep hashes from more than 2 years ago, or just wants to refresh the results after a given time, since the algorithms may have improved.
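
None of these flags exist in Oxipng today. Purely for illustration, and assuming `clap` for argument parsing, the proposed options (including the TTL idea) could look roughly like this; all names are invented:

```rust
// Requires clap = { version = "4", features = ["derive"] } in Cargo.toml.
use clap::Parser;

#[derive(Parser)]
struct CacheOptions {
    /// Disable the hash-table cache entirely.
    #[arg(long)]
    no_cache: bool,

    /// Store the hash table somewhere other than the OS temp folder,
    /// e.g. to keep different customers' data strictly separated.
    #[arg(long)]
    cache_path: Option<std::path::PathBuf>,

    /// Only reuse a cached result if it was produced with the same
    /// optimisation settings as the current run.
    #[arg(long)]
    cache_force_same_algorithm: bool,

    /// Ignore cache entries older than this many days ("TTL by age");
    /// entries written by a different version could be dropped too.
    #[arg(long, default_value_t = 730)]
    cache_max_age_days: u64,
}

fn main() {
    let opts = CacheOptions::parse();
    if opts.no_cache {
        println!("cache disabled");
    }
}
```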

Everything will depend on the implementation, though; I'm just sharing my thoughts. That said, I have little knowledge about how PNGs work behind the scenes, so I don't know if the technical aspects actually hold up. I'm fairly certain it would work if we hashed the entire files; everything else assumes PNGs work the way I think they do.

@andrews05
Collaborator

Hi @Maingron

Just so I understand, the main distinction between what you're proposing and the discussions in #549 is that this will effectively handle duplicates of the same image, is that right? It seems extremely similar, so I would prefer if you could add your ideas to the existing issue and I can close this as a duplicate.

@Maingron
Author

Maingron commented Oct 16, 2023

@andrews05 I don't think it's too similar.
As I understand it, #549 just keeps track of which images (or actually paths) it has already processed and doesn't touch them again, while this proposal checks whether we've already worked on such an image in the past and, if so, re-uses what we can and works from there. We're still touching each and every image.

Best example I can think of at the moment: if we have a folder with 10000 images and the process gets interrupted at image 9500, we would start again at image 1. #549 would start at image 9501.

So basically we index everything for future reference while #549 just doesn't want to touch anything twice.

Another difference is that we would notice if anything about an image changed, while #549 wouldn't, since it has already worked on it and doesn't bother again. #549 would also only care about the path itself and not about the actual data.

@andrews05
Collaborator

I'm trying to understand this at a conceptual level rather than an implementation level. Can you describe at a high level what you're actually wanting to achieve? What is your use case?

@Maingron
Author

@andrews05 Well, let's assume I'm some non-technical user and make up some story:

It would be cool if Oxipng worked much faster. Oftentimes I have the same image in multiple projects I manage, and it really can hinder my workflow when I have to wait for the same image to be processed in multiple directories. It should be possible for Oxipng to remember that it has already processed this image, just in another directory, and automatically copy the result.

Also, once a week I get some data to manage, almost like a backup, which I always run through Oxipng before doing anything else with it. Usually the data doesn't change much, so it would go so much faster if Oxipng remembered the work from last week. I can't manually copy over last week's processed images, however, because I need to avoid overwriting data that has actually changed. Also, there are just way too many sub-directories.

@andrews05
Collaborator

andrews05 commented Nov 8, 2023

Right, thanks for the user story!
To try to summarise this and #549: Both follow a similar theme of avoiding redundant optimisation of images that have already been processed once before. The difference is that this one is about repeat copies of the same image, while #549 is about repeat processing of the same file. Do you think that's accurate?

Despite serving slightly different use cases, I still think they're conceptually related and it would be great if we could collect all these ideas together in the same topic.

@andrews05
Collaborator

Closing in favour of #549
