
Hash images and save resources #566

Closed
Maingron opened this issue Oct 15, 2023 · 7 comments

Comments

@Maingron

What about hashing images before / after compressing and comparing them against a local table of already compressed images, maybe even just from the same run? This should be a cheap way to occasionally save some resources. It would also mean that if we compressed the same image with a heavier algorithm beforehand, a later run with a lighter algorithm could still get the same result, just by copying and verifying the older compressed image.
Said table could be stored in the OS's temp folder. If we have a match, we should also verify it's actually the same image and not just a hash collision.
In my opinion, Oxipng should also add a flag to disable this cache.
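
To make the idea concrete, here is a minimal sketch of that lookup, using only the standard library. Every name here is invented for illustration and this is not Oxipng's actual code: hash the input, look it up in a table of previously compressed images, and verify the match byte-for-byte so a hash collision can never hand back the wrong file.

```rust
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// One entry per previously optimised image (hypothetical layout).
struct CacheEntry {
    source: PathBuf,    // where the original input lived when it was cached
    optimised: PathBuf, // where the compressed result was written
}

/// hash -> entry; in practice this would be persisted in the OS temp folder.
type HashTable = HashMap<u64, CacheEntry>;

fn hash_bytes(data: &[u8]) -> u64 {
    // DefaultHasher is not cryptographic; a real cache would likely pick a
    // stronger hash, but collisions are harmless here because we verify below.
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Returns the path of a reusable optimised file, or None on any mismatch.
fn lookup(table: &HashTable, input: &Path) -> io::Result<Option<PathBuf>> {
    let data = fs::read(input)?;
    let Some(entry) = table.get(&hash_bytes(&data)) else {
        return Ok(None);
    };
    // Guard against hash collisions and against the old source having changed:
    // only reuse the result if the cached source still exists and is identical.
    match fs::read(&entry.source) {
        Ok(old) if old == data => Ok(Some(entry.optimised.clone())),
        _ => Ok(None), // stale or colliding entry; fall back to a normal run
    }
}
```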

I'm partially referring to #549

While here already: Thank you for this great tool!

@Maingron
Author

Maybe I should mention why I chose some of my words / clarify a bit:

  • Obviously, the hash table would be best kept in the temp folder. Once the user wants to clean up, it's gone too. Maybe we could also add a flag that allows a custom path for the hash table.
  • Copy the older image from its path: we can't save a copy of every compressed image in the temp folder, because that would take way too much space. Instead we have to check whether the older one still exists in its location and compare everything. If there's any mismatch, delete the entry from the hash table and just proceed as we usually would.
  • Copy the older image from its location: I guess it would be wise to only extract the actual image data itself and "compile" our own image from that. This way we can ignore all metadata and the like. This is probably also the part we should actually hash; I think that would give a higher hit rate, and we can handle our own metadata from there, which shouldn't be too heavy. (I'm talking about unnecessary or unrelated metadata and file attributes. Maybe rotation could be a thing too. See the sketch after this list.)
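
As a rough sketch of that last point, hashing only the decoded pixel data (plus dimensions and pixel format) means metadata chunks and file attributes can never affect the cache key. This assumes the `png` crate for decoding; the function name is made up and `DefaultHasher` is only a stand-in for whatever hash the cache would really use:

```rust
// Requires the `png` crate (e.g. png = "0.17") in Cargo.toml.
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::path::Path;

fn pixel_hash(path: &Path) -> Result<u64, Box<dyn std::error::Error>> {
    let decoder = png::Decoder::new(File::open(path)?);
    let mut reader = decoder.read_info()?;
    let mut buf = vec![0; reader.output_buffer_size()];
    let info = reader.next_frame(&mut buf)?; // decodes the image data only
    let pixels = &buf[..info.buffer_size()];

    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    // Two files only count as "the same image" if the raw pixels *and* the way
    // they are laid out match; ancillary chunks (tEXt, tIME, ...) are ignored.
    (info.width, info.height).hash(&mut hasher);
    (info.color_type as u8, info.bit_depth as u8).hash(&mut hasher);
    pixels.hash(&mut hasher);
    Ok(hasher.finish())
}
```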

Regarding flags, these are the ones I've thought of so far (a rough sketch of how they could look follows below):

  • Disable the hash table entirely
  • Custom path for the hash table: this could also be relevant for privacy reasons, for example if you have customers and absolutely want to make sure nothing gets mixed up
  • Force same algorithm: with this flag, we only reuse an existing result if it was produced by the same algorithm we're currently using. Maybe the user doesn't want a free and better result for some reason 🤷

Maybe there could also be a "TTL", defined by either the Oxipng version or the entry's age. Maybe the user doesn't want to keep hashes from more than 2 years ago, or just wants to refresh the results after a given time, since the algorithms may have improved.
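
None of these flags exist in Oxipng today. Purely for illustration, and assuming `clap` for argument parsing, the proposed options (including the TTL idea) could look roughly like this; all names are invented:

```rust
// Requires clap = { version = "4", features = ["derive"] } in Cargo.toml.
use clap::Parser;

#[derive(Parser)]
struct CacheOptions {
    /// Disable the hash-table cache entirely.
    #[arg(long)]
    no_cache: bool,

    /// Store the hash table somewhere other than the OS temp folder,
    /// e.g. to keep different customers' data strictly separated.
    #[arg(long)]
    cache_path: Option<std::path::PathBuf>,

    /// Only reuse a cached result if it was produced with the same
    /// optimisation settings as the current run.
    #[arg(long)]
    cache_force_same_algorithm: bool,

    /// Ignore cache entries older than this many days ("TTL by age");
    /// entries written by a different version could be dropped too.
    #[arg(long, default_value_t = 730)]
    cache_max_age_days: u64,
}

fn main() {
    let opts = CacheOptions::parse();
    if opts.no_cache {
        println!("cache disabled");
    }
}
```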

Everything will depend on the implementation, though; I'm just sharing my thoughts. That said, I have little knowledge about how PNGs work behind the scenes, so I don't know if the technical aspects actually hold up. I'm fairly certain it would work if we hashed the entire files; everything else assumes PNGs work the way I think they do.

@andrews05
Collaborator

Hi @Maingron

Just so I understand, the main distinction between what you're proposing and the discussions in #549 is that this will effectively handle duplicates of the same image, is that right? It seems extremely similar, so I would prefer if you could add your ideas to the existing issue and I can close this as a duplicate.

@Maingron
Author

Maingron commented Oct 16, 2023

@andrews05 I don't think it's too similar.
As I understand it, #549 just keeps track of which images (or actually paths) it has already processed and doesn't touch them again, while this proposal checks whether we've already worked on such an image in the past and, if so, re-uses what we can and works from there. We're still touching each and every image.

Best example I can think of at the moment: if we have a folder with 10000 images and the process gets interrupted at image 9500, we would start again at image 1. #549 would start at image 9501.

So basically we index everything for future reference while #549 just doesn't want to touch anything twice.

Another difference is that we would notice if anything about an image changed, while #549 wouldn't, since it has already worked on it and doesn't bother again. #549 would also only care about the path itself and not about the actual data.

@andrews05
Collaborator

I'm trying to understand this at a conceptual level rather than an implementation level. Can you describe at a high level what you're actually wanting to achieve? What is your use case?

@Maingron
Author

@andrews05 Well, let's assume I'm some non-technical user and make up some story:

It would be cool if Oxipng worked much faster. Oftentimes I have the same image in multiple projects I manage, and it really can hinder my workflow when I have to wait for the same image to be processed in multiple directories. It should be possible for Oxipng to remember that it has already processed this image, just in another directory, and automatically copy the result.

Also, once a week I get some data to manage, almost like a backup, which I always run through Oxipng before doing anything else with it. Usually the data doesn't change much, so it would go so much faster if Oxipng remembered the work from last week. I can't manually copy over last week's processed images, however, because I need to avoid overwriting data that has actually changed. Also, there are just way too many sub-directories.

@andrews05
Collaborator

andrews05 commented Nov 8, 2023

Right, thanks for the user story!
To try to summarise this and #549: Both follow a similar theme of avoiding redundant optimisation of images that have already been processed once before. The difference is that this one is about repeat copies of the same image, while #549 is about repeat processing of the same file. Do you think that's accurate?

Despite serving slightly different use cases, I still think they're conceptually related and it would be great if we could collect all these ideas together in the same topic.

@andrews05
Collaborator

Closing in favour of #549
