Global function to hash string values #3679

Closed
AlttiRi opened this issue Feb 19, 2023 · 5 comments

Comments

@AlttiRi

AlttiRi commented Feb 19, 2023

It would be useful to have a global function to hash string values in order to use it in f-string templates.
This function should return a hex-string hash (md5, sha1 are more than enough).

Something like string_hash(content) -> "a1b234ff..." (string -> utf8 string bytes -> hash bytes -> hex string).
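The pipeline described above (string -> UTF-8 bytes -> hash bytes -> hex string) could be sketched in plain Python roughly as follows. `string_hash` is the hypothetical name from this proposal, not an existing gallery-dl function, and the `algorithm` parameter is an assumption added for illustration.

```python
import hashlib


def string_hash(content: str, algorithm: str = "md5") -> str:
    """Hypothetical helper: return the hex digest of a string."""
    data = content.encode("utf-8")          # string -> UTF-8 bytes
    digest = hashlib.new(algorithm, data)   # bytes -> hash object
    return digest.hexdigest()               # hash bytes -> hex string


print(string_hash("abc"))               # 900150983cd24fb0d6963f7d28e17f72
print(string_hash("abc", "sha1")[:10])  # a9993e3647
```

Truncating with `[:10]` keeps 40 bits of the digest, which is enough to tell apart the few revisions of a single post.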

It can be used in a metadata download archive when there is no information about whether a post was edited.

In this case you can create a postprocessor that fires the first time a post is seen, and after that only when the post has been edited (that is, when its content/description/caption/... property has changed).

A postprocessor could look something like this:

"postprocessors": [{
   "name": "metadata",
   "mode": "custom",
   "event": "post",
   "archive": "~/gallery-dl/gallery-dl-postprocessors.sqlite",
   "archive-prefix": "\fF {category}",
   "archive-format": "{id}{'_' + string_hash(content)[:10]}",
   "filename": "\fF {id}{'_' + string_hash(content)[:10]}.txt",
   "content-format": "{content}\n",
   "mtime": true
}]

Note: \fF enables f-string formatting, which allows regular Python code inside {}. The [:10] slice just limits the length of the hash string, since 10 characters are plenty.
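To illustrate why putting a content hash in the archive key makes the postprocessor run again after an edit, here is a minimal sketch. It models the archive as an in-memory set rather than an SQLite file, and `archive_key` and `should_run` are hypothetical names. It is a simplification of the idea, not how gallery-dl implements archives.

```python
import hashlib


def hash_sha1(text: str) -> str:
    """Hex SHA-1 digest of a string (stand-in for the proposed helper)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def archive_key(post: dict) -> str:
    # Mirrors the idea of an "{id}_{hash(content)[:10]}" archive format.
    return f"{post['id']}_{hash_sha1(post['content'])[:10]}"


def should_run(archive: set, post: dict) -> bool:
    """Return True (and record the key) only for unseen id+content pairs."""
    key = archive_key(post)
    if key in archive:
        return False
    archive.add(key)
    return True


archive = set()
post = {"id": 123, "content": "first version"}

print(should_run(archive, post))  # True: first time this post is seen
print(should_run(archive, post))  # False: content unchanged
post["content"] = "edited version"
print(should_run(archive, post))  # True: the hash, and so the key, changed
```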


This approach will work fine with Kemono and Pixiv, for example. (They provide no information about edits to text content. Mastodon, by comparison, does.)

mikf added a commit that referenced this issue Feb 22, 2023
@ClosedPort22
Contributor

ClosedPort22 commented Feb 25, 2023

This is also useful for detecting edits to text content on DeviantArt (e.g. journals and status updates). In this case it's not even necessary to use a postprocessor.

@mikf
Owner

mikf commented Feb 26, 2023

I added hash_md5() and hash_sha1() in 56039d2.
Is this good enough or did you have something else in mind?

@AlttiRi AlttiRi closed this as completed Feb 26, 2023
@AlttiRi
Author

AlttiRi commented Feb 26, 2023

It's strange, but it does not work: Fixed.

"postprocessors": [{
    "filter": "subcategory in ('artworks', 'work')",

    "name":  "metadata",
    "mode":  "custom",
    "event": "post",

    "archive": "~/gallery-dl/gallery-dl-pixiv-postprocessor.sqlite",
    "archive-prefix": "\fF {category}",
    "archive-format": "{id}_{hash_sha1(caption)[:10]}",

    "directory": "metadata",
    "filename":  "\fF [{category}] {user['id']}—{id}—{user['name']}—{date.strftime('%Y.%m.%d')}—{title}—{hash_sha1(caption)[:10]}.html",
    "content-format": "\fT ~/gallery-dl/templates/pixiv.html",

    "mtime": true
}]

pixiv.html:

<div id="{id}">
  <h4>
    <a href="https://www.pixiv.net/artworks/{id}">{title}</a> by <a href="https://www.pixiv.net/users/{user[id]}">{user[name]}</a>
  </h4>
  <div class="content">{caption}</div>
  <hr>
  <div>{user[id]}—{id}—{date:%Y.%m.%d %H:%M:%S}{frames[0][delay]:?<br>Ugoira delay: / ms/}</div>
  <hr>
  <div class="tags">{tags:?["/"]/J", "}</div>
  <hr>
</div>

pixiv: An unexpected error occurred: KeyError - ARTWORK_ID

Version 1.25.0-dev


The related note: Use -O archive= for runs without processor archives (#3565)


Edit: here is the fixed line:

"filename":  "\fF [{category}] {user['id']}—{id}—{user['name']}—{date.strftime('%Y.%m.%d')}—{title}—{hash_sha1(caption)[:10]}.html",

@AlttiRi AlttiRi reopened this Feb 26, 2023
@AlttiRi
Author

AlttiRi commented Feb 27, 2023

The only remaining bug is that a postprocessor's filename does not support \fF formatting (f-strings).

@mikf
Owner

mikf commented Feb 27, 2023

Found the problem.

{user[id]} only works in regular format strings.
In f-strings, you need to use {user['id']} etc. Otherwise id gets replaced with its value before it is used to access user[...].
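The difference can be reproduced in plain Python. The sketch below uses `str.format` for the regular case and `eval` of an f-string literal for the `\fF` case. Both helper names are hypothetical, and the `eval` approach is only a stand-in for whatever gallery-dl does internally.

```python
def render_regular(template: str, fields: dict) -> str:
    """Regular format string: `id` inside [] is the literal key 'id'."""
    return template.format(**fields)


def render_fstring(template: str, fields: dict) -> str:
    """f-string semantics: the text inside {} is evaluated as Python code.

    Illustration only; works for simple templates like the ones below.
    """
    return eval("f" + repr(template), dict(fields))


fields = {"id": 1001, "user": {"id": "42", "name": "alice"}}

print(render_regular("{user[id]}", fields))    # 42
print(render_fstring("{user['id']}", fields))  # 42

try:
    # Here `id` is a variable holding 1001, so this evaluates user[1001].
    render_fstring("{user[id]}", fields)
except KeyError as exc:
    print("KeyError:", exc)                    # KeyError: 1001
```

This matches the reported error, where the missing key was the artwork's ID rather than the string 'id'.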


On a side note: Archives now support setting PRAGMAs like journal_mode = WAL (762a689) and I kind of remember you suggesting something in that regard.
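For reference, this is what setting that PRAGMA looks like with Python's standard `sqlite3` module. It is a sketch of the underlying SQLite behaviour only. The helper name `set_journal_mode` and the file name are assumptions, and the gallery-dl option that exposes this is not shown here.

```python
import os
import sqlite3
import tempfile

# Journal modes accepted by SQLite's journal_mode PRAGMA.
ALLOWED_MODES = {"DELETE", "TRUNCATE", "PERSIST", "MEMORY", "WAL", "OFF"}


def set_journal_mode(conn: sqlite3.Connection, mode: str = "WAL") -> str:
    """Set the journal mode and return the mode SQLite reports back."""
    if mode.upper() not in ALLOWED_MODES:
        raise ValueError(f"unsupported journal mode: {mode!r}")
    # PRAGMA values cannot be bound as parameters, hence the whitelist above.
    row = conn.execute(f"PRAGMA journal_mode = {mode}").fetchone()
    return row[0]


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "archive.sqlite"))
    try:
        print(set_journal_mode(conn, "WAL"))
    finally:
        conn.close()
```

WAL only applies to file-backed databases; an in-memory database keeps its own journal mode regardless of the request.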
