Releases: unreadablewxy/fs-curator
0.4.0
A recent libmagic regression (seen on Gentoo and Arch) appears to be causing webm files to be incorrectly identified. If you have any in your mono-collection, it might be a good time to ask for a patrolling read against your `by-id` index.
Have received some complaints that the *nix binaries are built against WAY too new a glibc, so they will now be built on the latest release of Debian instead of bleeding-edge Gentoo.
Breaking Changes
- Risk: moderate. Deprecated `source_*` parameters have been dropped
  - This affects qualifier expressions of all stages of the pipeline
  - This also affects transform argument generation
- Risk: moderate. Store qualifiers and path generation no longer bind `file_*` attributes (except for `file_extension`)
  - Offering files to stores is a self-contained process. Hoppers can be configured to auto-invoke this process after certain files are ingested, but should not change said process. Conveying extra information when auto-invoked by hoppers runs contrary to this design
  - If we need per-file attributes, let's design it properly as opposed to hacking pieces of it onto two colocated features
New Features
- Added inline named capture groups support for regex
  - Realized through the PCRE2 library
  - Yes, these are still applied at a lower precedence to named constants
  - Yes, this means we now support match specific group attributes
- Regex qualifiers now support minimum match length thresholds
  - The new value for the include config directive is `PROPERTY /EXPRESSION/FLAGS THRESHOLD`
  - eg: require the expression match at least 50% of the value
    `include = x /\d+/ 50%`
  - eg: require the expression match at least 12 characters
    `include = x /\d+/ 12`
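To make the threshold semantics concrete, here is a minimal sketch of how a match length check could work. It is an illustration only: it uses `std::regex` (ECMAScript syntax) rather than PCRE2, and the `Threshold` struct and `meets_threshold` function are hypothetical names, not part of fs-curator. It also assumes the percentage is measured as matched characters over the total length of the value.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical threshold type: either a percentage of the value's length
// or an absolute character count. Not fs-curator's actual data model.
struct Threshold {
    std::size_t amount;
    bool is_percent;
};

// Returns true when the first match of `expr` in `value` is long enough.
bool meets_threshold(const std::string& value, const std::regex& expr, Threshold t) {
    std::smatch m;
    if (!std::regex_search(value, m, expr)) return false;
    const std::size_t matched = static_cast<std::size_t>(m.length(0));
    if (t.is_percent) {
        // matched / value.size() >= amount / 100, kept in integer arithmetic
        return matched * 100 >= t.amount * value.size();
    }
    return matched >= t.amount;
}
```

Under these assumptions, `include = x /\d+/ 50%` would correspond to `meets_threshold(x, std::regex(R"(\d+)"), {50, true})`.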
Behavior Changes
- Workflows resumed through WIP files now bypass hopper evaluation
  - WIP files now contain group attributes as well as workflow parameters, allowing manual touch ups
- Store qualifiers and path generation now bind `file_extension` from the file identification process instead of copying it verbatim from the imported file's path
- Order assignment now sorts all files by length then character codes
  - This ensures semantically correct order for variable length numbers in file names: 0, 1, 2, 3, 10, 11 (without length factoring the order would be 0, 1, 10, 11, 2, 3)
  - Another happy coincidence is this tends to cluster together similarly named files
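The length-then-character-codes ordering described above can be sketched as a plain comparator. The function names here are hypothetical and this is not the project's actual code, just a small illustration of why shorter names sorting first puts 2 before 10.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Shorter names sort first; names of equal length fall back to plain
// character-code comparison.
bool length_then_codes(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;
}

// Returns the names in the order they would be assigned under this rule.
std::vector<std::string> assign_order(std::vector<std::string> names) {
    std::sort(names.begin(), names.end(), length_then_codes);
    return names;
}
```

Note that this only matches numeric order when the numbers are not zero padded and share the same surrounding text.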
Performance
- Removed extraneous memory allocations from INI parsing
- Removed unnecessary memory allocations for attribute matching at the cost of a bit of short lived heap fragmentation
- Time complexity of matching files has been improved from `m log(n)` to `m + n`
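The notes do not say how the `m + n` bound is reached. One common way to move from `m log(n)` to `m + n` is to replace a binary search per item with a single linear merge over two sorted sequences. The sketch below shows that general technique under that assumption; `match_sorted` and its parameters are hypothetical and may not resemble what fs-curator does internally.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Both inputs must already be sorted ascending.
// Each loop iteration advances at least one cursor, so the loop runs at most
// wanted.size() + present.size() times.
std::vector<std::string> match_sorted(const std::vector<std::string>& wanted,
                                      const std::vector<std::string>& present) {
    std::vector<std::string> hits;
    std::size_t i = 0, j = 0;
    while (i < wanted.size() && j < present.size()) {
        if (wanted[i] < present[j]) {
            ++i;  // wanted[i] cannot appear later in `present`
        } else if (present[j] < wanted[i]) {
            ++j;  // present[j] is not wanted
        } else {
            hits.push_back(wanted[i]);
            ++i;
            ++j;
        }
    }
    return hits;
}
```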
Bug Fixes
- Reduced FFMPEG warning spam when dealing with JPEG files
  - A side effect of this change is that phash has started producing slightly different results
  - So do not be alarmed if you see a lot of phash corrections while patrolling `by-id`
0.3.0
Project now exceeds 8K lines of C++20 🎉
Breaking Changes
- Risk: minimal. PHash querying command is incompatible with previous versions and will randomly fail if used with them
- Risk: minimal. Thumbnail storage in the mono-collection has been redesigned and moved to `cache/thumbnail`. Existing deployments should delete and regenerate their `thumbnails` directory to reclaim otherwise wasted space
- Risk: minimal. Hopper constants are now applied at a lower priority than NCGs, allowing them to serve as fallback default values
New Features
- Added JPEG thumbnails support with configurable quality
Performance
- Added PHash index caching, stored in `cache/phash`, to reduce cold start delays for those with 100K+ collections
  - This cache is invalidated based on directory modification times and will be ineffective for those who have disabled them in their filesystems (if you need to ask, you haven't)
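As a rough illustration of modification-time based invalidation, the sketch below compares a directory's current modification time against a timestamp recorded when the cache was written. The function name and the idea of storing the timestamp next to the cache are assumptions made for this example, not a description of the actual cache format.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// `recorded` is the directory modification time saved when the cache was
// written (hypothetical layout, not fs-curator's file format).
bool cache_is_fresh(const fs::path& dir, fs::file_time_type recorded) {
    std::error_code ec;
    const auto current = fs::last_write_time(dir, ec);
    if (ec) return false;  // unreadable or missing directory: treat as stale
    return current == recorded;
}
```

If the filesystem never updates directory modification times, the two timestamps always compare equal and a check like this cannot notice that files were added or removed.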
Behavioral Changes
- Thumbnails are henceforth regarded as ephemeral data and will be overwritten automatically when offered to stores
  - This is done via delete-then-link, so there's no risk of corrupting the mono-collection. But this could still clobber files that are not linked by this service, so please make sure your workflow is not affected before upgrading
- File importing will now disregard singular 0 byte files, which browsers sometimes leave behind
- Successfully imported directories will now be auto deleted
- Removed the necessity to specify a store in hoppers, making import only hoppers possible
Bug Fixes
- Fixed a bug where if a file is projected into two directories with differing thumbnail requirements only one will win over the other
- Fixed an error contextualization bug that caused a lot of errors to be mistranslated as "unknown"
- Fixed an IO bug that caused the thumbnailer to fail on some GIF files
- Fixed cli side segfaults from not enforcing argument count requirements
- Added workaround for ffmpeg bug #8747
Dependencies
For linux users, please install: libmagic1.5+, libffmpeg4.3+ (LGPL), libopencv4.5+
0.2.0
New Features
- Windows support 🎉
- Management socket location is now configurable via the `FS_MGMT_SOCKET` environment variable as well as the config file
  - On Windows defaults to `%APPDATA%\fs-curator\socket`
  - On *nix defaults to `/run/fs-curator/socket`
  - Service mode will try to create the parent path of the management socket
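A minimal sketch of how the socket location might be resolved from these sources. The precedence shown here (environment variable, then config file, then platform default) is an assumption, since the notes do not say which of the first two wins, and both function names are hypothetical.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Assumed precedence: environment variable, then config file, then the
// platform default. The release notes do not state which of the first two wins.
std::string resolve_socket_path(const char* env_value,
                                const std::optional<std::string>& config_value) {
    if (env_value != nullptr && *env_value != '\0') return env_value;
    if (config_value && !config_value->empty()) return *config_value;
#ifdef _WIN32
    return R"(%APPDATA%\fs-curator\socket)";  // left unexpanded in this sketch
#else
    return "/run/fs-curator/socket";
#endif
}

// Convenience wrapper that reads FS_MGMT_SOCKET from the real environment.
std::string socket_path_from_environment(const std::optional<std::string>& config_value) {
    return resolve_socket_path(std::getenv("FS_MGMT_SOCKET"), config_value);
}
```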
Performance
- Added codec caching in the thumbnailer
- Switched to IO buffers sized as a multiple of both modern disk sectors & typical OS memory pages for better memory & IO efficiency
- Removed some unnecessary memory allocations when reading attributes
- PHash queries now retrieve top 3 instead of 5 most similar images unless otherwise specified
- This is the performance sweet spot for 10K+ collections
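For the IO buffer sizing item above, one way to pick a size that is a multiple of both the sector size and the page size is to round up to their least common multiple. This is a sketch of that arithmetic only; the function name and the example sizes are assumptions, not values taken from the project.

```cpp
#include <cstddef>
#include <numeric>

// Round `wanted` up to the nearest multiple of both the sector size and the
// page size. All three inputs must be non-zero.
std::size_t aligned_buffer_size(std::size_t wanted, std::size_t sector, std::size_t page) {
    const std::size_t unit = std::lcm(sector, page);
    return ((wanted + unit - 1) / unit) * unit;
}
```

With 512 byte sectors and 4096 byte pages the unit is simply 4096, so most requests end up page aligned.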
Behavior Changes
- Temporary directories generated by transforms will now be destroyed
- Empty directories, even those matching hopper qualifiers will now be disregarded to avoid infinite loops
Bug Fixes
- Fixed a rare crash that occurs when merging more than 2 groups
- Fixed a crash that occurs when the thumbnailer fails to open a file
- Program will no longer start if config file doesn't exist
- Files that fail to be identified will now be assigned the ".bin" extension instead of causing crashes
Dependencies
For linux users, please install: libmagic1.5+, libffmpeg4.1+ (LGPL), libopencv3.2+
0.1.2
New Features
- WIP files will now indicate the group & index of the most similar file for PHash conflicts
- Added `file_name`, `file_stem`, and `file_extension` as testable properties
- Added hopper defined constants
  - All `source_*` formatting fields & testable properties are now deprecated. See the relevant wiki article for rationale
- Added configuration for logging verbosity
Behavior Changes
- Group merging will now be done by a link-then-drop instead of one rename operation to keep rollback robust and simple
- Log indicating how many files are being ingested now correctly counts files that are being dropped, as they are technically "ingested" (into /dev/null)
Bug Fixes
- In perceptual hashing
  - Size limit (32MiB) is now applied consistently and clear errors are added for when it is exceeded.
  - Collisions resolved by the `combine` action will no longer drop the new file
- Data integrity issues encountered whilst scanning groups designated for merging will now correctly trigger rollbacks
- Fixed a rare heap corruption that occurs when generating thumbnails for multiple formats
- For ingested files, `file_*` properties at the hopper level will now correctly bind to their path instead of their parent directory
0.1.1
New Features
- Added perceptual hash based image similarity deduplication
- Added `ignore` conflict resolution action. Valid only for phash conflicts
- Added a command to query for perceptual hash similarity
- Added "crop to aspect" thumbnailing
- Added regenerate thumbnail command
Upgrade Advisory
After first run of this release please run the following:
curator --patrol by-id
to ensure any existing files are properly assigned their perceptual hashes.
0.1.0
Breaking Changes
Risk: Low, chance of losing ordering and grouping meta-data if migration fails.
- The `by-order` index now uses directories to represent groups.
  - Existing collections should auto-migrate on first run.
  - Once migrated, do not run older versions. Possible crash risk.
- Checksum collisions will no longer be reported via renaming
Upgrade Advisory
After first run of this release please run the following:
curator --patrol by-id
to ensure newly required attributes are assigned to all files.
curator --patrol by-order
to ensure your collection doesn't have any 0 indexed groups or files.
Security Advisory
- In this release, the service will begin accepting Unix Domain Socket connections to accept commands. Processes residing on the same system may be able to issue commands to the daemon, so please ensure the permissions configured for the daemon's socket at `/run/fs-curator/socket` align with your security goals.
- If incorrectly configured, transforms may become an attack vector for malicious insiders to perform elevation & arbitrary code execution attacks.
New Features
- Added `{file_extension}` as a valid store path format field.
- Added work-in-progress file based collision reporting
  - Append `.continue` to WIP file's name to continue with import
- Added conflict resolving actions: `combine` and `drop`
- Added Unix domain sockets support for issuing control commands
- Not to be confused with IP networking.
- Protocol not finalized, use at own risk.
- Added a command to re-offer groups to stores.
curator -o | --offer GROUP [GROUP_RANGE_END] STORE_NAME
- Added Xattrs saving support for groups
- Configured in hopper scope, applied immediately after the files are imported
save = PROPERTY_NAME
- Any attrs on the file can be used in store path expressions the same way as capture groups
- Attrs that don't exist but are referenced anyway result in failure & rollback
- These are reloaded when re-offering files to stores
Performance
- Reduced IO during startup. The `by-order` index directory will only be scanned if the cached value for next group ID is missing.
- Removed random patrolling read on startup
- A full patrolling read must now be manually requested.
curator -p|--patrol by-id | by-order
- File ingestion is now top priority, thumbnailing & request processing happens after file processing completes.
- Thumbnailer will now link existing thumbnails instead of generating new ones whenever possible. It is recommended thumbnail paths do not include any file extensions. Doing so could make migrating to subsequent releases problematic.
Bug fixes
- Fixed a thumbnailer backoff bug that prevented it from running at full speed
- Fixed a crash that manifests rarely for GIF files between 60 and 130 KiB
0.0.0
Update README.md