Automated normalization and curation of media collections. Written in Python 3.x.
Curator is a collection of stateless CLI tools, following the Unix philosophy, to organize large collections of heterogeneous media. Each tool creates a plan made of tasks with clearly defined input and output files, which the user can optionally review before applying.
Install the package via:
pip install git+https://github.com/AlexAltea/curator.git
Acknowledgements to people who contributed code/ideas to the project:
- Victor Garcia Herrero: Mathematician, Machine Learning expert and tamer of scoring functions.
Curator can automatically rename and link media files, edit container metadata, and remux and merge streams. To reduce manual labor and achieve reliable results across different media from potentially different sources, some tools rely on signal processing and machine learning (e.g. Whisper, LangID).
Highlighted use cases (current and planned):
- Filter media by container and stream metadata (all).
- Rename files based on existing filenames (`curator-rename`).
- Merge streams from multiple related containers (`curator-merge`).
- Detect audio/subtitle language from sound and text data (`curator-tag`).
- Rename files based on existing metadata and databases (`curator-rename`).
- Synchronize audio/subtitle streams (`curator-merge` and `curator-sync`).
- Remove scene banners from subtitles (`curator-clean`).
- Detect watermarks in video streams (`curator-clean` and `curator-merge`).
- Select highest quality audio/video streams (`curator-merge`).
Below you can find a description and examples of all tools provided by Curator:
flowchart LR
Convert --> Merge --> Sync --> Tag --> Rename
Merges all streams with identical names into a single container, except for:
- Video streams, if one already exists.
- Audio streams, if one with the same `language` tag already exists.
Requires all video containers to be MKV.
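The merge rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Curator's actual implementation: the `Stream` type and the `select_streams` helper are hypothetical names, and the sketch assumes that stream types not mentioned above (e.g. subtitles) are always kept.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stream:
    """Hypothetical stream descriptor, not part of Curator's API."""
    kind: str                       # "video", "audio" or "subtitle"
    language: Optional[str] = None  # language tag, e.g. "eng"


def select_streams(existing: List[Stream], candidates: List[Stream]) -> List[Stream]:
    """Return the candidate streams that would be added to the output container."""
    merged = list(existing)
    added = []
    for stream in candidates:
        # Skip video streams if the container already has one.
        if stream.kind == "video" and any(s.kind == "video" for s in merged):
            continue
        # Skip audio streams whose language tag is already present.
        if stream.kind == "audio" and any(
            s.kind == "audio" and s.language == stream.language for s in merged
        ):
            continue
        # Assumption: every other stream is kept.
        merged.append(stream)
        added.append(stream)
    return added


base = [Stream("video"), Stream("audio", "eng")]
extra = [
    Stream("video"),
    Stream("audio", "eng"),
    Stream("audio", "spa"),
    Stream("subtitle", "spa"),
]
added = select_streams(base, extra)
print(added)
```

In this example only the Spanish audio and the Spanish subtitle streams are selected, since the base container already holds a video stream and an English audio stream.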
Update filenames according to a pattern made of the following variables:
| Key | Description |
|---|---|
| `@ext` | File extension of the input media. |
| `@dbid` | When using a database, the ID of the match, e.g. `imdbid-tt12345678`. |
| `@name` | Localized name of the media. |
| `@oname` | Original name of the media (needs database). |
| `@tags` | Tags present in the input media filename enclosed by square brackets, if any. |
| `@year` | Year the media was released. |
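To illustrate how such a pattern might be expanded, here is a minimal sketch. The `expand_pattern` helper, the example pattern `@name (@year).@ext` and the sample values are all assumptions made for illustration; in particular, whether `@ext` includes the leading dot is not specified above, and the sketch assumes it does not.

```python
import re


def expand_pattern(pattern: str, values: dict) -> str:
    """Replace each @key in `pattern` with values[key].

    Hypothetical helper: keys missing from `values` are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    # Matching whole keys avoids confusing @name with @oname.
    return re.sub(r"@([a-z]+)", substitute, pattern)


# Sample values, assumed for illustration only.
values = {"name": "Example Movie", "year": 1999, "ext": "mkv"}
new_name = expand_pattern("@name (@year).@ext", values)
print(new_name)  # Example Movie (1999).mkv
```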
Synchronize streams via data cross-correlation.
Every synchronization task involves (A) a reference stream, and (B) the stream we want to synchronize. We write this relationship as A ← B. Curator can only handle the following types of synchronization tasks:
- Video ← Audio: Comparing lip movement timestamps with ASR timestamps.
- Audio ← Audio: Comparing sound data.
- Audio ← Subtitle: Comparing ASR timestamps with uniquely matching text timestamps.
- Subtitle ← Subtitle: Comparing text timestamps.
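The idea behind cross-correlation can be shown with a small brute-force sketch: slide stream B against reference A and keep the offset where the two signals overlap best. The `best_lag` function below is a hypothetical helper working on plain lists of samples; it is not Curator's implementation, which may extract different features and use a faster, e.g. FFT-based, correlation.

```python
def best_lag(reference, target, max_lag):
    """Return the delay (in samples) of `target` relative to `reference`
    that maximizes their cross-correlation, searching [-max_lag, max_lag]."""
    def score(lag):
        # Sum of products over the samples that overlap at this lag.
        return sum(
            reference[i] * target[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(target)
        )

    return max(range(-max_lag, max_lag + 1), key=score)


reference = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
target = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]  # same pulse, 3 samples later
lag = best_lag(reference, target, max_lag=5)
print(lag)  # 3
```

A positive result means B lags behind A, so B would have to be shifted earlier by that many samples to line up with the reference.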
The synchronization plan (`SyncPlan`) will create a tree of synchronization tasks (`SyncTask`) for every media file it processes. For example, with an input `Media("movie.mkv")` with streams: `#0` (video), `#1` (audio:eng), `#2` (audio:spa), `#3` (subtitle:eng), `#4` (subtitle:spa), it will generate the following sync proposals:

- `#0` ← `#1`
- `#1` ← `#2`
- `#1` ← `#3`
- `#3` ← `#4`
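One simple rule that reproduces the proposals in this example is sketched below. It is an assumption, not the documented behavior of `SyncPlan`: the first audio stream is synchronized to the video, the remaining audio streams and the first subtitle are synchronized to the first audio stream, and the remaining subtitles are synchronized to the first subtitle. The `propose_syncs` function is a hypothetical name, and the real selection logic may differ (e.g. by taking language tags into account).

```python
def propose_syncs(streams):
    """Build (reference, target) index pairs from a list of (index, kind) tuples.

    Hypothetical rule chosen to match the example above.
    """
    video = [i for i, kind in streams if kind == "video"]
    audio = [i for i, kind in streams if kind == "audio"]
    subs = [i for i, kind in streams if kind == "subtitle"]

    proposals = []
    if video and audio:
        proposals.append((video[0], audio[0]))              # Video <- Audio
    proposals += [(audio[0], other) for other in audio[1:]]  # Audio <- Audio
    if audio and subs:
        proposals.append((audio[0], subs[0]))               # Audio <- Subtitle
    proposals += [(subs[0], other) for other in subs[1:]]    # Subtitle <- Subtitle
    return proposals


streams = [(0, "video"), (1, "audio"), (2, "audio"), (3, "subtitle"), (4, "subtitle")]
proposals = propose_syncs(streams)
for reference, target in proposals:
    print(f"#{reference} <- #{target}")
```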