Quality-based Trumping during import #116

ghost · 2013-02-28T19:42:24Z

This issue was automatically migrated from Google Code.
Original author: telejes...@gmail.com (February 16, 2012 01:16:19)
Original issue: google-code-export/beets#341

hchapman · 2013-04-03T16:47:37Z

This would be superb. In my spare time I was going to try to write something like this.

sampsyo · 2013-04-03T16:51:25Z

Awesome! Now that we have a central place in the importer workflow where replacement decisions are made, this should be relatively straightforward to implement -- it's just a matter of coming up with the right design for how it should work.

On Wed, Apr 3, 2013 at 9:47 AM, Harrison Chapman notifications@github.com
wrote:

This would be superb. In my spare time I was going to try to write something like this.

Reply to this email directly or view it on GitHub:
#116 (comment)

hchapman · 2013-04-07T17:23:27Z

This is my first working with beets, so mind how naive I am.

I've written a really silly (and still crufty) plugin which tries to use eyed3 to read the Xing header of the mp3 to determine the encoding preset (e.g. v0, cbr 320) and adds a $quality name template variable. It'd be nice to know the presets because if I were quality trumping, I'd personally take v0 over a 320, but a 320 over a v2, and I do not believe the ABR is enough information to do this reliably always.

I'd like to make this into a plugin, since right now it depends on eyed3, which I do not believe beets presently does. Any chance that the plugin docs are out of date, and that the database is now extensible by plugins? It would be nice to be able to beets ls quality:v0 or similar to help sort through the library without having to inspect every single mp3, and I don't if there is another way to implement this type of behavior.

I haven't given any look into modifying the replacement decisions via a plugin yet: my brief scan of the plugin docs didn't seem to include much information on this. I'll have to scan the source.

mister-walter · 2013-04-07T18:29:32Z

Check out this thread on the Google Groups forum for information on storing arbitrary data in the beets library. Sampsyo indicated that such a feature (flexattrs) might be coming in beets version 1.2.

sampsyo · 2013-04-08T02:32:42Z

Very cool! Nice work, and not silly at all. 🐤 🌟 I really like this approach.

Regarding the dependency on eyed3: yes, we'll need to replace that with an equivalent parser for Mutagen/MediaFile. Do you know how eyed3 determines the LAME preset? Is it by parsing the "encoder" string? If so, then we can re-implement that in a fairly straightforward way.

As @mister-walter mentioned, yes, we're currently in the middle of building support for arbitrary metadata fields in the database. See also #101 for the current status of this feature. I'm planning to make it the first big feature in 1.2, which will start development in earnest soon when 1.1 is out the door (very nearly ready). This is yet another good use case for that feature.

hchapman · 2013-04-08T02:38:37Z

I think eyed3 uses the xing header which is in the blank first frame of the mp3. Encoders like LAME use it, and possibly some others as well.

I don't know how eyed3 uses this data to determine/guess the encoding preset! It's open source though...

sampsyo · 2013-04-08T07:05:48Z

Cool. That's worth investigating to see how complex the logic involved is.

hchapman · 2013-04-08T17:20:44Z

By lifting the LameHeader class and three utility functions from eyed3, I've added super rudimentary LAME header support to mutagen in this branch (super rudimentary means that it just tries to find the LAME header, and spits it out while it opens an mp3 file, with no actual functionality). It's understandable that this code isn't in mutagen, which doesn't seem to be concerned with this extra data (especially since it may or not have an equivalent in any other type of media file), but it was interesting to see how it's certainly not impossible to add it.

To test:

import mutagen.mp3
mutagen.mp3.Open(MP3_FILE_NAME)

sampsyo · 2013-04-08T19:01:45Z

Cool; thanks. Looks like this header is actually substantially more complex than I originally imagined—the Xing header is yet another arcane binary format. FWIW, there's discussion (years old now) on the Mutagen bug tracker for including similar functionality:
http://code.google.com/p/mutagen/issues/detail?id=66

There's a patch but it needs some cleaning up. Maybe the best course of action here would be to pick up that patch, which seems to have been abandoned by the original author, clean it up, add tests, and land it in Mutagen for the next release. (The Mutagen maintainers have been pretty receptive lately.)

Who knew this little bit of data would be so onerous to recover?

hchapman · 2013-04-08T19:33:49Z

That explains a lot; I didn't even consider mutagen wouldn't be hosted on github, and I didn't think to look at googlecode/bitbucket. I'll give that a look over.

hchapman · 2013-04-11T17:55:38Z

The patch applies to present master branch! It works for one of my songs, but none of the test files included with mutagen at present have any preset data :(.

hchapman · 2013-04-15T16:23:20Z

I've put the functionality merged into today's master/default (whatever it's called with hg!) with testing on my Bitbucket fork of mutagen. I've also told the mutagen issues tracker. I haven't heard anything on the issue yet, though.

sampsyo · 2013-04-15T16:32:31Z

Awesome, thanks for taking action! Let's see what the authors think; if necessary, we can email the Quod Libet mailing list in a few days.

hchapman · 2013-04-17T16:59:33Z

Well- it does not seem like there is any "way" for a plugin to, at present, influence the importer in checking for duplicates, or even to modify the manual duplicate resolver in the UI (e.g. make it say "we have this album as MP3 in the database, but you're importing a FLAC"). Correct me if I'm wrong! I'm going to think about how to add this broad functionality in.

sampsyo · 2013-04-17T17:06:54Z

That's correct; the hook doesn't exist yet. We'll need to add it in the resolve_duplicate method of ImportSession in importer.py.

hchapman · 2013-04-17T17:13:42Z

Ideally the plugin (at least one which knows the user's preference to replace MP3 with FLAC) would be UI-agnostic, and as of now resolve_duplicate just throws a NotImplementedError (and the implementation in ui doesn't call the super method as of yet). My initial thought was to add the hook right before resolve_duplicate is called.

I'm not used to thinking this much about software infrastructure. It's fun! Please let me know if the issues thread is not an appropriate place for my musing...

sampsyo · 2013-04-17T17:17:49Z

Yes, absolutely—you're right that invoking the plugin before the call to resolve_duplicate makes more sense. That way, subclasses can override that method without replacing this new functionality. Good call.

And yes, absolutely, this issue tracker is the perfect place for this kind of discussion! To me, this is more organized and therefore useful than the mailing list or ad-hoc emails... there's some hope of coming back to the discussion later when we need to remember the decisions we made.

hchapman · 2013-04-23T16:11:51Z

Moral conundrum: I'm not used to parallel programming so... It'd be nice to add task duplicate data into Importer task, so that plugins could add or remove duplicates automatically by some score calculation... It's easy to me to just carry along duplicate data as, say, a list, rather than calling _check_duplicates twice. But this crumbles under the completely possible scenario where a user has e.g. FLAC and MP3 in the import query for the same album, and the user would like the FLAC to auto trump the MP3. But it's not clear to me that we can just reliably hold on to any ad-hoc album data added to recents in user_query and expect it to still be correct by the time the better quality comes through...

I originally hoped this wouldn't need any sweeping infrastructure changes. Just saying where I'm stumped, and how the coroutine solution is pretty brilliant and still magic to me..

sampsyo · 2013-04-23T17:03:23Z

Yes, the recent check is a complicated wrinkle in all of this.

One simplification that may help: all tasks hit the database in order. That means that, in the apply_choices coroutine, duplicate albums are already guaranteed to be in the database, even if they were just imported. That means that if the plugin can make all of its decisions there, without user input, then its job is simplified.

So a plugin could prompt the user (or not) based on the limited data available in recent during user_query and then do the heavyweight comparisons later, in apply_choices. Would that make sense?

hchapman · 2013-04-24T01:26:57Z

Here's a branch with a preliminary solution. I've moved everything dealing with duplicates (including the user query...) to the apply_changes. It's pretty much entirely untested, and I'm presently not really happy with the implementation.

This beetsplug uses it and it also uses my branch of mutagen (about which I've still heard nothing, although I've still only just posted on the googlecode issue) to get preset data. It's a rudimentary quality trumper which prefers FLAC > V0 > 320 > [etc...]

hchapman · 2013-04-24T14:37:26Z

I've updated my branch so that all tests pass, but I haven't written any new tests. Curious question: there was an issue in _duplicate_check where cur_artist threw an AttributeError... But the same never happened in the item counterpart...

hchapman · 2013-04-24T15:15:04Z

It just hit me why moving a user query for the duplicate check is not a good solution. I'm also trying to work in a score system for duplicate replacement calculation instead of a binary take-it-or-leave-it

sampsyo · 2013-04-24T17:36:14Z

Great; I'm glad that makes sense. Yes, all communication with the user has to be done in the same pipeline stage.

hchapman · 2013-04-24T20:33:42Z

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.

sampsyo · 2013-04-24T21:09:25Z

Interesting approach! I like the change of recent from a set to a dict so we keep around the album itself rather than just its identifier. We should definitely use this (but ensure that it doesn't leak memory...).

And the scoring system is interesting too, but I can also see a more specific interface (which might ask "given these albums, which is the best?" instead of "what's the score for this album?") being somewhat simpler. It would push some of the intelligence into the plugins rather than burdening the import flow more, which is already frighteningly complex. Does that make sense?

On Apr 24, 2013, at 1:33 PM, Harrison Chapman notifications@github.com wrote:

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.

—
Reply to this email directly or view it on GitHub.

hchapman · 2013-04-24T21:21:22Z

The primary reason behind the score would be an (immutable, i.e. not changing after the tracks are imported into the library) number which wouldn't require holding on to any further data of items being imported. Perhaps plugins would be able to declare which properties about recent imports to hold on to? This doesn't really parallel anything in beets that I can think of at present however... Comparing relatively would involve holding on to some of this recent import information specifically, and since the changes haven't been applied yet, I'm not sure how messy this could get.

An example reason for this would be quality trumping-- right now quality trumping involves having the mutagen file info somewhere and I don't know for sure if the actual import/moving into library could cause some sort of rare race conditions when trying to grab/parse this information from tasks long finished.

Certainly with a score system the idea would be to present the user with a choice if the new import score doesn't exceed some threshold (which would ideally only happen if the new import completely eclipses an old; better quality, all the same tracks, same length, same release year...), perhaps with some delta information to explain to the user why we're scoring duplicate choices as such.

sampsyo · 2013-04-24T22:09:35Z

Scoring seems just fine as a general approach to this problem; I'm just questioning whether the logic that handles scores should be done in importer.py or in qualitytrump.py. If at all possible, it would be great to move as much as possible (e.g., finding the max, totaling up track_scores and album_scores, etc.) to plugins to simplify the importer itself as much as possible. Of course, it's possible there's no clean way to do that.

To avoid races, one approach you could take in the plugin is to calculate scores when import tasks are first created (handling the import_task_start event) so you have all scores available for later comparisons. You can keep an internal map from tasks to scores or just assign your scores right on the task objects (thanks, Python!).

telejester · 2013-05-10T21:05:38Z

Trump attachment processing is a bit less open-ended by adding the semantic framework of #111. In my original ticket for Trumping on gcode, I mentioned image handling and cue and log files.

In that vein, I just added some more specific ideas to the attachment ticket (gcode 109). In the name of redundancy prevention, please see https://code.google.com/p/beets/issues/detail?id=109#c21

Trump processing needs to consider the attachments of each candidate album and item as they are moved. Images ought to be trumpable as well, with higher-res images replacing lower-res versions of the same image. I pointed at a resolution and scale-independent image comparison library which ought to help a lot in determining "sameness" although I can't vouch for how well it works in real life.

Cue and log files are interesting in that they are only valid if an album consists of the complete set of items they refer to. Any item or album trumping means the cue and log have to go, even if the new version doesn't have replacements.

telejester · 2013-05-10T21:08:06Z

I also have no idea what to do with assets which have been trumped. I suppose the options are "delete" (which I find terrifying) and "mark and move", but to where? A "discard" path?

hchapman · 2013-05-14T20:40:13Z

beets already seems to be content skipping/removing albums at user query, so perhaps the trumping could (possibly unless the user informs otherwise) provide a summary of/query for the destructive actions it will take at the end of an import? I am not entirely clear if this fits into the structure/morals of the project however.

squisher · 2013-11-16T19:06:05Z

Ping, how come this patch is collecting dust? :-) I need something like this before I can attempt to actually put beets to use...

sampsyo · 2013-11-18T01:14:33Z

There are a few outstanding issues awaiting a champion:

Add LAME preset detection. This is actually orthogonal and should maybe have its own ticket.
Add logic for short-circuiting the duplicate detection. This should be in core. In its most basic form, a config option makes the decision.
Either in core or in a plugin, develop a flexible/complex set of rules for making decisions automatically. This can come after the basic work is done.

Please feel free to chip in on any of the steps.

daks · 2013-12-08T16:07:14Z

This issue (the original one on googlecode) resume exactly what I want. I'll track it.

johtso · 2014-03-05T01:31:17Z

This would be a very useful feature to have!

Alderian · 2014-03-10T13:43:56Z

Hi,

This would be very nice, I'm doing this manually with 'duplicates' plugin using "beet dup -F -f '$path $bitrate'"... and then a manual rm... very simple, but painful

It would be useful to have the list of duplicates any time an import finds them. I mean, program will find a file already in library and have the newcomer. What if the program just detail the tags, that are different, of the two of them, this way I could choose wisely. This could be done the same way the program tells whenever its correcting tags. Ie:

/foo/bar/Megadeth/Rust in Peace/06 Lucretia.mp3
Tagging track: Megadeth - Lucretia
(Similarity: 100.0%)
This item is already in the library!
Candidates:

Old (mp3, 128kbps)(encoder, bitrate)
New (mp3, 196Kbps)(encoder, bitrate)
1, 2 to keep only that file, **[S]**kip new, Keep all, Remove old?

I'm considering that I might want to have lots of versions of any file.

With that information, one can easily decide what to do... also, I guess that adding this would be easier, just to wait for the other options. I don't know how to program in python to add this functionality myself :( I might learn, fork and contribute :)

Just my 2 cents

Thanks for this great program...

franciscolourenco · 2014-11-16T15:07:01Z

I'm also interested in having V0 V2 320kb available to use in paths

franciscolourenco · 2014-11-16T18:10:49Z

Would it be possible to specify which identifiers are used to detect duplicates?
Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different.

sampsyo · 2014-11-16T19:32:53Z

@aristidesfl Sounds like a reasonable extension, yes.

Schweinepriester · 2015-06-16T14:52:12Z

push :)

lazka · 2015-08-13T09:28:36Z

mutagen trunk (next version) will now expose lame version and bitrate mode. Feedback welcome.

See https://bitbucket.org/lazka/mutagen/issues/66

sampsyo · 2015-08-13T16:30:05Z

Fantastic! Awesome work. I'll look into exposing this in beets.

ekjaker · 2015-10-09T09:37:53Z

Fantastic idea this addition to have this at import!

"Would it be possible to specify which identifiers are used to detect duplicates?
Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different."

+1 on that!!

And also, at import the option to set default actions would be very helpful, like always take new. Ofcourse, it this gets implemented to its full extent with a scoring system based on bitrate, availability of log/cue,... would be even better than a basic new/old option.

Joshfindit · 2015-11-30T06:10:07Z

Sorry to complicate this, but I have something to add:

Joint stereo vs Separate stereo.
Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

lazka · 2015-11-30T10:27:53Z

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

Why?

smlx · 2015-11-30T12:14:52Z

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

The mid-side stereo transform is lossless, and reduces redundant encoding of common information between channels.

Joshfindit · 2015-11-30T15:48:47Z

I notice a difference.

On Nov 30, 2015, at 8:15 AM, smlx notifications@github.com wrote:

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

The mid-side stereo transform is lossless, and reduces redundant encoding of common information between channels.

―
Reply to this email directly or view it on GitHub.

franciscolourenco · 2015-11-30T17:31:12Z

@Joshfindit did you double blind test / abx?

Joshfindit · 2015-11-30T22:15:46Z

Yes, many times over the years.

For those that are skeptical, I'll put these on the table:

at higher bitrates, I hear it much less. At 320, it might be one part of one song on an album, for example.
I always use headphones that go above 20k, and seek out clarity of sound (yes, I don't always use devices that output that high, but I do when I really want to enjoy the music)
I have hearing damage in the mid-vocal range. This may cause my brain to overcompensate with the higher frequencies. (Have not tested it, but it's a point for the skeptics, so there it is)
there is some social aspect to this as well; many tracks that I have had in the past have just been poorly ripped; the joint stereo is more of a canary in these cases

On Nov 30, 2015, at 1:31 PM, Aristides notifications@github.com wrote:

@Joshfindit did you double blind test / abx?

―
Reply to this email directly or view it on GitHub.

m8ram · 2016-02-06T16:27:07Z

I'm experimenting with beets for the first time and noticed that when I import a directory containing both ogg vorbis and MP3 files it imports only the MP3 files (exactly the opposite of what I want).
Is there a configuration option to change this or do I need to wait for this feature to be implemented?

sampsyo mentioned this issue Dec 8, 2013

duplicate album: return information about file format and encoding #467

Closed

sampsyo mentioned this issue May 25, 2014

Importer: Resolve duplicate tracks in a single directory with different format/quality #791

Open

sampsyo mentioned this issue Jun 13, 2014

Show quality when prompting to import duplicate album #821

Closed

howardjones mentioned this issue Aug 18, 2014

Basic summary information for duplicates #900

Merged

untitaker mentioned this issue May 9, 2015

Quality-based Trumping during import google-code-export/beets#341

Open

ekjaker mentioned this issue Oct 15, 2015

import formatting hard to parse mentally #1593

Open

lazka mentioned this issue Jun 25, 2016

Lame VBR Preset quodlibet/mutagen#66

Closed

sampsyo mentioned this issue Feb 25, 2017

Skip user input on duplicate import decisions when files match exactly #2455

Open

RollingStar mentioned this issue Mar 21, 2017

Handle 0-byte (moved/deleted) albums differently on import dupe detection #2486

Closed

sampsyo mentioned this issue Dec 17, 2018

Is there a way to get VBR's to be imported as V0/2, rather than beets giving the average bit rate upon import? #3094

Closed

sampsyo mentioned this issue Sep 6, 2022

Add upgrade option for duplicates #4471

Open

3 tasks

Quality-based Trumping during import #116

Quality-based Trumping during import #116

Comments

ghost commented Feb 28, 2013

hchapman commented Apr 3, 2013

sampsyo commented Apr 3, 2013

This would be superb. In my spare time I was going to try to write something like this.

hchapman commented Apr 7, 2013

mister-walter commented Apr 7, 2013

sampsyo commented Apr 8, 2013

hchapman commented Apr 8, 2013

sampsyo commented Apr 8, 2013

hchapman commented Apr 8, 2013

sampsyo commented Apr 8, 2013

hchapman commented Apr 8, 2013

hchapman commented Apr 11, 2013

hchapman commented Apr 15, 2013

sampsyo commented Apr 15, 2013

hchapman commented Apr 17, 2013

sampsyo commented Apr 17, 2013

hchapman commented Apr 17, 2013

sampsyo commented Apr 17, 2013

hchapman commented Apr 23, 2013

sampsyo commented Apr 23, 2013

hchapman commented Apr 24, 2013

hchapman commented Apr 24, 2013

hchapman commented Apr 24, 2013

sampsyo commented Apr 24, 2013

hchapman commented Apr 24, 2013

sampsyo commented Apr 24, 2013

hchapman commented Apr 24, 2013

sampsyo commented Apr 24, 2013

telejester commented May 10, 2013

telejester commented May 10, 2013

hchapman commented May 14, 2013

squisher commented Nov 16, 2013

sampsyo commented Nov 18, 2013

daks commented Dec 8, 2013

johtso commented Mar 5, 2014

Alderian commented Mar 10, 2014

franciscolourenco commented Nov 16, 2014

franciscolourenco commented Nov 16, 2014

sampsyo commented Nov 16, 2014

Schweinepriester commented Jun 16, 2015

lazka commented Aug 13, 2015

sampsyo commented Aug 13, 2015

ekjaker commented Oct 9, 2015

Joshfindit commented Nov 30, 2015

lazka commented Nov 30, 2015

smlx commented Nov 30, 2015

Joshfindit commented Nov 30, 2015

franciscolourenco commented Nov 30, 2015

Joshfindit commented Nov 30, 2015

m8ram commented Feb 6, 2016