Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importer duplicate detection: configurable field set #1133

Open
pescuma opened this issue Dec 6, 2014 · 8 comments
Open

Importer duplicate detection: configurable field set #1133

pescuma opened this issue Dec 6, 2014 · 8 comments
Labels
feature features we would like to implement mediumsize

Comments

@pescuma
Copy link

pescuma commented Dec 6, 2014

Currently, beets uses only album and artist. I'd like to use year too. Maybe make it configurable?

(this discussion started in #1131)

@sampsyo sampsyo changed the title Duplicate detection: allow to specify fields Duplicate detection: configurable field set Dec 6, 2014
@sampsyo sampsyo added the feature features we would like to implement label Dec 6, 2014
@sampsyo
Copy link
Member

sampsyo commented Dec 6, 2014

Thanks for filing this. To summarize the other discussion, you wanted to add year so that albums with the same name could be considered non-duplicate.

@jonathanthomas83
Copy link

I have a similar issue where a single has the same title as an album for the same artist, in most cases the year will be the same too so it's great to see that you've widened the solution to configure fields without limitation, allows me to use albumtype, which will be ideal for this scenario.

@sampsyo sampsyo changed the title Duplicate detection: configurable field set Importer duplicate detection: configurable field set Jan 26, 2016
@Vrihub
Copy link
Contributor

Vrihub commented Feb 3, 2017

Any chance to see this implemented?

BTW, when importing in --singletons mode it seems that the duplicate detection code uses ony author and title, and this bug gets quite annoying. e.g. for jazz music it's common to have many version of the same song by the same artist.

@sampsyo
Copy link
Member

sampsyo commented Feb 3, 2017

I don't mean this to sound dismissive, but you can make the chances much higher by doing some work on it yourself! The current status is that we think this is a good idea and important; it's just a matter of finding someone with the time to implement it.

@jackwilsdon
Copy link
Member

I was looking into adding this but it turns out it might run a lot deeper than I initially thought. The importer uses beets.importer.ImportTask#chosen_ident (or beets.importer.SingletonImportTask#chosen_ident) to get the artist and album name (or title of track if single item) of the item being imported. For SingletonImportTask this isn't really a problem, as it uses the Item itself which has a year field.

The main issue seems to be with modifying how beets.importer.ImportTask#chosen_ident works, as it uses class instance variables which store the current artist and album for the import.

Both cur_artist and cur_album are populated in beets.importer.ImportTask#lookup_candidates, which gets them from beets.autotag.media.tag_album. This in turn gets them from beets.autotag.media.current_metadata which does provide the year in it's return value.

Now we could just fetch cur_year from likelies (the return value of beets.autotag.media.current_metadata) and return it alongside the others, however this seems like it might end up being a bit of a hack. The reason I'm saying this is that we end up using the cur_artist and cur_album to look up the album within beets.autotag.media.tag_album, using (what I believe to be) external sources such as MusicBrainz and plugins. If we don't use cur_year when looking up and just return it then it feels like a bit of a hack to me.

We can't even easily use cur_year when looking up the album, as it uses the candidates hook (see here) which in turn dispatches to plugins, meaning any changes made will be breaking.

I may have completely missed something simple here, but from my preliminary look it doesn't seem like this will be an easy thing to add 😢.

What do you think @sampsyo?

@sampsyo
Copy link
Member

sampsyo commented Feb 4, 2017

Hi, @jackwilsdon! It's great to hear from you.

You're right; this is a bit complicated. Here's the fundamental problem, I think: find_duplicates needs to have a notion of the "artist" and "album" for the newly-imported music. When the metadata actually comes from MusicBrainz or another metadata source, this isn't too hard—we can just read the information off of the AlbumInfo or TrackInfo object. This is reflected in the second branch of the if in both implementations of chosen_ident.

But when we have an as-is import, where the information comes from the current metadata on the files, things are more complicated. How do we get the "current artist" for a collection of tracks? We can only guess it by taking the majority. That's why there's the coupling with current_metadata, which is also used to determine the initial search criteria.

I think the right approach is probably to make current_metadata (and likelies) turn up more fields. Instead of just having a fixed pair cur_artist and cur_album, we could have a flexible dict full of all the fields we could possibly want to use for duplicate detection. Does that make sense?

@jackwilsdon
Copy link
Member

Hey @sampsyo, sorry for pretty much disappearing off the face of the earth recently! 😆

Does current_metadata not already turn up multiple fields? From the code it looks like it supports these fields:

fields = ['artist', 'album', 'albumartist', 'year', 'disctotal', 'mb_albumid', 'label', 'catalognum', 'country', 'media', 'albumdisambig']

Regarding your last point; would tag_album accept a list of fields to use when tagging? How would this work with the candidates hook?

@sampsyo
Copy link
Member

sampsyo commented Feb 5, 2017

Well I'll be damned; I totally forgot that current_metadata already did that! Of course, we use this for metadata matching. Good point!

Currently, tag_album returns cur_artist and cur_album, which come from current_metadata, along with the Proposal object. I think my proposal amounts to returning the entire dictionary of current metadata from that function. Then, in lookup_candidates in beets.importer, where we currently have:

self.cur_artist = artist
self.cur_album = album

we'll instead have:

self.cur_metadata = cur_metadata

to save that for later. Finally, chosen_ident can then use any fields from this to generate the identifier for each album.

Does that make any sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature features we would like to implement mediumsize
Projects
None yet
Development

No branches or pull requests

5 participants