Add support for multi-valued tags #505
Related to the genre-specific feature request in #119. See also some discussion in a half-baked pull request, #437. This will be a reasonably large task, involving changing the interface to MediaFile to support lists, making commensurate changes to the database abstraction, and sorting out the mapping for each of the file formats we support.
MusicBrainz now supports multiple values for artist as well. While keeping the names themselves as a list might be cumbersome (you'd need to keep track of the join phrases), supporting lists for the artist IDs should fit right into this use case.
Congratulations to @geigerzaehler for what was obviously quite a lot of work on PR #527. As I understand it, this laid the foundation for multi-value tags, specifically adding support for genre lists. I am very interested in seeing this support added for the artist tags as well, and I would be willing to spend some time on this myself. I have read through #527's thread and spent a small amount of time looking over those commits. I obviously have a lot of stuff to search through, and the information is definitely there for me to find, but I was hoping somebody more familiar with this could give me a few tips on some files/commits that I might want to begin with.
To clarify a bit about what I think @pprkut was talking about, MusicBrainz does support multiple artists in its schema. It also provides access to that through the API. Here are a few examples: http://musicbrainz.org/ws/2/release-group/38d1077e-c270-49a8-92ca-87e7eb8d9fe4?inc=artists So the information is there. The only concern I'd have about whether or not to implement this is that not every player supports multi-artist tags. Perhaps the best option would be to provide a config option to either use the multi-artist tags or join them as a single string (as it works now).
Hi, @Boehemyth—glad you're interested! Yes, the recent work in MediaFile is the first step toward making this work more generally. The next step—which is arguably even more challenging—is to cleanly extend the database abstraction to store multi-valued fields. All of this should be done in dbcore, and I have no idea how best to go about adding this feature set there. We should somehow allow every kind of field (fixed, flexible, and computed) to have a list value while still—for compatibility and simplicity where lists are less likely to be useful—providing a non-list version of the field. That is, it would be great if code could refer to either the single-value or the list form of a field. It's also not clear what the underlying SQLite schema should be. I'm open to suggestions if anyone has ideas. Thanks for volunteering!
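A minimal sketch of what that dual access could look like at the Python level (the class and property names here are hypothetical, not beets' actual API):

```python
# Hypothetical sketch, not beets' actual API: one stored list, two views.
class Item:
    def __init__(self, artists):
        self._artists = list(artists)

    @property
    def artists(self):
        """List access: every value."""
        return list(self._artists)

    @property
    def artist(self):
        """Single-value access for compatibility: first value or ''."""
        return self._artists[0] if self._artists else ''

item = Item(['Queen', 'David Bowie'])
assert item.artist == 'Queen'
assert item.artists == ['Queen', 'David Bowie']
```

The same pairing could presumably be generated mechanically for every field rather than written out by hand.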
I agree. MusicBrainz returns a separator for each artist except the last one. If we were to store these separators in a new field as a list, in the same order as the artists in the artist field, we could write a function to construct the string.
I'm not exactly sure how we should go about this either. I think our best bet is to store the list as a single string with a separator character for the individual elements. I think that's how some of the audio tags handle this anyhow. Maybe it makes sense to just stick with that? I'd have to look back at that PR to remember how it was determined to be handled. The icing on the cake is that we wouldn't need any changes to the current database schema. It may also be possible to store entire Python data structures as BLOBs in the database. I'd have to investigate that to be sure, but I personally like the other option better anyway.
The one limitation to the string idea is that it won't work for other SQLite datatypes. I don't think that will be a problem. From where I stand, there aren't any non-string fixed attributes that could benefit from a list. I'm not as sure about flexible or computed fields. I can't think of a case myself, but maybe somebody else can?
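For what it's worth, the separator-packing idea only round-trips safely if the separator is escaped inside values. A sketch, assuming ';' as the separator (none of this is existing beets code):

```python
# Sketch only: pack a list into one string with ';' as an assumed
# separator, escaping so values containing ';' or '\' round-trip.
SEP = ';'
ESC = '\\'

def pack(values):
    return SEP.join(
        v.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP) for v in values
    )

def unpack(packed):
    # Walk the string, honoring escapes. An empty list would need a
    # separate representation, since pack([]) == pack(['']) == ''.
    out, buf, chars = [], [], iter(packed)
    for ch in chars:
        if ch == ESC:
            buf.append(next(chars, ''))
        elif ch == SEP:
            out.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
    out.append(''.join(buf))
    return out

assert unpack(pack(['AC; DC', 'Back\\slash'])) == ['AC; DC', 'Back\\slash']
```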
The EchoNest API provides the raw analysis data they use in determining their higher-level attributes like danceability, etc. For example, they do beat detection, the result of which is simply an array of sample or frame indices with confidences or probabilities for each. Of course, this is a massive amount of data that I doubt anyone would be interested in keeping inside a beets database. Anyway, it's a (far-fetched) use case for lists of floats.
Here are two reasons for adding list-valued fields everywhere:
Querying would also be complicated by packing multiple values into a string. We'd have to split each value and compare against the query rather than just pattern-matching.
These are good points. Unfortunately, that means we have to do this the standard way, which means implementing new tables, which I was really hoping to avoid, because it really will be a pain. I'll use album-artists as an example. We'd have to create a new table that lists the album/albumartist relationships. Then for queries, we have to query the album on that table to get all of the artists. I know it sounds pretty round-about, but that method actually does work remarkably well. I do understand what you mean now about how much work this will end up being. The issue is that if we really want to make this work for EVERY field, things are going to get messy very quickly. There are some things we could do to help clean it up a bit. For instance, to use the artist example again, we could have an Artist table that stores relevant info about an artist. We'd still need an album-artist table and item-artist table, but at least now you only have two artist relationship tables instead of 6 (including artist_sort and mbartist_id). Also, I'm not entirely sure how that would work for flexible fields. The question is whether we can assume a one-to-many relationship. For instance, in the case of Attachments, each attachment is related to exactly one Album or Item, right? If that can be assumed about flexible fields, we could probably just create a new table for each flexible field, instead of adding the flexible field to the album or item table.
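The join-table design described here can be sketched with Python's sqlite3 module (the table and column names are illustrative, not beets' real schema):

```python
import sqlite3

# Toy sketch of the album/artist join table described above.
db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album_artists (
        album_id INTEGER REFERENCES albums(id),
        artist_id INTEGER REFERENCES artists(id),
        position INTEGER,  -- preserves artist ordering
        PRIMARY KEY (album_id, artist_id)
    );
""")
db.execute("INSERT INTO albums VALUES (1, 'Under Pressure')")
db.executemany("INSERT INTO artists VALUES (?, ?)",
               [(1, 'Queen'), (2, 'David Bowie')])
db.executemany("INSERT INTO album_artists VALUES (?, ?, ?)",
               [(1, 1, 0), (1, 2, 1)])

# Query all artists for one album, in order.
rows = db.execute("""
    SELECT ar.name FROM artists ar
    JOIN album_artists aa ON aa.artist_id = ar.id
    WHERE aa.album_id = ? ORDER BY aa.position
""", (1,)).fetchall()
assert [r[0] for r in rows] == ['Queen', 'David Bowie']
```

Sharing one artists table across the album-artist and item-artist relationships is what keeps the number of tables down.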
Perhaps we might be able to get away with not implementing lists for a few of the fixed fields that we know don't need this functionality. Dates, for instance? Or track/disc numbers?
I actually don't know whether I agree with this—I think creating one-off tables for a few fields where lists are important could be messier in the long run! I'm certain that, in the future, we'll think of new fields (existing and novel) that will need to be listified, and needing to retrofit each of these independently sounds a bit like a nightmare. For maintainability, it would be awesome to have a generic facility for list-valued fields that can be reused with zero (or near-zero) additional code. Ideally, this would work by devising an ingenious and efficient mechanism for adding lists to every field, whether we need it or not. This would make listification orthogonal to the field set (abstraction FTW!). I realize that this might be needlessly complex, though—in which case we can consider a design that can't provide lists for every field but needs only tiny code changes to support multiple values for a new field. (Stuff like dates could be excluded if that's harder for some reason, as you proposed.)
That's awesome, @Boehemyth. Personally, I think there is not that much work left in the MediaFile API, apart from refactoring, refining, and documenting. Adding support for lists in tags was motivated by the native ability of most tagging formats to store those. We should keep away any additional logic that would be better suited to database models. But I noticed you already shifted towards the database issue. As for the discussion that's unfolding, I'd just like to point out that it's probably best to delay writing code and first put something written up on the wiki. There is already a lot of great conceptual stuff there. It also might be time to push for a document-oriented database.
We also have to consider the user interface: how will we present lists to the user and let them modify these lists?
By listifying every field, we would essentially have a separate table for every field. This is inefficient, messy, and (I believe) unnecessary. However, your concern is valid. I think the best solution would be one where we can incrementally convert fields to lists without having to convert everything at once. This does mean we would have to support both list and non-list fields, which will be a pain up-front, but shouldn't be too much of a problem in the long run, so long as it doesn't take much effort to move a field to the list system down the road if it is deemed necessary. I have an idea for this that I'm currently looking at. I'll post again when I have something to present.
@geigerzaehler brings up a really good point, and I honestly don't know right now. I kind of want to focus on the backend implementation right now.
Yes, it would be great to avoid a separate table for every list-valued field! I can only think of two ways around this, both of which have obvious issues:
These obvious roadblocks are exactly why this issue has stagnated for so long…
I agree about the code-writing part. It's best to have a plan before getting too deep.
I can do that once I have a more concrete suggestion. Where on the wiki would I post such a thing?
That would probably make things easier, but I have no experience in such a thing.
I know it has been a while, but I've been looking into this issue in my spare time. Based on what I've learned about Python's sqlite3 package, relational database design, and how beets currently works, I've come to a number of conclusions:
SQLite provides the REGEXP operator, which tells it to use a user-defined function to decide whether there is a regular-expression match.
Now we can do something like this:
And it will return any row where the column matches the pattern. If we were to use a separator character that we know won't conflict with any character in any value, we can store it like that. For illustration's sake, let's say we used a semicolon for the separator character. Then we can query using a regular expression like
I haven't dove into the Query class enough to understand exactly how querying the database in beets works, but I think that this could be an acceptable solution. Can someone with a better understanding of beets think of any major problems or complications with this idea?
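For reference, here is roughly what the REGEXP idea looks like with Python's sqlite3 module: create_function registers the callback, and SQLite's "X REGEXP Y" syntax calls it as regexp(Y, X). The separator and schema are illustrative.

```python
import re
import sqlite3

db = sqlite3.connect(':memory:')
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X), so the
# pattern arrives as the first argument.
db.create_function(
    'REGEXP', 2,
    lambda pattern, value: re.search(pattern, value) is not None,
)
db.execute("CREATE TABLE items (genre TEXT)")
db.execute("INSERT INTO items VALUES ('Rock;Pop'), ('Jazz')")

# With ';' as the assumed separator, match rows whose list contains 'Pop':
rows = db.execute(
    "SELECT genre FROM items WHERE genre REGEXP ?",
    (r'(^|;)Pop(;|$)',),
).fetchall()
assert rows == [('Rock;Pop',)]
```

The anchoring in the pattern is what prevents 'Pop' from matching inside a longer value like 'Popol Vuh'.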
Thanks for all the thoughts on this! We should indeed get moving on list-valued fields. I'm a little confused about your comment about conversion to strings. I am pretty sure we don't do this—can you explain why you think we're converting all values to strings? In that light, packing all lists as separated strings is somewhat worrisome. Serializing and deserializing is a lot of overhead to pay for every query, especially for numeric fields, where this will require a bunch of manipulation on each row to do any comparison. Another way we could do this would be to effectively make all fields into flexible fields: put it all in one big id/key/value table. There are also indirection overheads there, but at least the possibility of optimization (unlike with serialization to strings).
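A toy version of the "everything becomes a flexible field" idea: a single id/key/value table, where repeating a key encodes a list (the schema is illustrative, not beets' actual one):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE attrs (item_id INTEGER, key TEXT, value TEXT)")
db.execute("CREATE INDEX attrs_lookup ON attrs (item_id, key)")
db.executemany("INSERT INTO attrs VALUES (?, ?, ?)", [
    (1, 'title', 'Under Pressure'),
    (1, 'artist', 'Queen'),
    (1, 'artist', 'David Bowie'),  # second row with the same key: a list entry
])

artists = [row[0] for row in db.execute(
    "SELECT value FROM attrs WHERE item_id = ? AND key = ? ORDER BY rowid",
    (1, 'artist'))]
assert artists == ['Queen', 'David Bowie']
```

The (item_id, key) index is the optimization possibility mentioned above; with string packing there is nothing comparable to index.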
I'm sorry. I think I didn't look at types.py carefully enough. I misinterpreted a comment in there, and I didn't look at the code enough to confirm what I thought I read.
I agree. Especially since number values aren't being stored as strings. But I still think it's the best option without doing a complete redesign of the database schema. And I definitely think it's the best option if we want list support for every field.
How would we optimize such a table? The only thing I can think of would be to index by the item/album id. This would still be extremely inefficient. We'd literally need one SQL statement for every field insert/update/read. Items have almost 30 fields. That's 30 INSERT statements just for adding a single item to the database. Add support for lists and that number can be even bigger. Reading/updating the database would be just as inefficient unless we also indexed by column name, but that will just make inserts take even longer, and it still wouldn't be as fast as a single query to the database. Unless I'm missing a better way to optimize this, I'm having a really hard time agreeing that this would be preferable to serializing/deserializing strings.
You make lots of great arguments for why the key/value approach would also be inefficient. Clearly, what we really want here is a document-oriented database, and we're working around the fact that we don't have one. Maybe the most helpful thing to do here would be to hold a "bake-off". It's not completely obvious to me which approach is faster—key/value or string serialization. An ambitious contributor could try implementing tiny, toy versions of each. Both implementations should implement exactly the same interface. They could then write a couple of benchmarks that demonstrate the performance characteristics and argue—with data!—that one is faster than the other. Any takers?
I think the effort would be better spent on moving to a document-oriented database.
@geigerzaehler
@sampsyo I can't help but think that there has got to be some software out there already that can do this the way we want. I did a little looking into the document-oriented databases that were mentioned in the wiki, and while they would be really awesome to work with, I think they're a step back from our current database. Perhaps somebody else would like to look into it to see if they agree. Still, I can't help but feel there should already be a solution to this.
Yes, there are suspicious performance problems with some of the alternative databases. But if we're going to be comparing with data, it might also make sense to compare against those just to make sure our intuition is correct! One project that caught my eye (not finished, but has the right idea for our use case): https://github.com/dw/acid/ And yes, it would totally be a good idea to settle on the interface. We should come up with something clean and regular. I don't exactly know what it should look like, but I like the idea of allowing simple access when you just need a single value alongside list access when you need all the data. Not sure if that's possible to support elegantly, but it's nice to dream. 😉
Acid does look pretty interesting. At a quick glance, it seems quite a bit more powerful than the other options I looked at. I am having a hard time figuring out exactly where the project is at. Do you know anything about that? |
No, it's a little opaque in terms of project status. Maybe it's worth contacting the author? |
I was tinkering with the join table for flexible attributes and lists (I guess this is kind of horizontal partitioning). I don't know how much "magic" it would need for creating the query from user input + a "metadata" table, like in CMSs where you can customize the fields. The idea would be like this:

```sql
CREATE TABLE songs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT
    -- ...default fields
);

-- list type
CREATE TABLE songs_artists (
    id INTEGER,
    ord INT DEFAULT 1,
    name TEXT NOT NULL,
    joiner TEXT, -- e.g. "feat."
    PRIMARY KEY (id, ord),
    FOREIGN KEY (id) REFERENCES songs(id)
);
```

```python
# example
fields = OrderedDict({
    'title': 'songs.title',
    'artists': 'GROUP_CONCAT(songs_artists.joiner || songs_artists.name, "")',
})
joins = [
    'JOIN songs_artists ON songs.id = songs_artists.id',
]
# criteria would be built from user input + the "metadata" table
filter_subselect = f"SELECT id FROM songs_artists WHERE {criteria}"
sql = (
    f"SELECT {', '.join(fields.values())} FROM songs "
    f"{' '.join(joins)} WHERE songs.id IN ({filter_subselect})"
)
# fetch and create objects / DTOs etc.
```
Sure! This seems like about the right design to make arbitrary fields into lists. I'd be interested to investigate what the potential performance trade-offs are…
For PostgreSQL I've seen people recommend converting dbs with JSON fields to regular string fields and relying on downstream software to decode the JSON when dealing with pgsql versions prior to native JSON support. Maybe beets could also just store JSON in regular string fields and decode it in Python. (This also seems compatible with the json1 approach.)
@Freso that'd mean we'd need two versions of our queries to degrade nicely for versions of SQLite without json1 support.
@jackwilsdon I think there are some other great advantages, e.g., that you can store lists of non-string objects (incl. other lists!), which is not possible with the "CSV" approach (regardless of what we decide our "C" to be). (We haven't found a use case for these though, but it's nice to go down a path that is known to be easily extendable.) I do agree that it's a bother, and I wouldn't mind if it was decided to be a beets 2.0.0 release feature that simply has a hard requirement for json1.
What's wrong with just using the standard relational DB option of a join table? Losing indexing and query support seems like a poor trade-off for storing simple tabular data as JSON. |
@Freso I'm certainly not against storing it as JSON; my point was mainly around the fact that we wouldn't be able to use any advantages of json1 in that case. I guess one thing to look at is the % of installations that support json1.
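For anyone unfamiliar with what json1 buys us: once lists are stored as JSON text, json_each() lets SQL look inside them. A sketch (assuming a SQLite build with the JSON functions, which are built in since SQLite 3.38; the schema is illustrative):

```python
import json
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, genres TEXT)")
db.execute("INSERT INTO items VALUES (1, ?)", (json.dumps(['Rock', 'Pop']),))
db.execute("INSERT INTO items VALUES (2, ?)", (json.dumps(['Jazz']),))

# json_each() expands the stored array, so individual values stay queryable
# in SQL without deserializing every row in Python.
rows = db.execute("""
    SELECT items.id FROM items, json_each(items.genres)
    WHERE json_each.value = ?
""", ('Pop',)).fetchall()
assert rows == [(1,)]
```

This is the kind of query that would need string splitting or a REGEXP callback under the separator-packing approach.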
That's great news; thanks for the heads up! |
Could something like this Python package, supersqlite, be used to bridge the gap until the features we need for JSON support are officially added?
Since Python 3.9.3 is now officially released, it should have json1 support (python/cpython@58d6f2e). Has anybody begun working on migrating this? I suspect that it will be an arduous task. If not, I might try taking a stab at it. @beetbox Will we need to include backwards support for older databases, or would it be better to have a forced database upgrade? I would love to use JSON to store lists, and there are quite a few other features that could benefit as a result of this upgrade. It seems that a lot of the work necessary for the conversion has already been done in #2503 if anyone else wants to try porting over the changes.
Personally, I don't see an issue with forcing a database upgrade. That's probably simpler to implement and more robust than attempting to be backwards compatible? AFAIK, up to now, beets doesn't have any serious database upgrade code (except for adding new columns to the album and item tables), but I don't see a problem with adding it. We may consider a command-line prompt where users need to acknowledge the irreversible upgrade, which we could use to encourage backing up the database first. I'm not so sure that we want to require Python 3.9.1 (released in August 2020) just yet; that's still somewhat recent. I don't think we have any idea what Python versions are common among beets users, unfortunately. I wouldn't say that we need to retain support for any Ubuntu LTS that's still out there. But maybe at least 3.8? That's not to say that work on this feature can't be started now, but I guess a few more opinions would be nice to set a realistic timeline for actually shipping json1.
I don't want to be dismissive, but I just wanna say: what would the Venn diagram of "Ubuntu users" + "beets users" look like? I bet there's not much overlap. Now if we were to say "Ubuntu users" + "Picard users", maybe there's quite a bit of overlap. Either way, those who are running Ubuntu (unless they use pip) are unlikely to upgrade to a new version of beets until a new release of Ubuntu appears, at which point the package gets upgraded in the Ubuntu repos. So if you want to serve those using Ubuntu, but not pip, you'd fix up a database transition from the latest Ubuntu/Ubuntu LTS to the next version of beets. And if you want to serve those using pip on Ubuntu, perhaps it's better to just ask them to upgrade the relevant packages. They did choose to install from pip. Ubuntu users eventually learn that using Ubuntu means a commitment to running the version of the packages specifically packaged for their version of Ubuntu. If you don't, you can mess up the system stability. Truth is, it's a big commitment to run the latest packages on Ubuntu. Which brings me to my next point: why not let us choose when to migrate the database? Perhaps you can leave a setting to use the single-value database until we can migrate. I don't know how big of a job it is to leave legacy support, though. You have to figure that out yourselves. That's my two cents as a user and not a developer. Edit: I came back here by chance and noticed I have two thumbs down. At the very least you should explain which of the things I said you disagree with. It's kinda useless to just add an emoji and call it a day. Hardly an argument, I think.
Another option could be to require different Python versions per OS. macOS should support json1 starting with Python 3.7, while I think most Linux OSs have been compiling their SQLite with json1 for a while (Ubuntu since at least 16.04). So really the Python requirement would be 3.7 on macOS, 3.9 on Windows, and a somewhat recent Linux distribution.
You have that option by choosing when you update beets. If you’re not ready to take the jump, just don’t update beets with the new db schema until you are. 🤷 The massive overhaul this will be to how beets works internally and with its database would make it extremely clunky to support the two different schemas simultaneously. It’s already a task that we’ve been wanting to do/implement ASAP for 7+ years—requiring backwards compatibility with old databases would only add to the already massive task of accomplishing this. |
What's the latest status update on this? |
This is the one feature of Beets that's keeping me from going "all in". I have thousands of files that are tagged with multiple genres. |
I'm attempting to add this myself. This is my first time ever dabbling in the beets codebase, so I probably won't know what I'm doing for a while, lol. But nonetheless, I'm determined to make this work. Perhaps @jpluscplusm could share some wisdom regarding the albumtypes addition from a few weeks ago? I'm thinking these tags could work similarly.
After carefully reading through the prior comments, I am wondering why JSON seems to be preferred over using a simple null-valued separator in a string field? I think the performance impact would be minimal, and it would require the least headache in terms of dependencies, migration, etc. I would also think making non-list fields into JSON for the sake of fields being agnostic would incur more compute cost. At this point, the issue has been open for almost 10 years. I think most users who care about this issue will take any solution. I will be attempting the "null separator in a string field" solution to this problem.
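A sketch of the null-separator idea, assuming NUL never occurs in tag values (note that some SQLite string functions truncate at embedded NULs, so storage would need checking; this is not existing beets code):

```python
# '\x00' is assumed never to appear in tag values, so no escaping is needed.
SEP = '\x00'

def pack(values):
    return SEP.join(values)

def unpack(packed):
    # A lone empty-string value collapses to the empty list here,
    # since pack(['']) == pack([]) == ''.
    return packed.split(SEP) if packed else []

assert unpack(pack(['Rock', 'Pop'])) == ['Rock', 'Pop']
assert unpack('') == []
```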
@JOJ0 I think we can close this (right before the 10 year mark 😅) |
Please add support for multi-valued tags. For example, this could be used for the multi-valued occurrences of PERFORMER / COMPOSER / ENSEMBLE / etc. recommended by http://age.hobba.nl/audio/mirroredpages/ogg-tagging.html .