Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small performance optimization when fetching items from library. #2798

Merged
merged 4 commits into from
Mar 23, 2019

Conversation

pprkut
Copy link
Contributor

@pprkut pprkut commented Jan 27, 2018

Fetch flexible attributes once for all relevant items instead of once per item.
For queries with larger result sets this drastically cuts down the number of issued sqlite queries.

In the end, this didn't quite give the performance boost I'd hoped it would, but it feels still slightly faster for larger queries.
Unfortunately the hot spot seems to be somewhere else though :-/

One thing to note, since the list of IDs used in the WHERE..IN clause can be quite large (and reach sqlite's query parameter limit), I resorted to not passing them as parameters.
This should be fine, since these are not values originating from input but coming from the database from a previous query. Not sure how big of an issue this is nonetheless.

Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking into this! It does seem like a much better idea to fetch everything up front instead of looping over each item and then issuing a separate query. 👍 👍 👍

I have a couple of questions within, including a follow-up on your concerns about the list of IDs in the SELECT query.

Have you run any simple performance comparisons (just time on a query or two) to see what the impact is, even if it didn't solve the main bottleneck?

beets/dbcore/db.py Outdated Show resolved Hide resolved
beets/dbcore/db.py Outdated Show resolved Hide resolved
beets/dbcore/db.py Show resolved Hide resolved
@pprkut
Copy link
Contributor Author

pprkut commented Feb 22, 2018

Finally had some time to rerun my performance tests under at least somewhat untainted circumstances.
I checked with time beet ls &> /dev/null, which for me is 11240 tracks. I have the db on a SSD, not sure that has a lot of impact, but mentioning it just to be aware.

Without the patch

real    0m9.766s
user    0m9.500s
sys     0m0.266s

With the patch

real    0m9.413s
user    0m9.210s
sys     0m0.203s

Both are fastest runs out of 5. Based on this I'd say on average it gives a 4% speedup.

Edit: Forgot to mention, a lof of that time seems to be in the actual output generation. Loading the entire collection in roquette takes about 3s, so about 66% less. Given that, the 0.4s speedup doesn't sound that bad actually.

@sampsyo
Copy link
Member

sampsyo commented Feb 22, 2018

Indeed! Thanks for doing the measurements—that's really helpful to know. And it helps point the way to the next bottleneck too (formatting the output).

@pprkut
Copy link
Contributor Author

pprkut commented Nov 28, 2018

Finally had time to do a revision on this. I moved the flex attr query to _fetch(), sort of like you already suggested. That way we still only query once, but we can construct a better subquery that no longer has to worry about a length limitation or SQL injection possibilities. I think this looks good now :-)

I did some further measurements, to get the actual raw performance of iterating the entire library (11832 items):

start = time.time()
for item in lib.items(query):
    continue
end = time.time()

On master this takes 3.66s average
Using a single flex attr query this goes down to 2.95s average
I have another optimization potential that I'll look into next to get this further down to 1.41s, but it'll need some more verification.

I also see some unit test failures in travis that I don't really understand (I don't get those locally), so if someone has some pointers on what to do about those I'd really appreciate it.

@sampsyo
Copy link
Member

sampsyo commented Dec 1, 2018

Nice!! I agree this is looking good—I'm convinced this is the way to go, and the code changes are nice and clean. Awesome work!!!

I'm also pretty confused by the test failures. I am able to reproduce them locally (for example, by running nose test/test_metasync.py). Looking into it now.

Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I see the problem with those tests. The issue arises when querying for model objects that have no flexattrs. In this case, _get_indexed_flex_attrs does not have an entry for the object (even an empty one). So trying to look up its flexattrs raises a KeyError.

I've proposed an easy fix in an above inline comment. One more complicated fix would be to change _get_indexed_flex_attrs to produce empty entries for other model objects, but that's probably not a good idea.

To reason this comes up in the tests for metasync, for what it's worth, is the way the harness sets up its fixtures:

items[0].title = 'Tessellate'
items[0].artist = 'alt-J'
items[0].albumartist = 'alt-J'
items[0].album = 'An Awesome Wave'
items[0].itunes_rating = 60
items[1].title = 'Breezeblocks'
items[1].artist = 'alt-J'
items[1].albumartist = 'alt-J'
items[1].album = 'An Awesome Wave'

There's one item with a flexattr and one without.

beets/dbcore/db.py Outdated Show resolved Hide resolved
@pprkut
Copy link
Contributor Author

pprkut commented Dec 2, 2018

Thanks for the explanation. All makes sense now :)
I fixed up the tests so this should be good to go now

@pprkut
Copy link
Contributor Author

pprkut commented Mar 23, 2019

Hey! Just checking, this is good to go right? If so I'll go ahead and merge it

@sampsyo
Copy link
Member

sampsyo commented Mar 23, 2019

Yes; sorry for not responding! This looks good to go. It's a small change but a very important one! I say merge away.

Fetch flexible attributes once for all items instead of once per
item. For queries with larger result sets this drastically cuts
down the number of issued sqlite queries.
This resolves the query length and potential security problems,
while keeping the performance benefits.
@pprkut pprkut merged commit 29425ed into beetbox:master Mar 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants