
implement bulk_add for applying several SQLite operations per transaction #3300

Closed
Conversation


khimaros commented Jun 7, 2019

Initially inspired by #1935 and seemingly relevant to #2388

Based on my snakeviz benchmarks in #1935 (comment) and #1935 (comment), dbcore overhead for an import operation on 847 local files was reduced from 95% (153 seconds) to <10% (5 seconds).

I have exercised this patch with a dbcore frontend I built to organize, query, and de-duplicate arbitrary (including non-music) files. To realize these performance benefits in beets, the library import path will need to be modified to make use of this method.

N.B. The degree of batching chosen for bulk_add is a trade-off between buffering delay, CPU time, and peak memory usage. I've found negligible memory usage for single transactions containing tens of thousands of model records.
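For a sense of the shape of the change, here is a minimal sketch of a batched bulk_add. This is a sketch only, assuming a dbcore-style Database with a reentrant transaction() context manager and model objects that insert themselves via obj.add(db); the names and batch size are illustrative, not the exact code in this patch:

# Hypothetical sketch, not the PR's actual code. BATCH_SIZE is the
# buffering/memory trade-off knob described above.
BATCH_SIZE = 10000

def bulk_add(db, objs, batch_size=BATCH_SIZE):
    """Insert an iterable of model objects, one transaction per batch."""
    batch = []
    for obj in objs:
        batch.append(obj)
        if len(batch) >= batch_size:
            _flush(db, batch)
            batch = []
    if batch:
        _flush(db, batch)

def _flush(db, batch):
    # One outer transaction covers the whole batch; any per-object
    # transactions inside obj.add() collapse into it.
    with db.transaction():
        for obj in batch:
            obj.add(db)

Because bulk_add accepts any iterable, callers can feed it a generator and avoid materializing every record in memory at once.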


sampsyo commented Jun 7, 2019

Cool! Thanks for looking into this!!

Here's one thing I don't quite understand about this profiling work: does the overhead come from literally creating SQLite transactions, or from the in-Python cost of looking up the current transaction? We built the transaction system in dbcore to avoid creating redundant transactions. So if you do this:

with db.transaction():
    with db.transaction():
        db.mutate(...)

…only one SQLite transaction should actually be created. More relevantly, if you do this:

with db.transaction():
    for item in items:
        db.add(...)

…then the transaction created inside the add call should be approximately a no-op. Of course, looking up the transaction stack for the current thread is not free! So it's possible the overhead is coming from there.
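For anyone unfamiliar with that machinery, here is a simplified sketch of a reentrant, per-thread transaction stack. The class names are illustrative, not dbcore's actual code, and the real implementation also handles locking:

import sqlite3
import threading

class Database:
    def __init__(self, path):
        # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
        self._connection = sqlite3.connect(path, isolation_level=None)
        self._local = threading.local()

    @property
    def _tx_stack(self):
        # One transaction stack per thread.
        if not hasattr(self._local, 'stack'):
            self._local.stack = []
        return self._local.stack

    def transaction(self):
        return _Transaction(self)

class _Transaction:
    def __init__(self, db):
        self.db = db

    def __enter__(self):
        if not self.db._tx_stack:  # outermost: begin a real transaction
            self.db._connection.execute('BEGIN')
        self.db._tx_stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.db._tx_stack.pop()
        if not self.db._tx_stack:  # outermost: commit or roll back
            self.db._connection.execute(
                'COMMIT' if exc_type is None else 'ROLLBACK')

Every nested transaction() still pays for the thread-local lookup and the list push/pop, which is the in-Python cost in question.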

But just to confirm: did you measure the performance difference between an unwrapped add loop, a transaction-wrapped add loop, and the new bulk_add that doesn't bother creating a transaction at all? The relative differences between these three would be enlightening…
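Something like this sketch would tease the three apart (db, make_items, and bulk_add here are hypothetical stand-ins for the real fixtures, and each variant would want a fresh database in practice):

import time

def bench(label, fn):
    start = time.time()
    fn()
    print('{}: {:.2f}s'.format(label, time.time() - start))

items = make_items(10000)  # hypothetical fixture: build N model objects

def unwrapped():
    for item in items:
        db.add(item)  # one SQLite transaction per add()

def wrapped():
    with db.transaction():  # inner transactions collapse into this one
        for item in items:
            db.add(item)

bench('unwrapped add loop', unwrapped)
bench('transaction-wrapped loop', wrapped)
bench('bulk_add', lambda: db.bulk_add(items))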


khimaros commented Jun 7, 2019

@sampsyo -- indeed, performance seems to be nearly identical with the method you describe here!

I sat on this patch for a very long time, and it seems the landscape may have changed. For a bit of background, when we spoke about this last October, the conclusion was:

Yes, absolutely! A bulk insert would be a great way to do it. You could even imagine letting the bulk_add method accept an iterable (instead of just a list) to let new model objects get generated on the fly without materializing them all in memory.

It's possible that nested transactions were not handled gracefully back then.

If this PR doesn't add any value, let's just close it and I will switch my dbcore clients to the nested-transaction solution; I'll leave the final judgment to you. If it does add value, it should have some tests. I have some staged in my local repo from the performance testing.

Thank you for continuing to build and maintain this excellent tool!


sampsyo commented Jun 7, 2019

Yeah! I read that comment of mine with great interest too. It's constantly surprising how hard it is to reconstruct my own thinking from a previous era. Reading my comments, it looks like I may have been imagining that the overhead came from SQLite execute calls, and that such a bulk insertion would use one big statement? Who can really say what I was thinking, I guess.

If things are just as fast with the explicit transaction approach, I say we probably don't need bulk_add. But it might be worth documenting this phenomenon somewhere, especially if other people are going to be as intrepid as you and write their own dbcore clients!


khimaros commented Jun 8, 2019

There is definitely no performance reason to favor bulk_add over the nested-transaction approach:

$ nosetests -v --with-timer test.test_dbcore_perf
test_dbcore_bulk_add (test.test_dbcore_perf.PerformanceTest) ... ok (34.5063s)
test_dbcore_single_transaction (test.test_dbcore_perf.PerformanceTest) ... ok (33.9157s)

[success] 50.43% test.test_dbcore_perf.PerformanceTest.test_dbcore_bulk_add: 34.5063s
[success] 49.57% test.test_dbcore_perf.PerformanceTest.test_dbcore_single_transaction: 33.9157s
----------------------------------------------------------------------
Ran 2 tests in 68.423s

Closing this PR now, thanks for the feedback!

khimaros closed this Jun 8, 2019