-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Match on module aliases for auto import suggestions #730
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I like the direction! Some minor concern regarding performance.
connection.execute("CREATE INDEX IF NOT EXISTS alias ON aliases(alias)") | ||
|
||
modules = Query( | ||
"(SELECT DISTINCT aliases.*, package, source, type FROM aliases INNER JOIN names on aliases.module = names.module)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a DB expert, but the names table can comprise 10,000 - 100,000 rows, so I am wondering if we should run this inner join on every autoimport request (which can happen with every keystroke when rope is run inside of a language server).
Can we quickly test how much adding alias support slows down search
?
Alternatively, I'd make sure aliases
only contains the aliases to modules that exist in names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The joins are pretty fast, I made a notebook to test it out.
The join time should be dominated by the Alias table, not the Names table, because the Names table has an index on the module column. Also, here we're including a where clause which makes the left side of the join even smaller. Most DB engines are pretty good about pushing down the filter past the join and sqlite3 seems to handle it well.
I thought about this a little bit before testing out this implementations I see 3 main paths forward:
- The join approach
- Materialize the availability information in the Aliases table as a column, we'd need to be careful to always update the Aliases table whenever updating the cache. This would probably be the fastest approach, but more work.
- Keep the aliases in memory as a list or dict. We'd basically be implementing the join logic manually, but it might be really fast if the # of Aliases is very small. Then again if the # of Aliases is very slow the join should also be very fast.
@tkrabel what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current approach has the benefit that we never have to do any updates on the aliases tables. The names table is the source of truth of that is available to the user.
If you're happy with the performance, then let's go with the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrBago thanks for doing testing the performance notebook, the notebook brings up something that is interesting/surprising to me, in that the module search_by_name_like query is much slower than what I was expecting. A prefix search using an index should not have been that slow.
That is an unrelated issue from this PR though, so I've created another ticket for that #736, but with the fixed index the Alias query should hopefully become faster as well. 883ms for an inner join between a large table and a very small table doesn't smell right to me that seems to indicate a full table scan as well.
I'll see if I can fix this tomorrow, but in the meantime, apologies but I'll be holding off on merging this PR yet until that is fixed and then we can see the new performance impact.
@MrBago thanks for making this PR, from a quick look this looks great to me, I'm on a trip right now, so my availability to review this is quite limited for the next few weeks. Once I'm back, I'll look into this properly, but please continue the conversation for now. |
@lieryan I added the aliases to prefs so that it can be configured. When you have a min can you take a look at the PR, also I think I need to be given some kind of permission so the CI will run tests on my PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MrBago, this looks good to me on a first pass. I am going to have to hold off on merging for now for reasons mentioned below, but once I resolved that issue then I'll be taking a second look at this again.
connection.execute("CREATE INDEX IF NOT EXISTS alias ON aliases(alias)") | ||
|
||
modules = Query( | ||
"(SELECT DISTINCT aliases.*, package, source, type FROM aliases INNER JOIN names on aliases.module = names.module)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrBago thanks for doing testing the performance notebook, the notebook brings up something that is interesting/surprising to me, in that the module search_by_name_like query is much slower than what I was expecting. A prefix search using an index should not have been that slow.
That is an unrelated issue from this PR though, so I've created another ticket for that #736, but with the fixed index the Alias query should hopefully become faster as well. 883ms for an inner join between a large table and a very small table doesn't smell right to me that seems to indicate a full table scan as well.
I'll see if I can fix this tomorrow, but in the meantime, apologies but I'll be holding off on merging this PR yet until that is fixed and then we can see the new performance impact.
models.Package.create_table(self.connection) | ||
models.Metadata.create_table(self.connection) | ||
self.add_aliases(self.project.prefs.import_aliases) | ||
data = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand this correctly, this will add the aliases into the database only when the database is created. IIUC, this would need to depend on the database being re-created when preference changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I didn't look at the different ways that prefs can change, we could add a method to clear the aliases table and reset it and invoke that when the prefs are updated.
@lieryan let me know if I can help with looking at the timing. One key thing to notice in my notebook is that I intentionally bloated the database to get the timing for an extreme case. When doing the timing, one thing that I found odd was that including DISTINCT in the query seemed to make a bigger difference than I expected. I expected the database to be more efficient than python and removing duplicates, but I found that removing the "DISTINCT" and using a set in python was more efficient than pushing down to the database :/. |
Hi @tkrabel-db, apologies for the delay. I got sidetracked as I didn't get the performance goal that I was expecting few weeks ago when I initially experimented with fixing the index. But I tried again today with a fresh pair of eyes, and it now worked as I expected, after I found a couple of silly errors when creating the index in my original attempt. Now after applying PR #739, this is more inline with the performance that I was expecting for these operations:
with just the old case sensitive index, the LIKE operations would've been more like a 400ms operation. if you would update your PR to include the new index as well I can review that again. I think you may need to add an index that looks like this for the alias table (untested):
|
rope/base/prefs.py
Outdated
@@ -140,6 +140,22 @@ class Prefs: | |||
"""), | |||
) | |||
|
|||
import_aliases: List[Tuple[str, str]] = field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding an autoimport prefs table in #516 , can we move this there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lieryan updated my PR and added the new index. The aliases table should be small so I'm not sure how much of an impact this index will have, but it shouldn't hurt. I tried to optimize the alias query a bit, but I wasn't really able to move the needle. Let me know if you think using an alias table and join like this might be an issue. |
@lieryan When you have a few min can you take a look at this PR, I have some time and would love to move this across the finish line. |
@lieryan can you prioritize this work so that we have closure? :) |
Thanks @MrBago for implementing this PR and @tkrabel-db, @bagel897 for contributing to the discussions. I've made some changes to the preferences to align the autoimport preferences changes with #516. @all-contributors add @MrBago for code |
Description
This PR adds a table of aliases for AutoImport to use. It joins the alias with the names table to find available modules with matching aliases.
This PR is in the initial draft to verify the approach and get feedback. It is still missing:
Fixes #712
Checklist (delete if not relevant):