
Remodel SQLite Schema #5477

Merged
merged 58 commits into lsif-clean-sqlite on Sep 9, 2019

Conversation

efritz
Contributor

@efritz efritz commented Sep 4, 2019

The Problem

The old SQLite database stored everything in a document blob: a gzipped and JSON-encoded set of the following data:

  • ranges in the document
  • monikers attached to ranges
  • package information attached to monikers
  • definition results shared by multiple ranges in the document
  • reference results shared by multiple ranges in the document
  • hover results shared by multiple ranges in the document

This allowed a single, simple SQL query to return the blob of data needed to answer any query that doesn't involve looking at two files.
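
Concretely, the decoded blob looked roughly like the following. This is a hypothetical TypeScript sketch; field and type names are illustrative, not the actual schema (see lsif/src/database.ts for the real model).

```typescript
// Hypothetical, simplified shape of the decoded (gunzipped, JSON-parsed)
// document blob. Every single-file query about this document can be answered
// from this one value -- but the whole value must be loaded every time.
interface RangeData {
    startLine: number
    startCharacter: number
    endLine: number
    endCharacter: number
    monikerIds: string[]
    definitionResultId?: string
    referenceResultId?: string
    hoverResultId?: string
}

interface DocumentBlob {
    ranges: Map<string, RangeData>
    monikers: Map<string, { scheme: string; identifier: string; packageInformationId?: string }>
    packageInformation: Map<string, { name: string; version: string }>
    definitionResults: Map<string, { documentPath: string; rangeId: string }[]>
    referenceResults: Map<string, { documentPath: string; rangeId: string }[]>
    hoverResults: Map<string, string>
}
```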

However, after removing the result sets at import time in order to stop the server from having to trace graph edges at query time (which is undesirable -- why do it on every query when you can do it once at import?), the size of the definition and reference results became apparent. This didn't cause a problem by itself, but it did reveal the extent of a problem that already existed.

This created much larger document blobs, which will become a problem at some point: in order to answer queries quickly about a range in a large document, it may be necessary to pull multiple megabytes of unrelated information out of the SQLite file.

The Solution

Re-model the database so that documents no longer track their own definition and reference results (but they do retain their ranges, monikers, package information, and hovers). Profiling has shown that the OVERWHELMING proportion of the data is in these two fields.
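
In other words, the remodeled blob keeps everything except the two heavy fields. Again a hypothetical sketch, reusing RangeData from the earlier one:

```typescript
// Hypothetical shape of the remodeled document blob. Ranges keep their
// definitionResultId and referenceResultId, but those ids now resolve
// against a separate table rather than maps stored inside this blob.
interface SlimDocumentBlob {
    ranges: Map<string, RangeData>
    monikers: Map<string, { scheme: string; identifier: string; packageInformationId?: string }>
    packageInformation: Map<string, { name: string; version: string }>
    hoverResults: Map<string, string>
}
```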

We now put definition and reference results in another table. However, experiments over the Labor Day holiday showed that storing data at this scale one row per result is infeasible (the overhead for tuples is too high at insertion time, and too large on disk). We need to apply the same gzipped-and-JSON-encoded trickery here.
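
A minimal sketch of that encoding, assuming Node's built-in zlib (the helper names are hypothetical):

```typescript
import { gzipSync, gunzipSync } from 'zlib'

// JSON-encode and gzip a chunk before inserting it as a single row.
function encodeChunk(chunk: unknown): Buffer {
    return gzipSync(Buffer.from(JSON.stringify(chunk), 'utf8'))
}

// Reverse the process for a chunk read back out of SQLite.
function decodeChunk<T>(data: Buffer): T {
    return JSON.parse(gunzipSync(data).toString('utf8'))
}
```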

So far, we can't:

  • store it all in one giant blob (it would be much larger than a document),
  • store it along with a document or as a sibling of a document (it would not be easy to share the same definition or reference results between documents), or
  • store it in individual rows (due to the required throughput of the converter and the rarity of the rare earth materials required to produce enough disk space).

What we can do is shard definition and reference results over several rows, with a chunk count that scales dynamically with the size of the input dump. Any identifier for a definition or reference result in a document can then determine the id of its result chunk, given the same hash function and the total number of chunks. This requires loading a second blob when resolving definition and reference results, but these chunks can be cached in memory in the same manner as document blobs. See the code for details!
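
For illustration, the chunk lookup could be as simple as a hash and a modulus. This is a sketch only; the actual hash function used in this PR may differ:

```typescript
// Map a definition or reference result identifier to the index of the
// result chunk that stores it. The importer and the query path must use
// the same hash function and the same total chunk count to agree.
function hashKey(id: string): number {
    let hash = 0
    for (let i = 0; i < id.length; i++) {
        hash = (Math.imul(31, hash) + id.charCodeAt(i)) | 0
    }
    return hash
}

function chunkIndex(resultId: string, numChunks: number): number {
    return Math.abs(hashKey(resultId)) % numChunks
}
```

Because the chunk count is fixed for a dump at conversion time, a reader can recompute the same chunk index later without any extra lookup table.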

Results

Uploading is now 2-3x faster 🎉 for select benchmarks from this document.

@codecov

codecov bot commented Sep 4, 2019

Codecov Report

Merging #5477 into lsif-clean-sqlite will increase coverage by 0.07%.
The diff coverage is 96.62%.

@@                  Coverage Diff                  @@
##           lsif-clean-sqlite    #5477      +/-   ##
=====================================================
+ Coverage              47.39%   47.47%   +0.07%     
=====================================================
  Files                    745      747       +2     
  Lines                  45876    45919      +43     
  Branches                2711     2704       -7     
=====================================================
+ Hits                   21742    21799      +57     
+ Misses                 22112    22093      -19     
- Partials                2022     2027       +5
Impacted Files                        Coverage Δ
lsif/src/cache.ts                     98.24% <100%> (+0.98%) ⬆️
lsif/src/xrepo.ts                     100% <100%> (ø) ⬆️
lsif/src/default-map.ts               100% <100%> (ø)
lsif/src/backend.ts                   76.74% <100%> (-0.53%) ⬇️
lsif/src/inserter.ts                  92.3% <100%> (ø) ⬆️
lsif/src/database.ts                  84.57% <88.6%> (-1.46%) ⬇️
lsif/src/util.ts                      90.9% <88.88%> (-9.1%) ⬇️
lsif/src/importer.ts                  98.67% <98.61%> (+0.61%) ⬆️
lsif/src/correlator.ts                99.33% <99.33%> (ø)
cmd/gitserver/server/serverutil.go    51.21% <0%> (-1.7%) ⬇️
... and 24 more

@efritz efritz added the lsif label Sep 4, 2019
@felixfbecker felixfbecker added the team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) label Sep 5, 2019
@efritz
Contributor Author

efritz commented Sep 9, 2019

@chrismwendt @lguychard I would actually like to merge this into the other sqlite branch so that we don't pollute master with two large commits (one that kind of undoes the other). This will also give @felixfbecker a chance to do another pass of https://github.com/sourcegraph/sourcegraph/pull/5332 without having to do a weird context switch (I assume it would be easier since he's been living in the other set of diffs).

@efritz efritz requested a review from beyang as a code owner September 9, 2019 19:45
@efritz efritz merged commit c97c9e6 into lsif-clean-sqlite Sep 9, 2019
@efritz efritz deleted the lsif-sqlite-simplify-db branch September 9, 2019 19:57