This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Implement persistent caching in SourceMgr #431

Closed
fabulous-gopher opened this issue Apr 21, 2017 · 15 comments

@fabulous-gopher

From @sdboyer on December 30, 2016 11:46

With additional testing being done as part of the new tool, it's increasingly clear that we really, really need to get persistent, disk-backed caching into place. There are four basic things we need to cache:

  1. Version lists - all the versions available for a source
  2. Manifest and Lock - for a given rev and a particular analyzer, these should be permanently cacheable
  3. PackageTree - for a given rev and gps version, these should be permanently cacheable
  4. Project root computations - those that require a go-get metadata query

The biggest immediate question is how to actually store it on disk. For the first three, at least, discrete files would probably work. However, I'm inclined to store everything together using bolt. That'll cover the last case more easily, especially because it is likely to require prefix queries; it'll also probably make it a bit cleaner to store TTLs.
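A minimal sketch of what that Bolt layout could look like, assuming the github.com/boltdb/bolt package; the bucket and key names here are illustrative, not dep's actual schema. The cursor scan at the end shows the prefix-query shape that makes Bolt attractive for the project-root case:

```go
package main

import (
	"bytes"
	"fmt"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("sourcecache.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// One top-level bucket per cache category keeps the keyspaces separate.
	err = db.Update(func(tx *bolt.Tx) error {
		for _, name := range []string{"versions", "manifest-lock", "pkgtree", "roots"} {
			if _, err := tx.CreateBucketIfNotExists([]byte(name)); err != nil {
				return err
			}
		}
		// Cache a go-get metadata result: import path -> project root.
		return tx.Bucket([]byte("roots")).
			Put([]byte("example.com/user/repo"), []byte("example.com/user/repo"))
	})
	if err != nil {
		panic(err)
	}

	// The prefix query for the project-root case: scan stored roots that
	// share a prefix with the path being resolved.
	db.View(func(tx *bolt.Tx) error {
		c := tx.Bucket([]byte("roots")).Cursor()
		prefix := []byte("example.com/user")
		for k, v := c.Seek(prefix); k != nil && bytes.HasPrefix(k, prefix); k, v = c.Next() {
			fmt.Printf("%s -> %s\n", k, v)
		}
		return nil
	})
}
```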

(supersedes #8)

Copied from original issue: sdboyer/gps#130

@fabulous-gopher
Author

From @sdboyer on April 4, 2017 14:36

With #196 finally done, the road to this is pretty clear. There's even a TODO exactly where a handle to it should be dropped in 😄

We do need to pick a storage lib for it. LevelDB would be fine, as would Bolt.

@jmank88
Collaborator

jmank88 commented Jun 7, 2017

I can work on this issue.

I've started sketching out a Bolt singleSourceCache implementation, and the main open question is how to encode the complex structs/interfaces. @sdboyer Do you have any ideas/preferences for serializing: Manifest/RootManifest, Lock, pkgtree.PackageOrErr, and UnpairedVersion (and their underlying types - e.g. Constraint, maybe typedString())?

@jmank88
Collaborator

jmank88 commented Jun 7, 2017

Looking at dep's rawManifest for inspiration makes me think that most of the underlying types can be cleanly converted to string fields anyway, so I'll experiment with how far I can get with native Bolt structure (nested buckets, and/or composite keys ending in field-name suffixes) before introducing another encoding layer.
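A rough sketch of that nested-bucket approach, with hypothetical bucket and key names (source → revision → field); this is not the schema the PR actually landed on:

```go
package cache

import "github.com/boltdb/bolt"

// putRevision stores stringified manifest/lock fields for one revision of
// one source, using nested buckets (source -> revision) and field-name keys,
// so no extra encoding layer is needed.
func putRevision(db *bolt.DB, source, rev string, fields map[string]string) error {
	return db.Update(func(tx *bolt.Tx) error {
		sb, err := tx.CreateBucketIfNotExists([]byte(source))
		if err != nil {
			return err
		}
		rb, err := sb.CreateBucketIfNotExists([]byte(rev))
		if err != nil {
			return err
		}
		// e.g. "manifest:constraint:github.com/pkg/errors" -> "^0.8.0"
		for k, v := range fields {
			if err := rb.Put([]byte(k), []byte(v)); err != nil {
				return err
			}
		}
		return nil
	})
}
```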

@sdboyer sdboyer added the Epic label Jun 8, 2017
@sdboyer
Member

sdboyer commented Jun 8, 2017

AMAZING!! 🎉 🎉 🎉 🎊 - this is something that would make a huge difference for users, and i've got nothing even close to the time required to work on it.

Looking at dep's rawManifest for inspiration makes me think that most of the underlying types can be cleanly converted to string fields anyway

Unfortunately, while that's adequate for a written file manifest/lock's purposes (by design), it likely won't be for gps' internal purposes. There are two issues I can think of off the top of my head:

  • branchVersion has the default field, which is not expressed even in typedString() output. We could add it, but...
  • semver.Version has an original field that...well, i guess the String() method prints it out, but that field has bitten me so many times in subtle ways that i just don't trust it anymore. (OTOH, i'm not sure there's anything we can do EXCEPT stringify it, as semver doesn't export those fields).

And that's just with the versions. I generally tried not to allow such gotchas, but it wasn't a design goal that I explicitly kept track of, so I can't say with certainty that there aren't more such issues.
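A toy Go example of the round-trip hazard described above (not gps's actual types, just an illustration of how stringifying drops unexported state like a default-branch flag):

```go
package main

import "fmt"

type branchVersion struct {
	name      string
	isDefault bool // not expressed in the string form
}

func (v branchVersion) String() string { return v.name }

func parseBranch(s string) branchVersion {
	return branchVersion{name: s} // isDefault is unrecoverable from the string
}

func main() {
	orig := branchVersion{name: "master", isDefault: true}
	got := parseBranch(orig.String())
	fmt.Println(orig == got) // false: the default flag was silently lost
}
```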

Notwithstanding all that, I think it's still the right call to prototype this with strings, where it works. Layering in an encoding system later is not difficult. And, more generally, I'd strongly prefer to keep this proceeding in smaller chunks with smaller PRs (even if they have no immediate effect), rather than building up something huge and unreviewable.

@jmank88
Collaborator

jmank88 commented Aug 28, 2017

@sdboyer Have you given any thought to how the cache timeout should be configured/specified? My working branch is currently using a time.Duration -cache-age flag, but this PR/comment is making me wonder about environment vars.
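For reference, the flag described there is the standard library's flag.Duration; this sketch shows only the parsing, with a hypothetical default value:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	cacheAge := flag.Duration("cache-age", 24*time.Hour,
		"max age of cached source data; negative disables the persistent cache")
	flag.Parse()
	fmt.Println("cache age:", *cacheAge)
}
```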

@sdboyer
Member

sdboyer commented Aug 28, 2017

@jmank88 i'm avoiding thinking about TTLs thus far, as the information i'd like to see in the cache for the first pass should, i think, all be permanently cacheable by revision. only stuff like the version list lookups would be...oh wait, ugh. i guess it's unavoidable that we have to do something with it now.

for now, let's just do something quick and dirty. like a 60s cache. initially, at least, i'd rather we not expose a choice around this to users.
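A quick-and-dirty sketch of that kind of TTL, with hypothetical types: store a timestamp beside each cached version list and treat anything older than the cutoff as a miss:

```go
package cache

import "time"

// versionEntry pairs a cached version list with the time it was written.
type versionEntry struct {
	versions []string  // stringified version list for a source
	written  time.Time // when the entry was cached
}

// fresh reports whether the entry is still usable under maxAge, e.g. a
// hard-coded 60 * time.Second for the quick-and-dirty first pass.
func (e versionEntry) fresh(maxAge time.Duration) bool {
	return time.Since(e.written) < maxAge
}
```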

is it possible that we get some incremental PRs in, and do some branching by abstraction/feature flags, even if they're internal ones?

@jmank88
Collaborator

jmank88 commented Aug 28, 2017

for now, let's just do something quick and dirty. like a 60s cache. initially, at least, i'd rather we not expose a choice around this to users.

I thought users would at least need/want some way of running w/o the cache, or to force a clean cache, but one minute is a lot shorter than I was imagining.

is it possible that we get some incremental PRs in, and do some branching by abstraction/feature flags, even if they're internal ones?

I've been trying to get incremental PRs in as fast as I can. Currently blocked by #1024 and #1025.
The WIP branch as currently written does not operate as a hidden feature flag, but passing -cache-age < 0 will suppress the persistent cache and instantiate the in-mem cache only, as master does now. This flag should eventually default to a positive value (e.g. 24h, or the 1m suggested above), but could start with a negative default so the new code doesn't run unless requested.
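A sketch of that selection logic, with hypothetical constructor names and a stand-in interface in place of the real cache implementations:

```go
package cache

import "time"

// sourceCache is a stand-in for gps's per-source cache interface.
type sourceCache interface{}

func newMemoryCache() sourceCache { return struct{}{} } // in-mem only, as on master

func newBoltCache(maxAge time.Duration) sourceCache { return struct{}{} } // persistent, TTL-bounded

// newSourceCache picks an implementation from the -cache-age flag value:
// a negative duration suppresses the persistent cache entirely.
func newSourceCache(cacheAge time.Duration) sourceCache {
	if cacheAge < 0 {
		return newMemoryCache()
	}
	return newBoltCache(cacheAge)
}
```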

@jmank88
Collaborator

jmank88 commented Sep 25, 2017

I'm seeing a speedup in every area of 'solver wall time' except for b-list-versions (unfortunately, most of the cost). I believe the problem is that the current code doesn't even have the opportunity to use the cached data. I discovered that requiring sourceIsSetUp necessarily hits the network the first time, for all source types (no matter the state of the cache dir or the bolt db). It triggers maybeSource.try, which for git calls listVersions, while the other types perform a (less costly) ping.

@sdboyer Can sourceIsSetUp be verified without listing versions? Is it even necessary to ping at all?

@sdboyer
Member

sdboyer commented Oct 1, 2017

@jmank88 it's not strictly necessary to reach out and ping - we should be able to operate solely on information already in the cache (or none at all). this is basically a deficiency in the design of the FSM governing source interactions, but i opted for it initially just to simplify the model. there are two major aspects of that simplification.

first, right now, the only way to know which of the possible maybeSource URLs to use is to try them. in general, that entails network interactions. if we take a repo existing in the sources cache locally as an indication of success, then it opens the door to the possibility of network failure later. to preserve equivalence to existing behavior, we would then have to fall back to more maybeSource-like searching behavior.

but, this messes with a bunch of other things. for one, the sourceCoordinator is currently built around the idea that, if it gets a sourceGateway back for a given input, that will definitively be the sourceGateway moving forward, and that all URL mappings to it kept by the sourceCoordinator are correct. if those URLs change underfoot based on the adaptive try-more-maybeSources logic, it breaks this assumption. (i do have a solution in mind for this that involves a bit of refactoring and abstraction, though)

i think there are also a number of other assumptions woven in there that i can't even conjure in the abstract - it'd take me digging in to really see them, and i'd no doubt discover more along the way. the assurance that exactly one object in memory is responsible for managing each unique location on disk seems like a major one.

the second big simplification is that just always requiring us to hit the net kept us in the comparatively-simpler-to-think-about world where we were really only ever operating on the most current version of upstream's reality. purely local information mostly just doesn't matter in dep today. this allowed me to defer on a bunch of trickier modeling questions that, while quite crucial in the long run (if we can't trust local reality, then dep is still crippled by left-pad events, even when code is committed to vendor - the very thing that's supposed to protect against that), i adjudged to not be strictly required for the first round of work.

prior to the introduction of persistent caching, the benefits of relying on local realities were also far fewer, and didn't outweigh the costs. now that you're working on persistent caching, though, the balance there has changed, and it's worth looking at this seriously.

side note: another benefit to revisiting the FSM and not requiring network activity (or even a local repo) will be that we can teach sourceGateway.exportVersionTo() to grab tarballs from hosting providers that support it directly, rather than requiring a full clone and working from that. that'll be huge in CI for projects that don't commit vendor.

@jmank88
Collaborator

jmank88 commented Oct 7, 2017

I did some preliminary refactoring and I'm seeing much better results now. Here are some example solver times (with the insignificant items omitted) from running dep ensure -v --no-vendor on cockroachdb.

Without cache:

Solver wall times by segment:
     b-list-versions: 1m38.935666232s
         b-list-pkgs:   11.796190703s
              b-gmal:    5.107555114s
     b-source-exists:    2.171482904s
...
  TOTAL: 1m58.564169279s

With cache:

Solver wall times by segment:
     b-list-versions: 54.052503576s
     b-source-exists:   2.26689633s
         b-list-pkgs:  436.869786ms
              b-gmal:  296.971425ms
...
  TOTAL: 57.600790947s

I still haven't fully solved the case where the mapped URL changes. I'll try to isolate some of the other changes into separate PRs first.

@sdboyer
Member

sdboyer commented Oct 8, 2017

a bit weird that there'd be so much of a drop in b-list-versions here, but hey, it's good to see the improvements 😄

tbh, i'm a bit surprised we're not seeing numbers in the <100ms, or even <10ms range. not a huge concern for right now, though - we're still in the run/right phases of "make it run, make it right, make it fast" within the scope of this particular feature. for now, i'll take <1s and be happy 😄

@jmank88
Collaborator

jmank88 commented Oct 8, 2017

Indeed, after more tinkering I'm seeing total runtime around 1m w/o cache and 10s w/ cache, so I suspect that there were superfluous and/or redundant network calls being made and we'll get some speedup even before integrating the cache.

@sdboyer
Member

sdboyer commented Oct 8, 2017

huh. redundant network calls in the currently-active logic, or in part of the new layer you're introducing? (what did you change when tinkering?)

@jmank88
Collaborator

jmank88 commented Oct 8, 2017

I'm working on isolating it now to find out for sure, but I have only been messing with the gateways, not the cache itself.

@sdboyer
Member

sdboyer commented Oct 9, 2017

that's (tentatively) great news. I would love nothing more than to have written some bugs that result in unnecessary net requests - that'd mean across-the-board speedups for everybody when you find them!

the huge gain here will be obviating the need for any network calls on git projects in Gopkg.lock and already in the cachedir by deferring the initial network call until sometime after the maybeSource.try() - but that's orthogonal to caching, and likely separate from what you're seeing anyway.
