Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster cache in find #561

Merged
merged 2 commits into from
Jun 23, 2023
Merged

Faster cache in find #561

merged 2 commits into from
Jun 23, 2023

Conversation

slevang
Copy link
Contributor

@slevang slevang commented Jun 22, 2023

Fixes #560.

set didn't immediately work because previous is a dict and not hashable. Converting to a frozenset is also not straightforward because these are nested dicts.

Couple questions (haven't even run tests yet so these may be answered by CI):

  1. do we need to convert this back to a list in self.dircache.update(cache_entries)? yes, done
  2. should we hash using previous["id"] or some other/better attribute to handle versioned objects? I see id has this information for files, but DIRECTORY doesn't have an id attr so we would need a conditional there to use name.

@martindurant
Copy link
Member

How does this compare in run time to the original? I am happy with the implementation, a shame for the little extra complexity.

@ianthomas23 , do you think there is an argument for implementing dircache with dictionaries rather than lists in general? We do iterating of these mostly, but also in checks (the issue being fixed here).

@slevang
Copy link
Contributor Author

slevang commented Jun 23, 2023

How does this compare in run time to the original? I am happy with the implementation, a shame for the little extra complexity.

Almost identical, brought the 20s>10m regression on my use case down to like 21s. Hashing a dict (or set) is very fast compared to list as you suggested when the size grows large.

As mentioned I tried the set approach but it isn't straightforward to hash the cache entries which are dictionaries. A dict is basically a set with an explicit key for each entry. Then the question is just what to use for the key. I'm not totally sure what the best option is.

Here's what the cache entries actually look like for a file:

{'bucket': 'global-forecast-system',
 'contentType': 'binary/octet-stream',
 'crc32c': 'mdZlGQ==',
 'etag': 'CNPA6vLj2P8CEAE=',
 'generation': '1687502286659667',
 'id': 'global-forecast-system/gdas.20230623/00/atmos/gdas.t00z.abias/1687502286659667',
 'kind': 'storage#object',
 'md5Hash': 'PL5CFZROTL4FQlGN9ENQ2Q==',
 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/global-forecast-system/o/gdas.20230623%2F00%2Fatmos%2Fgdas.t00z.abias?generation=1687502286659667&alt=media',
 'metageneration': '1',
 'name': 'global-forecast-system/gdas.20230623/00/atmos/gdas.t00z.abias',
 'selfLink': 'https://www.googleapis.com/storage/v1/b/global-forecast-system/o/gdas.20230623%2F00%2Fatmos%2Fgdas.t00z.abias',
 'size': 972949,
 'storageClass': 'STANDARD',
 'timeCreated': '2023-06-23T06:38:06.697Z',
 'timeStorageClassUpdated': '2023-06-23T06:38:06.697Z',
 'type': 'file',
 'updated': '2023-06-23T06:38:06.697Z'}

These all look hashable which would allow converting to a frozenset, but I ran into other objects that also had a metadata attribute that was itself a dict, hence we have a nested dict.

and a directory:

{'Key': 'gdas.20230623/00/atmos',
 'Size': 0,
 'StorageClass': 'DIRECTORY',
 'name': 'global-forecast-system/gdas.20230623/00/atmos',
 'size': 0,
 'type': 'directory'}

I used name because it is available on both. But it looks like for files id stores the additional identifier that allows GCS to handle versioned objects. Should we use that instead for files?

@ianthomas23
Copy link
Contributor

@martindurant It looks like fsspec.dircache.DirCache itself is implemented using a dictionary

https://github.com/fsspec/filesystem_spec/blob/348d4ab5794de3b06d9899c1fc2fd2b418fe9109/fsspec/dircache.py#L48C16-L48C16

but there are some uses of it that are list-like rather than dict-like, such as _ls_from_cache and _list_objects. It seems sensible to consider more dict-like behaviour, but presumably we'd need to set up a benchmarking framework to investigate first to ensure that we're spending our time and effort wisely. Which may or may not be a priority?

@martindurant
Copy link
Member

It looks like fsspec.dircache.DirCache itself is implemented using a dictionary

It is a dict of lists; I am talking about a dict of dicts.

I used name because it is available on both.

I think using the name is fine. In this case, we are not doing versioned listing, which isn't available via the list API anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

find performance regression
3 participants