Faster cache in `find` #561

slevang · 2023-06-22T18:35:40Z

Fixes #560.

set didn't immediately work because previous is a dict and not hashable. Converting to a frozenset is also not straightforward because these are nested dicts.

Couple questions (haven't even run tests yet so these may be answered by CI):

~~do we need to convert this back to a list in self.dircache.update(cache_entries)?~~ yes, done
should we hash using previous["id"] or some other/better attribute to handle versioned objects? I see id has this information for files, but DIRECTORY doesn't have an id attr so we would need a conditional there to use name.

gcsfs/core.py

martindurant · 2023-06-23T12:58:53Z

How does this compare in run time to the original? I am happy with the implementation, a shame for the little extra complexity.

@ianthomas23 , do you think there is an argument for implementing dircache with dictionaries rather than lists in general? We do iterating of these mostly, but also in checks (the issue being fixed here).

slevang · 2023-06-23T13:44:03Z

How does this compare in run time to the original? I am happy with the implementation, a shame for the little extra complexity.

Almost identical, brought the 20s>10m regression on my use case down to like 21s. Hashing a dict (or set) is very fast compared to list as you suggested when the size grows large.

As mentioned I tried the set approach but it isn't straightforward to hash the cache entries which are dictionaries. A dict is basically a set with an explicit key for each entry. Then the question is just what to use for the key. I'm not totally sure what the best option is.

Here's what the cache entries actually look like for a file:

{'bucket': 'global-forecast-system',
 'contentType': 'binary/octet-stream',
 'crc32c': 'mdZlGQ==',
 'etag': 'CNPA6vLj2P8CEAE=',
 'generation': '1687502286659667',
 'id': 'global-forecast-system/gdas.20230623/00/atmos/gdas.t00z.abias/1687502286659667',
 'kind': 'storage#object',
 'md5Hash': 'PL5CFZROTL4FQlGN9ENQ2Q==',
 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/global-forecast-system/o/gdas.20230623%2F00%2Fatmos%2Fgdas.t00z.abias?generation=1687502286659667&alt=media',
 'metageneration': '1',
 'name': 'global-forecast-system/gdas.20230623/00/atmos/gdas.t00z.abias',
 'selfLink': 'https://www.googleapis.com/storage/v1/b/global-forecast-system/o/gdas.20230623%2F00%2Fatmos%2Fgdas.t00z.abias',
 'size': 972949,
 'storageClass': 'STANDARD',
 'timeCreated': '2023-06-23T06:38:06.697Z',
 'timeStorageClassUpdated': '2023-06-23T06:38:06.697Z',
 'type': 'file',
 'updated': '2023-06-23T06:38:06.697Z'}

These all look hashable which would allow converting to a frozenset, but I ran into other objects that also had a metadata attribute that was itself a dict, hence we have a nested dict.

and a directory:

{'Key': 'gdas.20230623/00/atmos',
 'Size': 0,
 'StorageClass': 'DIRECTORY',
 'name': 'global-forecast-system/gdas.20230623/00/atmos',
 'size': 0,
 'type': 'directory'}

I used name because it is available on both. But it looks like for files id stores the additional identifier that allows GCS to handle versioned objects. Should we use that instead for files?

ianthomas23 · 2023-06-23T13:55:18Z

@martindurant It looks like fsspec.dircache.DirCache itself is implemented using a dictionary

https://github.com/fsspec/filesystem_spec/blob/348d4ab5794de3b06d9899c1fc2fd2b418fe9109/fsspec/dircache.py#L48C16-L48C16

but there are some uses of it that are list-like rather than dict-like, such as _ls_from_cache and _list_objects. It seems sensible to consider more dict-like behaviour, but presumably we'd need to set up a benchmarking framework to investigate first to ensure that we're spending our time and effort wisely. Which may or may not be a priority?

martindurant · 2023-06-23T13:59:30Z

It looks like fsspec.dircache.DirCache itself is implemented using a dictionary

It is a dict of lists; I am talking about a dict of dicts.

I used name because it is available on both.

I think using the name is fine. In this case, we are not doing versioned listing, which isn't available via the list API anyway.

slevang added 2 commits June 22, 2023 14:26

use dict instead of list

97106e0

convert cache dicts back to list

f80337d

martindurant reviewed Jun 22, 2023

View reviewed changes

gcsfs/core.py Show resolved Hide resolved

martindurant merged commit f345b83 into fsspec:main Jun 23, 2023

martindurant mentioned this pull request Jun 24, 2023

Stale cache info leading to failing isfile check #562

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster cache in `find` #561

Faster cache in `find` #561

slevang commented Jun 22, 2023 •

edited

Loading

martindurant commented Jun 23, 2023

slevang commented Jun 23, 2023 •

edited

Loading

ianthomas23 commented Jun 23, 2023

martindurant commented Jun 23, 2023

Faster cache in find #561

Faster cache in find #561

Conversation

slevang commented Jun 22, 2023 • edited Loading

martindurant commented Jun 23, 2023

slevang commented Jun 23, 2023 • edited Loading

ianthomas23 commented Jun 23, 2023

martindurant commented Jun 23, 2023

Faster cache in `find` #561

Faster cache in `find` #561

slevang commented Jun 22, 2023 •

edited

Loading

slevang commented Jun 23, 2023 •

edited

Loading