Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mixup of paths in put function #1179

Closed
wants to merge 1 commit into from
Closed

Conversation

quassy
Copy link

@quassy quassy commented Feb 7, 2023

put currently sorts the list of input paths so when they are not in order before already, they don't match their output paths anymore and file names get mixed up. This fixes fsspec/gcsfs#468

`put` currently sorts the list of input paths so when they are not in order before already, the don't match their output paths anymore and file names get mixed up. This fixes fsspec/gcsfs#468
@quassy
Copy link
Author

quassy commented Feb 7, 2023

Note: In for lpath, rpath in callback.wrap(zip(lpaths, rpaths)):, zip should probably be called with strict=True.

I didn't remove sorting completely to stay backwards compatible and also because rm seems to rely on the returned order.

@martindurant
Copy link
Member

@ianthomas23 , do you have time to review this?

@ianthomas23
Copy link
Collaborator

@ianthomas23 , do you have time to review this?

Yes, I can take a look tomorrow morning, UK time.

@ianthomas23
Copy link
Collaborator

@quassy Thanks for looking at this. I think the solution ends up being more complicated, as is often the case. At the end of expand_path if you don't sort out then your return is in the order that out gives you, and out is a set, so the order is not the same as the original order. A set is used because we want lookups to be quick.

To proceed with this I think there are a few options:

  1. Use an ordered set, i.e. a set that preserves insertion order. There isn't an ordered set in the python standard library though. There is a package on PyPI but it isn't very actively maintained and we generally try to avoid adding new dependencies to fsspec.
  2. Use an OrderedDict instead of the set, as there is one in the standard library. We only really want it for the keys so the values can be anything that may or may not be useful.
  3. Keep the set but also add a list or similar as well, so we can keep the quick lookup but also the order that was passed in.

Personally I think option 2 is the best, although there might be alternatives that I haven't thought of yet. Are you prepared to try out an OrderedDict in this PR?

@ianthomas23
Copy link
Collaborator

Issue #897 is related to this too.

@quassy
Copy link
Author

quassy commented Feb 8, 2023

@ianthomas23 Thanks as well for the quick response! Everything is trivial, except...

I'm not actually sure anymore if expand_path for lpath makes sense at all and what the expected behaviour of put is, when you do something like this below, because the order of glob is not really known:

# src_directory/c.txt
# src_directory/b.txt
put(lpath=["src_directory/*.txt"], rpath=["gs://dst_bucket/c_path/", "gs://dst_bucket/b_path/"])
# at the moment b.txt would be written to c_path

put(lpath=["src_directory/*.txt"], rpath=["gs://dst_bucket/c_path/", "gs://dst_bucket/b_path/", "gs://dst_bucket/a_path/"])
# this would fail as other_paths() requires the two resulting lists to be the same length

Some ideas which I'm all not super happy with:

  1. Maybe it would be cleaner if put() didn't expand paths at all. Rather a library user does that on their own before and then pass two ordered lists (or a list of tuples (lpath, rpath)) that remain mostly untouched. Quite bad for backwards compability.
  2. Only call expand_path() on lpath if the second argument is a string (or a list with 1 element) so order does not matter and it's clear where glob-matched files get written to.
  3. To be backwards compatible add a parameter to put() called expand_lpath=True so one can set it to False and skip the expand_path call as well as the unintentional reordering. But this would still lead to unexpected behaviour if you didn't know that expand_path is actually reordering your inputs.

@ianthomas23
Copy link
Collaborator

This is what I use most commonly which does need the wildcard expansion:

fs.put(lpath="source/*", rpath="target")

but the outcome is the same regardless of sorting or not as the target filenames are determined one a time from each of the wildcard-expanded source filenames. Actually this doesn't do what it should in all circumstances (I am looking into this) so what I tend to use is the inferred wildcard for lpath:

fs.put(lpath="source/", rpath="target")

Of your 3 options I don't like 1 or 3, so 2 seems the best (or least bad anyway!).

What should happen in various scenarios? If a user specifies a list for either lpath or rpath then we should not sort the list at all, it would definitely be wrong to do so. If one of lpath and rpath is a list and the other is a single string with a wildcard (or maybe just a trailing slash?) then the expanded list has to be sorted as it has to be deterministic, we can't rely upon glob returning any particular order. What if both lpath and rpath are wildcards? We have to sort both to be consistent with the other behaviour. It seems like a dangerous thing for a user to try to do, they would do better to stick with your approach of explicitly passing both as lists which shouldn't be resorted. We could prevent wildcard expansion on both in the same call but I don't think that is necessary if we are sorting both as somebody will want to make use of this approach even though I personally would avoid it.

I know this needs more thought, but so far I think this brings us back to changing expand_paths to not change the order, unless it is doing wildcard expansion when it needs to sort. I don't know yet if it needs an input argument to override this behaviour, and it might even need to return some flag to tell the caller if sorting (or wildcard expansion) has been performed. I think the latter might help with some other related problems.

We have made a few changes in this area of the code recently, so if we need to do something backwards-incompatible then this might be a good time to do it.

@martindurant
Copy link
Member

@ianthomas23 , happy to have a live chat about the best way forward here

@ianthomas23
Copy link
Collaborator

@ianthomas23 , happy to have a live chat about the best way forward here

Good idea, I have booked a slot in your calendar tomorrow.

@martindurant
Copy link
Member

@quassy , can you please check if this is still an issue given @ianthomas23 's work in the area?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gcsfs put copies files out of order
3 participants