
HfFileSystem's transaction is working counterintuitively #1733

Closed

TwoAbove opened this issue Oct 13, 2023 · 8 comments
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

@TwoAbove

Describe the bug

Hey!

I'm trying to optimize some code that updates an HF dataset.

In short, here's the pseudo-code that best describes what I'm doing:

import os

import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token=os.environ["HF_TOKEN"])

...

    def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
        # Overwrite an existing chunk in place
        chunks = self._get_chunk_names()
        with fs.open(f"{self.fs_path}/train-{chunk_num:08d}-of-{len(chunks):08d}.parquet", "wb") as f:
            df.to_parquet(f)

    def _new_chunk(self, df: pd.DataFrame) -> None:
        # Rename all existing chunks so the "-of-NNNNNNNN" suffix reflects the new total
        chunks = self._get_chunk_names()
        for chunk in chunks:
            key = int(chunk.split("-")[1])  # index parsed from "train-XXXXXXXX-of-YYYYYYYY.parquet"
            fs.mv(f"{self.fs_path}/{chunk}", f"{self.fs_path}/train-{key:08d}-of-{len(chunks) + 1:08d}.parquet")

        # Write the new (empty) chunk as the last one
        with fs.open(f"{self.fs_path}/train-{len(chunks):08d}-of-{len(chunks) + 1:08d}.parquet", "wb") as f:
            df.to_parquet(f)


...


with fs.transaction:
    for index, row in tqdm(messages.iterrows()):
        if len(latest_chunk) >= DATASET_CHUNK_SIZE:
            # The current chunk is full: flush it, then start a new one
            self._update_chunk(latest_chunk, latest_chunk_num)
            latest_chunk = pd.DataFrame(columns=schema)
            latest_chunk_num += 1
            self._new_chunk(latest_chunk)

What I expect fs.transaction to do is group the renaming and writing operations until the transaction ends, and then commit everything to the HF dataset in a single commit.

The issue is that, currently, it does not group the changes: they are committed separately, so we can very quickly hit API request limits. We do these chunked updates because the GitHub runner we're using can't handle downloading the HF dataset and updating it in memory, so this is the workaround we came up with. We could use larger machines, but that's not sustainable in the long run - we would hit the same wall eventually.

I've looked at the implementations in hf_api and hf_file_system, and I couldn't find a way to implement this kind of batching - I guess it needs server-side support.

Is this something that's possible to do? Am I missing anything?
Maybe someone can propose some other method to push new rows to an HF Dataset?

Thanks!

P.S.
Here's the PR that I proposed in our repo to solve the OOM issue we were seeing: https://github.com/LAION-AI/Discord-Scrapers/pull/2/files
And here are some discussions about this in the dataset itself: https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3 https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4

Reproduction

No response

Logs

No response

System info

- huggingface_hub version: 0.18.0
- Platform: Linux-6.5.3-x64v3-xanmod1-x86_64-with-glibc2.37
- Python version: 3.11.4
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/twoabove/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.1
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.0
- pydantic: N/A
- aiohttp: 3.8.6
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /home/twoabove/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /home/twoabove/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/twoabove/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
@TwoAbove added the bug label Oct 13, 2023
@Wauplin
Contributor

Wauplin commented Oct 13, 2023

Hi @TwoAbove, thanks for describing your issue. I assume this is more a feature request than a bug. We don't have any transaction-specific implementation in HfFileSystem (from https://filesystem-spec.readthedocs.io/en/latest/features.html: "The implementation of the details is file-system specific (and not all support it yet)"). @mariosasko @lhoestq what do you think about raising a NotImplementedError here instead? I actually don't understand why that isn't the default in fsspec, since there cannot be a generic implementation for transactions.

As you said, implementing proper transactions requires work both client-side and server-side, and it will most likely not happen anytime soon. However, a solution introduced in v0.18.0 is to upload files one by one and make a single deferred commit at the end - which is the closest thing we have to transactions 😄.

To do this, you'll need to use preupload_lfs_files before create_commit from HfApi (so not HfFileSystem). This is meant for advanced usage, but I think yours is definitely one such use case. Here are some resources that can be helpful to you:
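For illustration, here is a minimal sketch of that pattern (the repo id and file names are hypothetical, and the files are assumed to live in a dataset repo):

import os

from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "my-org/my-dataset"  # hypothetical repo id

operations = []
for path in ["train-00000000-of-00000002.parquet", "train-00000001-of-00000002.parquet"]:
    op = CommitOperationAdd(path_in_repo=path, path_or_fileobj=path)
    # Upload the LFS file content right away, without creating a commit yet
    api.preupload_lfs_files(repo_id, additions=[op], repo_type="dataset")
    operations.append(op)

# A single commit referencing all the pre-uploaded files
api.create_commit(
    repo_id,
    operations=operations,
    commit_message="Update dataset chunks in one commit",
    repo_type="dataset",
)

This way, the large files are uploaded as you go, but only one commit (and thus one API write) is created at the end.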

Hope this will help you. Please let me know if you have any questions.

@Wauplin added the enhancement label Oct 13, 2023
@mariosasko
Contributor

mariosasko commented Oct 13, 2023

Raising a NotImplementedError in HfFileSystem.transaction and HfFileSystem.start_transaction sounds good to me 🙂.
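A minimal sketch of what that could look like, following fsspec's AbstractFileSystem interface (the actual change is in the PR linked below):

from fsspec import AbstractFileSystem

class HfFileSystem(AbstractFileSystem):
    @property
    def transaction(self):
        # Transactions would require server-side support, so fail loudly
        # instead of silently committing each operation separately.
        raise NotImplementedError("Transactions are not supported by HfFileSystem.")

    def start_transaction(self):
        raise NotImplementedError("Transactions are not supported by HfFileSystem.")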

@Wauplin
Contributor

Wauplin commented Oct 13, 2023

Raising a NotImplementedError in HfFileSystem.transaction and HfFileSystem.start_transaction sounds good to me 🙂.

Opened a PR for that: #1736


Otherwise @TwoAbove, please let me know if you need further help; if not, I think we can close this issue. We will not implement transactions, so explicitly raising an exception is fine for now.

@TwoAbove
Author

That makes sense. Thanks!

@TwoAbove
Author

Also @Wauplin, thanks for the suggestion to upload files one by one and make a single deferred commit. That pretty much solves the main issue we were anticipating!

@TwoAbove
Author

@Wauplin Do you know if it's possible to do renaming with the same logic as #1699 (i.e., bulk renaming)?

@Wauplin
Contributor

Wauplin commented Oct 13, 2023

When renaming, you don't need the "pre-upload" step since no new file content is uploaded. So yes, in the same create_commit call you can have a few CommitOperationAdd operations (pre-uploaded or not), some CommitOperationDelete operations, and copies (a rename being a CommitOperationCopy plus a CommitOperationDelete), all together in the same commit. You just have to make sure not to have more than ~100 operations in the same commit, as you might encounter a timeout from the server at some point. This value is empirical and depends on your repo, so I can't promise anything for sure.
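As a rough sketch of what such a combined commit could look like - here a rename is expressed as a copy plus a delete via CommitOperationCopy (which works for LFS files such as parquet), and the repo id and file names are hypothetical:

import os

from huggingface_hub import (
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
    HfApi,
)

api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "my-org/my-dataset"  # hypothetical repo id

operations = [
    # "Rename" the existing chunk: copy it to its new name, then delete the old one
    CommitOperationCopy(
        src_path_in_repo="train-00000000-of-00000001.parquet",
        path_in_repo="train-00000000-of-00000002.parquet",
    ),
    CommitOperationDelete(path_in_repo="train-00000000-of-00000001.parquet"),
    # Add the new chunk (it could also be pre-uploaded with preupload_lfs_files)
    CommitOperationAdd(
        path_in_repo="train-00000001-of-00000002.parquet",
        path_or_fileobj="train-00000001-of-00000002.parquet",
    ),
]

# All operations land in a single commit
api.create_commit(
    repo_id,
    operations=operations,
    commit_message="Rename chunks and append a new one",
    repo_type="dataset",
)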

@TwoAbove
Author

Sounds good. Thanks again!
