
HfFileSystem's transaction is working counterintuitively #1733

Closed

TwoAbove opened this issue Oct 13, 2023 · 8 comments
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

@TwoAbove

Describe the bug

Hey!

I'm trying to optimize some code that updates an HF dataset.

In short, here's the pseudo-code that best describes what I'm doing:

import os

import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token=os.environ["HF_TOKEN"])

...

    def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
        # Overwrite an existing chunk in place
        chunks = self._get_chunk_names()
        with fs.open(f"{self.fs_path}/train-{chunk_num:08d}-of-{len(chunks):08d}.parquet", "wb") as f:
            df.to_parquet(f)

    def _new_chunk(self, df: pd.DataFrame) -> None:
        # Rename all existing chunks so the "-of-NNNNNNNN" suffix reflects the new total
        chunks = self._get_chunk_names()
        for chunk in chunks:
            key = int(chunk.split("-")[1])  # index parsed from "train-XXXXXXXX-of-YYYYYYYY.parquet"
            fs.mv(f"{self.fs_path}/{chunk}", f"{self.fs_path}/train-{key:08d}-of-{len(chunks) + 1:08d}.parquet")

        # Write the new (empty) chunk as the last one
        with fs.open(f"{self.fs_path}/train-{len(chunks):08d}-of-{len(chunks) + 1:08d}.parquet", "wb") as f:
            df.to_parquet(f)


...


with fs.transaction:
    for index, row in tqdm(messages.iterrows()):
        if len(latest_chunk) >= DATASET_CHUNK_SIZE:
            # The current chunk is full: flush it, then start a new one
            self._update_chunk(latest_chunk, latest_chunk_num)
            latest_chunk = pd.DataFrame(columns=schema)
            latest_chunk_num += 1
            self._new_chunk(latest_chunk)

What I expect fs.transaction to do is group the renaming and writing operations until the transaction ends, and then commit everything to the HF dataset in a single commit.

The issue is that, currently, it does not group the changes: they are committed separately, so we can very quickly hit API request limits. We do these chunked updates because the GitHub runner we're using can't handle downloading the HF dataset and updating it in memory, so this is the workaround we came up with. We could use larger machines, but that's not sustainable in the long run - we would hit the same wall eventually.

I've looked at the implementations in hf_api and hf_file_system, and I couldn't find a way to implement this kind of batching - I guess it needs server-side support.

Is this something that's possible to do? Am I missing anything?
Maybe someone can propose some other method to push new rows to an HF Dataset?

Thanks!

P.S.
Here's the PR that I proposed in our repo to solve the OOM issue we were seeing: https://github.com/LAION-AI/Discord-Scrapers/pull/2/files
And here are some discussions about this in the dataset itself: https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3 https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4

Reproduction

No response

Logs

No response

System info

- huggingface_hub version: 0.18.0
- Platform: Linux-6.5.3-x64v3-xanmod1-x86_64-with-glibc2.37
- Python version: 3.11.4
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/twoabove/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.1
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.0
- pydantic: N/A
- aiohttp: 3.8.6
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /home/twoabove/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /home/twoabove/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/twoabove/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
@TwoAbove added the bug label Oct 13, 2023
@Wauplin
Contributor

Wauplin commented Oct 13, 2023

Hi @TwoAbove, thanks for describing your issue. I assume this is more a feature request than a bug. We don't have any transaction-specific implementation in HfFileSystem (from https://filesystem-spec.readthedocs.io/en/latest/features.html: "The implementation of the details is file-system specific (and not all support it yet)"). @mariosasko @lhoestq what do you think about raising a NotImplementedError here instead? I actually don't understand why that isn't the default in fsspec, since there cannot be a generic implementation for transactions.

As you said, implementing proper transactions requires work both client-side and server-side, and it will most likely not happen anytime soon. However, a solution introduced in v0.18.0 is to upload files one by one and make a single deferred commit at the end - which is the closest thing we have to transactions 😄.

To do this, you'll need to use preupload_lfs_files before create_commit from HfApi (so not HfFileSystem). This is meant for advanced usage, but I think yours is definitely one such use case. Here are some resources that can be helpful to you:
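For illustration, here is a minimal sketch of that pattern (the repo id and file names are hypothetical, and the files are assumed to live in a dataset repo):

import os

from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "my-org/my-dataset"  # hypothetical repo id

operations = []
for path in ["train-00000000-of-00000002.parquet", "train-00000001-of-00000002.parquet"]:
    op = CommitOperationAdd(path_in_repo=path, path_or_fileobj=path)
    # Upload the LFS file content right away, without creating a commit yet
    api.preupload_lfs_files(repo_id, additions=[op], repo_type="dataset")
    operations.append(op)

# A single commit referencing all the pre-uploaded files
api.create_commit(
    repo_id,
    operations=operations,
    commit_message="Update dataset chunks in one commit",
    repo_type="dataset",
)

This way, the large files are uploaded as you go, but only one commit (and thus one API write) is created at the end.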

Hope this will help you. Please let me know if you have any questions.

@Wauplin added the enhancement label Oct 13, 2023
@mariosasko
Contributor

mariosasko commented Oct 13, 2023

Raising a NotImplementedError in HfFileSystem.transaction and HfFileSystem.start_transaction sounds good to me 🙂.
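A minimal sketch of what that could look like, following fsspec's AbstractFileSystem interface (the actual change is in the PR linked below):

from fsspec import AbstractFileSystem

class HfFileSystem(AbstractFileSystem):
    @property
    def transaction(self):
        # Transactions would require server-side support, so fail loudly
        # instead of silently committing each operation separately.
        raise NotImplementedError("Transactions are not supported by HfFileSystem.")

    def start_transaction(self):
        raise NotImplementedError("Transactions are not supported by HfFileSystem.")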

@Wauplin
Contributor

Wauplin commented Oct 13, 2023

Raising a NotImplementedError in HfFileSystem.transaction and HfFileSystem.start_transaction sounds good to me 🙂.

Opened a PR for that: #1736


Otherwise @TwoAbove, please let me know if you need further help; if not, I think we can close this issue. We will not implement transactions, so explicitly raising an exception is fine for now.

@TwoAbove
Author

That makes sense. Thanks!

@TwoAbove
Author

Also @Wauplin, thanks for the suggestion to upload files one by one and make a single deferred commit. That pretty much solves the main issue we were anticipating!

@TwoAbove
Author

@Wauplin Do you know if it's possible to do renaming with the same logic as #1699 (i.e., bulk renaming)?

@Wauplin
Contributor

Wauplin commented Oct 13, 2023

When renaming, you don't need the "pre-upload" step since no new file content is uploaded. So yes, in the same create_commit call you can have a few CommitOperationAdd operations (pre-uploaded or not), some CommitOperationDelete operations, and copies (a rename being a CommitOperationCopy plus a CommitOperationDelete), all together in the same commit. You just have to make sure not to have more than ~100 operations in the same commit, as you might encounter a timeout from the server at some point. This value is empirical and depends on your repo, so I can't promise anything for sure.
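As a rough sketch of what such a combined commit could look like - here a rename is expressed as a copy plus a delete via CommitOperationCopy (which works for LFS files such as parquet), and the repo id and file names are hypothetical:

import os

from huggingface_hub import (
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
    HfApi,
)

api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "my-org/my-dataset"  # hypothetical repo id

operations = [
    # "Rename" the existing chunk: copy it to its new name, then delete the old one
    CommitOperationCopy(
        src_path_in_repo="train-00000000-of-00000001.parquet",
        path_in_repo="train-00000000-of-00000002.parquet",
    ),
    CommitOperationDelete(path_in_repo="train-00000000-of-00000001.parquet"),
    # Add the new chunk (it could also be pre-uploaded with preupload_lfs_files)
    CommitOperationAdd(
        path_in_repo="train-00000001-of-00000002.parquet",
        path_or_fileobj="train-00000001-of-00000002.parquet",
    ),
]

# All operations land in a single commit
api.create_commit(
    repo_id,
    operations=operations,
    commit_message="Rename chunks and append a new one",
    repo_type="dataset",
)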

@TwoAbove
Author

Sounds good. Thanks again!
