
Synchronize between uploading file and generating index thread #4443

Open: wants to merge 64 commits into master

Conversation

@wwwjn (Contributor) commented Apr 18, 2023

Reasons for making this change

#4370 does not synchronize the file-upload thread and the index-generation thread, so it does not work for large files such as ImageNet (158 GB).
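The synchronization the PR describes could be sketched with a bounded queue that blocks the upload (producer) thread whenever the index (consumer) thread falls behind, which caps memory use for arbitrarily large files. All names below are illustrative, not the PR's actual implementation:

```python
import queue
import threading

# A small bounded queue: put() blocks when full, so the upload thread
# can never run arbitrarily far ahead of the index thread.
CHUNK_QUEUE = queue.Queue(maxsize=4)
SENTINEL = None  # marks end of stream

def upload_chunks(chunks, q):
    """Producer: push file chunks, blocking when the consumer lags."""
    for chunk in chunks:
        q.put(chunk)       # blocks while the queue is full
    q.put(SENTINEL)        # signal that the upload is finished

def generate_index(q, index):
    """Consumer: pull chunks and do (stand-in) index work on each."""
    while True:
        chunk = q.get()    # blocks while the queue is empty
        if chunk is SENTINEL:
            break
        index.append(len(chunk))  # stand-in for real index generation

index = []
chunks = [b"a" * 10, b"b" * 20, b"c" * 30]
producer = threading.Thread(target=upload_chunks, args=(chunks, CHUNK_QUEUE))
consumer = threading.Thread(target=generate_index, args=(CHUNK_QUEUE, index))
producer.start()
consumer.start()
producer.join()
consumer.join()
```

Because `queue.Queue` handles the locking internally, neither thread needs explicit condition variables; the `maxsize` bound is what keeps memory below a fixed ceiling.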

Related issues

Screenshots

Checklist

  • I've added a screenshot of the changes, if this is a frontend change
  • I've added and/or updated tests, if this is a backend change
  • I've run the pre-commit.sh script
  • I've updated docs, if needed

@epicfaace (Member) left a comment:

Can you please remove the .pyc files? Also, let's add them to .gitignore.

@@ -286,7 +286,7 @@ def _get_azure_sas_url(self, path, **kwargs):
account_name=AZURE_BLOB_ACCOUNT_NAME,
container_name=AZURE_BLOB_CONTAINER_NAME,
account_key=AZURE_BLOB_ACCOUNT_KEY,
expiry=datetime.datetime.now() + datetime.timedelta(hours=1),
Member:

Should we refactor this into a constant? Also, is 10 hours enough or might we need more for even larger files?
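The refactor the reviewer suggests might look like the sketch below: name the expiry window once instead of inlining a `timedelta` at the call site. The constant name and the 10-hour value are assumptions for illustration, not values from the PR:

```python
import datetime

# Hypothetical named constant for the SAS URL expiry window.
# A larger value tolerates slower transfers of very large files.
SAS_URL_EXPIRY_HOURS = 10  # illustrative value, not from the PR

def sas_expiry(now=None):
    """Return the expiry timestamp for a newly issued SAS URL."""
    now = now or datetime.datetime.now()
    return now + datetime.timedelta(hours=SAS_URL_EXPIRY_HOURS)
```

The call site would then pass `expiry=sas_expiry()`, so tuning the window for larger files becomes a one-line change.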

self._current_max_buf_length = len(self._bufs[0])
for i in range(1, self.NUM_READERS):
self._current_max_buf_length = max(self._current_max_buf_length, len(self._bufs[i]))
# print(f"find largest buffer: {self._current_max_buf_length} in thread: {threading.current_thread().name}")
Member:

remove comments

@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
"""
NUM_READERS = 2

# MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
Member:

What does this comment mean? Can you add a description of what MAX_BUF_SIZE is used for?

@@ -10,11 +10,16 @@ class MultiReaderFileStream(BytesIO):
"""
NUM_READERS = 2

# MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
MAX_BUF_SIZE = 1024 * 1024 * 1024 # 10 MiB for test
Member:

What value should it be for non-test?
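The question is sharpened by a mismatch in the diff itself: `1024 * 1024 * 1024` is 1 GiB, while the inline comment says "10 MiB for test". Named size constants, as in this minimal sketch (both limit values are illustrative assumptions), would make the intended value unambiguous:

```python
# Named size units make buffer limits self-documenting and prevent
# comment/value drift like "# 10 MiB" next to a 1 GiB literal.
MiB = 1024 * 1024
GiB = 1024 * MiB

MAX_BUF_SIZE_TEST = 10 * MiB  # small limit for tests (illustrative)
MAX_BUF_SIZE = 1 * GiB        # production limit (illustrative)
```

With separate test and production constants, the test value can stay small without ever being committed as the production limit by accident.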

self.find_largest_buffer()

def find_largest_buffer(self):
self._current_max_buf_length = len(self._bufs[0])
Member:

Add a docstring comment
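A sketch of what the requested docstring could say, together with a more idiomatic single-expression `max` that replaces the manual loop shown earlier. The class skeleton and attribute names are assumptions reconstructed from the snippets above, not the PR's actual code:

```python
class MultiReaderFileStream:
    """Minimal skeleton reconstructed from the review snippets."""
    NUM_READERS = 2

    def __init__(self):
        # One buffer per reader; lengths diverge as readers progress
        # at different speeds.
        self._bufs = [bytearray() for _ in range(self.NUM_READERS)]
        self._current_max_buf_length = 0

    def find_largest_buffer(self):
        """Record the length of the largest per-reader buffer.

        The writer can compare this against MAX_BUF_SIZE to decide
        whether it must block until the slowest reader catches up,
        keeping total memory usage bounded.
        """
        self._current_max_buf_length = max(len(b) for b in self._bufs)
```

The generator-expression form also drops the debug `print` the reviewer asked to remove.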

@@ -71,6 +71,7 @@ def test_not_found(self):

def check_file_target_contents(self, target):
"""Checks to make sure that the specified file has the contents 'hello world'."""
# This can not be checked, Since
Member:

Update comment?
