dvc: performance optimization for directories #1970
Sample script that seems to reproduce the user's problem:
After adding an update, md5 computation for the large directory is retriggered.
@pared when you are unprotecting, all files inside data are copied, so dvc doesn't have entries for those files in the State db, hence the recomputation.
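For illustration, a minimal sketch of the idea (a hypothetical inode/mtime/size table, not dvc's actual State schema):

import hashlib
import os
import sqlite3

# Hypothetical state table: same idea as dvc's State db, not its real schema.
# A file whose (inode, mtime, size) is unchanged reuses the stored md5.
db = sqlite3.connect("state.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS state "
    "(inode INTEGER PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)"
)

def cached_md5(path):
    st = os.stat(path)
    row = db.execute(
        "SELECT mtime, size, md5 FROM state WHERE inode = ?", (st.st_ino,)
    ).fetchone()
    if row and row[0] == st.st_mtime and row[1] == st.st_size:
        return row[2]  # unchanged file, no recomputation
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    # Copying a file (e.g. unprotect with cache.type=copy) creates a new inode,
    # so the lookup above misses and the hash has to be recomputed.
    db.execute(
        "INSERT OR REPLACE INTO state VALUES (?, ?, ?, ?)",
        (st.st_ino, st.st_mtime, st.st_size, digest),
    )
    db.commit()
    return digest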
After some research, I think we can remove the binary heuristic. Create 100,000 files with 3 KiB of random content and 10 files with 3 GiB of random content:
mkdir data
for file in {0..100000}; do
    dd if=/dev/urandom of=data/$file bs=3K count=1
done
mkdir other_data
for file in {0..10}; do
    dd if=/dev/urandom of=other_data/$file bs=100M count=30
done
Results, ordered from fastest to slowest operation:
time md5sum * > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m1.568s
# user 0m1.047s
# sys 0m0.496s
# Files: 10
# Size: 3 GiB
#
# real 7m15.048s
# user 1m47.444s
# sys 0m23.405s
# git lfs track "data/**"
time git add data
# Files: 100,000
# Size: 3 KiB
#
# real 1m23.442s
# user 0m39.889s
# sys 0m30.874s
time python -c "
import os
import hashlib
for file in os.listdir():
with open(file, 'rb') as fobj:
print(file, hashlib.md5(fobj.read()).hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.581s
# user 0m1.900s
# sys 0m0.643s
# Files: 10
# Size: 3 GiB
#
# real 8m46.469s
# user 1m2.379s
# sys 0m49.283s
time python -c "
import os
import hashlib
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
hash.update(data)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.753s
# user 0m1.932s
# sys 0m0.802s
# Files: 10
# Size: 3 GiB
#
# real 7m57.565s
# user 1m53.423s
# sys 0m21.162s
time python -c "
import os
import hashlib
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
chunk = data.replace(b'\r\n', b'\n')
hash.update(chunk)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.986s
# user 0m2.322s
# sys 0m0.644s
# Files: 10
# Size: 3 GiB
#
# real 7m26.300s
# user 2m34.908s
# sys 0m23.551s
time python -c "
import os
import hashlib
from dvc.istextfile import istextfile
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
binary = not istextfile(file)
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
if binary:
chunk = data
else:
chunk = data.replace(b'\r\n', b'\n')
hash.update(chunk)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m7.610s
# user 0m6.028s
# sys 0m1.528s
# Files: 10
# Size: 3 GiB
#
# real 7m44.754s
# user 1m53.498s
# sys 0m17.882s
time python -c "
import os
from dvc.utils import file_md5
for file in os.listdir():
print(file, file_md5(file)[0])
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m8.927s
# user 0m7.092s
# sys 0m1.768s
# Files: 10
# Size: 3 GiB
#
# real 7m40.479s
# user 2m7.392s
# sys 0m21.710s
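For reference, a sketch of what a simplified file_md5 could look like with the heuristic removed (an illustration of the proposal, not dvc's current implementation; note that dropping the istextfile/dos2unix step would change checksums for text files that currently get CRLF-normalized):

import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

def file_md5_no_heuristic(path):
    # Hash raw bytes in chunks: no istextfile() sniffing, no CRLF replacement.
    hash_md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(LOCAL_CHUNK_SIZE), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()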
Also,
Extracts from https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images
Do we know DVC's bottlenecks? Is it possible to provide the user with an estimate of how long uploading / downloading is going to take, depending on the number of operations and internet speed?
The problem is related to Google Cloud Storage; it needs more research on why it is taking 5x longer with DVC.
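On the estimate question: a rough back-of-the-envelope model is per-file request overhead plus payload over bandwidth. The numbers below (0.1 s per file, 4 jobs) are made-up placeholders, not measured GCS values:

def estimate_transfer_seconds(n_files, total_bytes, bandwidth_bps,
                              per_file_overhead_s=0.1, jobs=4):
    # Each file pays a fixed request overhead (latency, auth, listing)
    # plus its share of the payload time; overhead is split across jobs.
    overhead = n_files * per_file_overhead_s / jobs
    payload = total_bytes / bandwidth_bps
    return overhead + payload

# 100,000 files x 3 KiB over a 100 Mbit/s link with 4 parallel jobs:
# the per-file overhead dominates the payload time by a wide margin.
print(estimate_transfer_seconds(100_000, 100_000 * 3 * 1024, 100e6 / 8))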
Need to re-test this with all the new performance patches that have come in over the last few weeks and see if there is any improvement.
Yep, that would be great, to see if the GCS issue is fixed.
So, the run with some comments (3 KiB x 100k files):
$ time dvc add data # comp md5s
real 1m33.184s
user 1m11.137s
sys 0m27.597s
# A long delay after pbar done and nothing happens
$ time dvc add data # create unpacked dir
real 1m4.460s
user 0m53.056s
sys 0m10.855s
# A long delay before and especially after pbar
$ time dvc add data # 3rd time, still slow
real 0m37.932s
user 0m31.009s
sys 0m6.715s
# A long delay at start before anything printed out
# All subsequent `dvc add`s take the same time
$ time dvc commit data.dvc
real 0m35.182s # About the same as above
user 0m29.368s
sys 0m5.980s
$ time dvc push # to local
real 2m18.288s
user 1m57.271s
sys 1m1.700s
$ time dvc push # second time, nothing to push
real 0m56.521s
user 0m45.159s
sys 0m7.637s
$ time dvc pull # nothing to pull
real 0m57.129s
user 0m48.626s
sys 0m8.521s
# Checkout took the majority of the time
$ rm -rf .dvc/cache && rm -rf data && time dvc pull
real 4m7.259s
user 3m24.639s
sys 1m30.983s
# at the start of checkout pbar hangs at 0% for a while
$ echo update >> data/update && time dvc add data
real 1m35.354s
user 1m12.278s
sys 0m22.342s
# The time is the same as the initial add
Summary:
So bench runs for dirs:
- 0.6.0 (master): N = 10000, size = 1m; N = 100000, size = 100k
- 0.40.0: N = 10000, size = 1m; N = 100000, size = 100k
- 0.58.1 (before checkout changes): N = 10000, size = 1m; N = 100000, size = 100k
Some takeaways:
I saved all the output with timestamps, so it could be analyzed to find where we have sleeps and slow starts/finishes. Another thing is that this was tested with cache type copy only.
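If useful, the saved output could be scanned for the big gaps with something like this (assuming each line starts with an ISO timestamp, e.g. from piping through ts; the exact log format is an assumption):

import sys
from datetime import datetime

# Print the largest gaps between consecutive timestamped lines, i.e. the
# places where dvc appears to hang with no output (delays before/after pbars).
prev = None
gaps = []
for line in sys.stdin:
    stamp, _, rest = line.partition(" ")
    try:
        now = datetime.fromisoformat(stamp)
    except ValueError:
        continue  # skip lines without a leading timestamp
    if prev is not None:
        gaps.append((now - prev[0], prev[1].rstrip()))
    prev = (now, rest)

for gap, context in sorted(gaps, reverse=True)[:10]:
    print(f"{gap.total_seconds():8.2f}s after: {context}")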
@Suor do you still have your test scripts? It would be interesting to see what's up with it right now, since we've introduced a lot of optimizations in 1.0. Though, probably dvc-bench is enough.
Ok, closing for now as stale. We've introduced lots of push/pull/fetch/status/add optimizations for directories since the ticket was opened.
I guess new benches are also run for old code, so we can compare.
The dvc docs mention in various places that it's discouraged to push/pull zip files, but I have found that transferring a single large compressed directory is almost an order of magnitude faster than the uncompressed directory containing many small files. This is the case even if the compression ratio is pretty much 1, i.e. it's not about the total file size, but rather the overhead of transferring files individually. What is the advised approach in this case? Is storing zip files so bad?
@kazimpal87 If your directory is fairly static, you can use the archive, no problem, but be aware that dvc won't be able to deduplicate different versions of it. If you use plain dirs, dvc will only transfer the changes between versions, so updating the dataset will be faster. We are currently working on a new approach to transferring and storing files and directories that will handle large directories faster (#829), but that is likely to come out after the 2.0 release.
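To illustrate the deduplication point, here is a toy sketch of a content-addressed index (not dvc's actual cache layout): with a plain directory, only files whose hash is new get transferred, while a single zip gets one hash for the whole blob, so any change re-uploads everything:

import hashlib
import os

def file_hashes(directory):
    # Map relative path -> md5 of content (toy content-addressed index).
    hashes = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fobj:
                digest = hashlib.md5(fobj.read()).hexdigest()
            hashes[os.path.relpath(path, directory)] = digest
    return hashes

def to_upload(new_version_dir, hashes_already_in_remote):
    # Only files whose content hash is missing remotely need transferring.
    return {
        path: digest
        for path, digest in file_hashes(new_version_dir).items()
        if digest not in hashes_already_in_remote
    }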
@Suor What do the
Context is here:
https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images