
dvc: performance optimization for directories #1970

Closed
shcheklein opened this issue May 8, 2019 · 18 comments
Labels
p1-important · performance · research

Comments

@shcheklein
Member

Context is here:

https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

Our data set is an OCR data set with more than 100,000 small images; the total size is about 200 MB. Using DVC to track this data set, we encountered the following problems:

It takes a lot of time to add the data set for tracking.
Very slow upload.
Very slow download.
Updating, deleting, or adding just one image in the data set causes DVC to recompute a lot of things: hashes, etc.
@pared
Contributor

pared commented May 8, 2019

Sample script that seems to reproduce the user's problem:

#! /bin/bash

rm -rf storage repo
mkdir storage repo
mkdir repo/data

for i in {1..100000}
do
  echo ${i} >> repo/data/${i}
done 

cd repo

git init 
dvc init

dvc remote add -d storage ../storage

dvc add data
dvc commit data.dvc
git add .gitignore data.dvc

git commit -am "init"
dvc push

dvc unprotect data
echo update  >> data/update
dvc add data

After adding update, the md5 computation for the whole directory is retriggered.

@efiop
Contributor

efiop commented May 8, 2019

@pared when you unprotect, all files inside data are copied, so dvc doesn't have entries for those files in the State db, hence the recomputation.
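
For illustration, here is a minimal sketch of the kind of checksum cache described above, keyed on (inode, mtime, size); the names are hypothetical and this is not dvc's actual State API. Because unprotect copies each file, every copy gets a new inode and mtime, so every lookup misses and the md5 has to be recomputed:

import hashlib
import os

# Hypothetical in-memory stand-in for the State db.
state = {}  # (inode, mtime_ns, size) -> md5 hexdigest

def cached_md5(path):
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    if key in state:
        return state[key]  # cache hit: no file read needed
    with open(path, 'rb') as fobj:  # cache miss: hash the whole file
        md5 = hashlib.md5(fobj.read()).hexdigest()
    state[key] = md5
    return md5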

@ghost ghost changed the title from "performance optimization for directories" to "dvc: performance optimization for directories" May 10, 2019
@ghost ghost added the performance label May 10, 2019
@ghost

ghost commented May 10, 2019

After some research, I think we can remove the binary heuristic in file_md5. Here's how to reproduce the benchmark:

Create 100,000 files with 3 KiB of random content and 10 files with ~3 GiB of random content:

mkdir data

for file in {0..100000}; do
  dd if=/dev/urandom of=data/$file bs=3K count=1
done

mkdir other_data

for file in {0..10}; do
  dd if=/dev/urandom of=other_data/$file bs=100M count=30
done

Results ordered from fastest to slowest operation:

  • md5sum:
time md5sum * > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m1.568s
# user    0m1.047s
# sys     0m0.496s


# Files:  10
# Size:   3 GiB
#
# real    7m15.048s
# user    1m47.444s
# sys     0m23.405s
  • git lfs track "data/**" && git add:
# git lfs track "data/**"
time git add data

# Files: 100,000
# Size:   3 KiB
#
# real    1m23.442s
# user    0m39.889s
# sys     0m30.874s
  • Python's hashlib reading the whole file:
time python -c "
import os
import hashlib

for file in os.listdir():
    with open(file, 'rb') as fobj:
        print(file, hashlib.md5(fobj.read()).hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.581s
# user    0m1.900s
# sys     0m0.643s


# Files:  10
# Size:   3 GiB
#
# real    8m46.469s
# user    1m2.379s
# sys     0m49.283s
  • Python's hashlib reading with chunks:
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            hash.update(data)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.753s
# user    0m1.932s
# sys     0m0.802s


# Files:  10
# Size:   3 GiB
#
# real    7m57.565s
# user    1m53.423s
# sys     0m21.162s
  • Python's hashlib reading with chunks + CRLF:
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.986s
# user    0m2.322s
# sys     0m0.644s

# Files:  10
# Size:   3 GiB
#
# real    7m26.300s
# user    2m34.908s
# sys     0m23.551s
  • Python's hashlib reading with chunks + CRLF + binary optimization:
time python -c "
import os
import hashlib
from dvc.istextfile import istextfile

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()
    binary = not istextfile(file)

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            if binary:
                chunk = data
            else:
                chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m7.610s
# user    0m6.028s
# sys     0m1.528s


# Files:  10
# Size:   3 GiB
#
# real    7m44.754s
# user    1m53.498s
# sys     0m17.882s
  • DVC's file_md5:
time python -c "
import os
from dvc.utils import file_md5

for file in os.listdir():
    print(file, file_md5(file)[0])
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m8.927s
# user    0m7.092s
# sys     0m1.768s

# Files:  10
# Size:   3 GiB
#
# real    7m40.479s
# user    2m7.392s
# sys     0m21.710s

@ghost ghost self-assigned this May 10, 2019
@ghost

ghost commented May 10, 2019

Also, file_md5 returns both the hexdigest and the digest, but we only use the hexdigest across the code base, so we can drop the latter.
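
For reference, a hexdigest-only, chunked file_md5 could look roughly like this (a sketch, not the actual dvc implementation):

import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

def file_md5(path):
    # Read in 1 MiB chunks and return only the hexdigest.
    hash = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)
            if not data:
                break
            hash.update(data)
    return hash.hexdigest()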

@ghost

ghost commented May 16, 2019

Extracts from https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

> @shcheklein, it took about 2 hours for dvc add and dvc push with a 30 Mb upload speed

Do we know DVC's bottlenecks? Is it possible to give the user an estimate of how long uploading/downloading is going to take, depending on the number of operations and the internet speed?

> Hi @shcheklein, if I understood you correctly: we are using Google Cloud Storage as the remote for DVC, with a single bucket. The total number of files exceeds 100,000, the total size on disk is 229 MB, and the average file size is about 1.3 KB. Our upload speed is 30 Mb and the download speed is also 30 Mb. I checked uploading our dataset to a similar Google Cloud Storage bucket without DVC, and it took about 25 minutes.

The problem is related to Google Cloud Storage; it needs more research into why it is taking 5x longer with DVC.
Computing checksums shouldn't take more than a couple of minutes in the worst-case scenario, so there must be another operation hanging the process for such a long time.
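
A rough back-of-envelope (all numbers below are assumptions for illustration, e.g. ~70 ms per request and the quoted "30mb" taken as 30 Mbit/s): raw transfer of ~229 MB should take about a minute, so the ~2 hours is plausibly dominated by per-object request overhead rather than bandwidth or checksum computation.

files = 100_000
total_mb = 229
upload_mbit_s = 30  # assumption: "30mb" means 30 Mbit/s

transfer_s = total_mb * 8 / upload_mbit_s  # ~61 s of raw transfer
per_request_s = 0.07                       # assumed ~70 ms latency per object, serial uploads
overhead_s = files * per_request_s         # ~7000 s, i.e. roughly 2 hours

print(f"raw transfer: {transfer_s:.0f} s, request overhead: {overhead_s / 3600:.1f} h")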

@efiop efiop added the p1-important label Jun 18, 2019
@efiop
Contributor

efiop commented Jul 5, 2019

Need to re-test this with all the new performance patches that have come in over the last few weeks and see if there is any improvement.

@shcheklein
Member Author

Yep, it would be great to see if the GCS issue is fixed.

@efiop efiop unassigned ghost Aug 8, 2019
@efiop efiop added the research label Aug 19, 2019
@Suor Suor self-assigned this Sep 17, 2019
@Suor
Contributor

Suor commented Sep 24, 2019

So, here is the run with some comments:

3 KB × 100k files

$ time dvc add data  # comp md5s
real    1m33.184s
user    1m11.137s
sys 0m27.597s

# A long delay after pbar done and nothing happens


$ time dvc add data  # create unpacked dir
real    1m4.460s
user    0m53.056s
sys 0m10.855s

# A long delay before and especially after pbar


$ time dvc add data  # 3rd time, still slow
real    0m37.932s
user    0m31.009s
sys 0m6.715s

# A long delay at start before anything printed out
# All subsequent `dvc add`s take the same time


$ time dvc commit data.dvc
real    0m35.182s  # About the same as above
user    0m29.368s
sys 0m5.980s

$ time dvc push  # to local
real    2m18.288s                   
user    1m57.271s                                
sys     1m1.700s           

$ time dvc push  # second time, nothing to push
real    0m56.521s                                 
user    0m45.159s                                
sys     0m7.637s

$ time dvc pull  # nothing to pull
real    0m57.129s
user    0m48.626s
sys 0m8.521s

# Checkout took the majority of the time

$ rm -rf .dvc/cache && rm -rf data && time dvc pull
real    4m7.259s
user    3m24.639s
sys     1m30.983s

# at the start of checkout pbar hangs at 0% for a while


$ echo update >> data/update && time dvc add data
real    1m35.354s
user    1m12.278s
sys 0m22.342s

# The time is the same as for the initial add

@Suor
Contributor

Suor commented Sep 24, 2019

Summary:

  • things are bad
  • the fast "directory not changed" check either doesn't work or is insufficient
  • many no-op operations take a long time
  • there are numerous UI failures:
    • hang-ups at the start and/or at the end
    • hang-ups in the middle, e.g. the pbar is done and nothing happens
    • hang-ups on pbar start

@Suor
Contributor

Suor commented Sep 27, 2019

So, benchmark runs for directories:

0.6.0 (master)

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 81.01 | 0.9 | 2.19 | 37.08 |
| add-2 | 9.4 | 2.08 | 1.92 | 3.86 |
| add-3 | 5.83 | 4.14 | 1.68 | 0 |
| commit-noop | 5.73 | 4.23 | 1.49 | 0 |
| checkout-noop | 5.95 | 0.55 | 1.5 | 0.4 |
| checkout-full | 52.17 | 0.57 | 2.35 | 1.42 |
| push | 45.32 | 1.03 | 1.8 | 0.13 |
| push-noop | 46.88 | 0.97 | 2.17 | 0.87 |
| pull-noop | 10.23 | 0.96 | 1.52 | 0.46 |
| pull | 162.65 | 0.5 | 2.19 | 44.28 |
| add-modified | 62.64 | 1.76 | 2.32 | 56.29 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 222.77 | 1.57 | 2.02 | 141.7 |
| add-2 | 75.97 | 20.71 | 2.46 | 39.42 |
| add-3 | 42.7 | 40.43 | 2.27 | |
| commit-noop | 40.55 | 39.06 | 1.49 | |
| checkout-noop | 43.42 | 1.83 | 2.03 | 4.31 |
| checkout-full | 124.17 | 1.62 | 2.72 | 13.22 |
| push | 232.11 | 4.33 | 2.45 | 0.25 |
| push-noop | 145.57 | 4.6 | 1.86 | 0.76 |
| pull-noop | 85.2 | 4.49 | 1.85 | 4.41 |
| pull | 457.27 | 0.46 | 1.95 | 98.57 |
| add-modified | 204.89 | 22.69 | 2.96 | 158 |

0.40.0

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 98.16 | 7.38 | 3.37 | 51.67 |
| add-2 | 6.03 | 3.52 | 2.52 | 0 |
| add-3 | 5.83 | 3.54 | 2.29 | 0 |
| commit-noop | 5.65 | 3.48 | 2.17 | 0 |
| checkout-noop | 3.26 | 1.12 | 2.14 | |
| checkout-full | 49.81 | 46.04 | 3.77 | 0 |
| push | 34.84 | 1.2 | 3.53 | 1.29 |
| push-noop | 45.24 | 1.14 | 2.71 | 40.88 |
| pull-noop | 46.08 | 1.12 | 2.34 | 38.63 |
| pull | 100.2 | 1.08 | 3.47 | 56.4 |
| add-modified | 141.81 | 1.15 | 2.8 | 57.16 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 243.38 | 9.58 | 3.44 | 144.25 |
| add-2 | 35.13 | 31.96 | 3.16 | |
| add-3 | 36.76 | 33.06 | 3.69 | 0 |
| commit-noop | 31.33 | 28.44 | 2.89 | 0 |
| checkout-noop | 5.52 | 2.66 | 2.86 | 0 |
| checkout-full | 98.9 | 88.35 | 10.55 | |
| push | 97.57 | 1.89 | 9.83 | 13.7 |
| push-noop | 135.25 | 1.61 | 2.78 | 116.21 |
| pull-noop | 131.7 | 1.56 | 3.26 | 82.1 |
| pull | 198.3 | 1.28 | 6.75 | 109.82 |
| add-modified | 365.22 | 4.48 | 3.2 | 157.99 |

0.58.1 (before checkout changes)

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 81.99 | 1.23 | 3.56 | 36.41 |
| add-2 | 9.81 | 2.91 | 2.63 | 3.13 |
| add-3 | 6.96 | 4.21 | 2.74 | 0 |
| commit-noop | 6.67 | 4.16 | 2.5 | |
| checkout-noop | 3.88 | 1.16 | 2.52 | 0.19 |
| checkout-full | 52.8 | 1.11 | 3.82 | 1.31 |
| push | 46.7 | 1.67 | 3.22 | 0.14 |
| push-noop | 47.87 | 1.55 | 3.36 | 0.77 |
| pull-noop | 7.78 | 1.45 | 2.55 | 1.79 |
| pull | 155.81 | 1.14 | 3.75 | 43.44 |
| add-modified | 63.79 | 2.47 | 3.85 | 55.4 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 220.51 | 1.71 | 3.33 | 137.64 |
| add-2 | 67.87 | 21.03 | 2.71 | 32.8 |
| add-3 | 40.55 | 37.8 | 2.75 | 0 |
| commit-noop | 37.08 | 34.04 | 3.04 | 0 |
| checkout-noop | 7.27 | 2.57 | 2.49 | 2.21 |
| checkout-full | 124.94 | 2.32 | 4.32 | 13.85 |
| push | 223.59 | 5.14 | 3.54 | 0.27 |
| push-noop | 147.53 | 5.01 | 3.54 | 0.58 |
| pull-noop | 48.35 | 4.92 | 2.86 | 18.59 |
| pull | 440.08 | 1.15 | 3.21 | 96.96 |
| add-modified | 201.93 | 21.99 | 3.57 | 154.42 |

@Suor
Contributor

Suor commented Sep 27, 2019

And totals only in bar charts:

N = 10000, size = 1m

[bar chart: bench_dir N10k s1m]

N = 100000, size = 100k

[bar chart: bench-dir-N100k-s100k]

@Suor
Contributor

Suor commented Sep 27, 2019

Some takeaways:

  • the checkout change slows things down significantly
  • pull/push degraded significantly over time (probably with the switch from listings to batch exists; this is a local remote, though, so take it with a grain of salt)
  • multithreaded md5s help less than one might expect (see the sketch below)

I saved all the output with timestamps, so it can be analyzed later where we have sleeps and slow ins and outs.

Another thing: this was tested with cache type copy only.
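
On the multithreaded md5 point, here is a minimal sketch of what thread-pool hashing looks like (not dvc's code). hashlib releases the GIL for large buffers, but with tens of thousands of tiny files the per-file open/stat overhead dominates, which may be part of why threads help less than one would expect here:

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def md5sum(path, chunk=1024 * 1024):
    # Chunked md5 of a single file.
    h = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            data = fobj.read(chunk)
            if not data:
                break
            h.update(data)
    return path, h.hexdigest()

def hash_dir(directory, workers=8):
    # Hash all files in a directory with a thread pool.
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5sum, paths))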

@efiop
Contributor

efiop commented Aug 25, 2020

@Suor do you still have your test scripts? It would be interesting to see where things stand right now, since we've introduced a lot of optimizations in 1.0. Though dvc-bench is probably enough.

@efiop
Contributor

efiop commented Aug 25, 2020

Ok, closing for now as stale. We've introduced lots of push/pull/fetch/status/add optimizations for directories since the ticket was opened.

@efiop efiop closed this as completed Aug 25, 2020
@Suor
Contributor

Suor commented Aug 26, 2020

I guess the new benchmarks are also run against the old code, so we can compare.

@kazimpal87

The dvc docs mention in various places that it's discouraged to push/pull zip files, but I have found that transferring a single large compressed directory is almost an order of magnitude faster than the uncompressed directory containing many small files. This is the case even when the compression ratio is pretty much 1, i.e. it's not about the total file size, but rather the overhead of transferring files individually. What is the advised approach in this case? Is storing zip files so bad?

@efiop
Contributor

efiop commented Jan 13, 2021

@kazimpal87 If your directory is fairly static, you can use an archive, no problem, but be aware that dvc won't be able to deduplicate different versions of it. If you use plain dirs, dvc will only transfer the changes between versions, so updating the dataset will be faster. We are currently working on a new approach to transferring and storing files and directories that will handle large directories faster (#829), but that is likely to come out after the 2.0 release.
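
As a toy illustration of the deduplication point (not dvc's actual code): with content-addressed storage, only objects whose hash the remote doesn't already have need to be transferred, so changing one file in a plain directory means pushing one small object, while changing one file inside a zip changes the hash of the entire archive and forces a full re-upload.

import hashlib
import os

def objects(directory):
    # Map content hash -> file path, the way a content-addressed cache keys its data.
    out = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, 'rb') as fobj:
            out[hashlib.md5(fobj.read()).hexdigest()] = path
    return out

def to_upload(local_dir, remote_hashes):
    # Only objects missing from the remote need to be pushed.
    return {h: p for h, p in objects(local_dir).items() if h not in remote_hashes}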

@diggerdu

diggerdu commented Dec 2, 2021

@Suor What do the add-2 and add-3 in the table refer to?
