
dvc: performance optimization for directories #1970

Closed
shcheklein opened this issue May 8, 2019 · 18 comments
Labels
p1-important · performance · research

Comments

@shcheklein
Member

Context is here:

https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

Our data set is an OCR data set with more than 100,000 small images; the total size is about 200 MB. Using DVC to track this data set, we encountered the following problems:

It takes a lot of time to add the data set for tracking.
Very slow upload.
Very slow download.
Updating, deleting, or adding just one image in the data set causes DVC to recompute a lot of things: hashes, etc.
@pared
Contributor

pared commented May 8, 2019

Sample script that seems to reproduce the user's problem:

#! /bin/bash

rm -rf storage repo
mkdir storage repo
mkdir repo/data

for i in {1..100000}
do
  echo ${i} >> repo/data/${i}
done 

cd repo

git init 
dvc init

dvc remote add -d storage ../storage

dvc add data
dvc commit data.dvc
git add .gitignore data.dvc

git commit -am "init"
dvc push

dvc unprotect data
echo update  >> data/update
dvc add data

After adding update, the md5 computation for the whole directory is retriggered.

@efiop
Contributor

efiop commented May 8, 2019

@pared when you unprotect, all files inside data are copied, so dvc doesn't have entries for those files in the State db, hence the recomputation.
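
For illustration, here is a minimal sketch of the kind of checksum cache described above, keyed on (inode, mtime, size); the names are hypothetical and this is not dvc's actual State API. Because unprotect copies each file, every copy gets a new inode and mtime, so every lookup misses and the md5 has to be recomputed:

import hashlib
import os

# Hypothetical in-memory stand-in for the State db.
state = {}  # (inode, mtime_ns, size) -> md5 hexdigest

def cached_md5(path):
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    if key in state:
        return state[key]  # cache hit: no file read needed
    with open(path, 'rb') as fobj:  # cache miss: hash the whole file
        md5 = hashlib.md5(fobj.read()).hexdigest()
    state[key] = md5
    return md5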

@ghost ghost changed the title from "performance optimization for directories" to "dvc: performance optimization for directories" May 10, 2019
@ghost ghost added the performance label May 10, 2019
@ghost

ghost commented May 10, 2019

After some research, I think we can remove the binary heuristic in file_md5. Here's how to reproduce the benchmark:

Create 100,000 files with 3 KiB of random content and 10 files with ~3 GiB of random content:

mkdir data

for file in {0..100000}; do
  dd if=/dev/urandom of=data/$file bs=3K count=1
done

mkdir other_data

for file in {0..10}; do
  dd if=/dev/urandom of=other_data/$file bs=100M count=30
done

Results ordered from fastest to slowest operation:

  • md5sum:
time md5sum * > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m1.568s
# user    0m1.047s
# sys     0m0.496s


# Files:  10
# Size:   3 GiB
#
# real    7m15.048s
# user    1m47.444s
# sys     0m23.405s
  • git lfs track "data/**" && git add:
# git lfs track "data/**"
time git add data

# Files: 100,000
# Size:   3 KiB
#
# real    1m23.442s
# user    0m39.889s
# sys     0m30.874s
  • Python's hashlib reading the whole file:
time python -c "
import os
import hashlib

for file in os.listdir():
    with open(file, 'rb') as fobj:
        print(file, hashlib.md5(fobj.read()).hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.581s
# user    0m1.900s
# sys     0m0.643s


# Files:  10
# Size:   3 GiB
#
# real    8m46.469s
# user    1m2.379s
# sys     0m49.283s
  • Python's hashlib reading with chunks:
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            hash.update(data)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.753s
# user    0m1.932s
# sys     0m0.802s


# Files:  10
# Size:   3 GiB
#
# real    7m57.565s
# user    1m53.423s
# sys     0m21.162s
  • Python's hashlib reading with chunks + CRLF:
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.986s
# user    0m2.322s
# sys     0m0.644s

# Files:  10
# Size:   3 GiB
#
# real    7m26.300s
# user    2m34.908s
# sys     0m23.551s
  • Python's hashlib reading with chunks + CRLF + binary optimization:
time python -c "
import os
import hashlib
from dvc.istextfile import istextfile

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()
    binary = not istextfile(file)

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            if binary:
                chunk = data
            else:
                chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m7.610s
# user    0m6.028s
# sys     0m1.528s


# Files:  10
# Size:   3 GiB
#
# real    7m44.754s
# user    1m53.498s
# sys     0m17.882s
  • DVC's file_md5:
time python -c "
import os
from dvc.utils import file_md5

for file in os.listdir():
    print(file, file_md5(file)[0])
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m8.927s
# user    0m7.092s
# sys     0m1.768s

# Files:  10
# Size:   3 GiB
#
# real    7m40.479s
# user    2m7.392s
# sys     0m21.710s

@ghost ghost self-assigned this May 10, 2019
@ghost

ghost commented May 10, 2019

Also, file_md5 returns both the hexdigest and the digest, but we only use the hexdigest across the code base, so we can drop the latter.
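
For reference, a hexdigest-only, chunked file_md5 could look roughly like this (a sketch, not the actual dvc implementation):

import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

def file_md5(path):
    # Read in 1 MiB chunks and return only the hexdigest.
    hash = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)
            if not data:
                break
            hash.update(data)
    return hash.hexdigest()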

@ghost

ghost commented May 16, 2019

Extracts from https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

> @shcheklein, it took about 2 hours for dvc add and dvc push with a 30 Mb upload speed

Do we know DVC's bottlenecks? Is it possible to give the user an estimate of how long uploading/downloading is going to take, depending on the number of operations and the internet speed?

> Hi @shcheklein, if I understood you correctly: we are using Google Cloud Storage as the remote for DVC, with a single bucket. The total number of files exceeds 100,000, the total size on disk is 229 MB, and the average file size is about 1.3 KB. Our upload speed is 30 Mb and the download speed is also 30 Mb. I checked uploading our dataset to a similar Google Cloud Storage bucket without DVC, and it took about 25 minutes.

The problem is related to Google Cloud Storage; it needs more research into why it is taking 5x longer with DVC.
Computing checksums shouldn't take more than a couple of minutes in the worst-case scenario, so there must be another operation hanging the process for such a long time.
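
A rough back-of-envelope (all numbers below are assumptions for illustration, e.g. ~70 ms per request and the quoted "30mb" taken as 30 Mbit/s): raw transfer of ~229 MB should take about a minute, so the ~2 hours is plausibly dominated by per-object request overhead rather than bandwidth or checksum computation.

files = 100_000
total_mb = 229
upload_mbit_s = 30  # assumption: "30mb" means 30 Mbit/s

transfer_s = total_mb * 8 / upload_mbit_s  # ~61 s of raw transfer
per_request_s = 0.07                       # assumed ~70 ms latency per object, serial uploads
overhead_s = files * per_request_s         # ~7000 s, i.e. roughly 2 hours

print(f"raw transfer: {transfer_s:.0f} s, request overhead: {overhead_s / 3600:.1f} h")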

@efiop efiop added the p1-important label Jun 18, 2019
@efiop
Contributor

efiop commented Jul 5, 2019

Need to re-test this with all the new performance patches that have come in over the last few weeks and see if there is any improvement.

@shcheklein
Member Author

Yep, it would be great to see if the GCS issue is fixed.

@efiop efiop unassigned ghost Aug 8, 2019
@efiop efiop added the research label Aug 19, 2019
@Suor Suor self-assigned this Sep 17, 2019
@Suor
Contributor

Suor commented Sep 24, 2019

So, here is the run with some comments:

3 KB × 100k files

$ time dvc add data  # comp md5s
real    1m33.184s
user    1m11.137s
sys 0m27.597s

# A long delay after pbar done and nothing happens


$ time dvc add data  # create unpacked dir
real    1m4.460s
user    0m53.056s
sys 0m10.855s

# A long delay before and especially after pbar


$ time dvc add data  # 3rd time, still slow
real    0m37.932s
user    0m31.009s
sys 0m6.715s

# A long delay at start before anything printed out
# All subsequent `dvc add`s take the same time


$ time dvc commit data.dvc
real    0m35.182s  # About the same as above
user    0m29.368s
sys 0m5.980s

$ time dvc push  # to local
real    2m18.288s                   
user    1m57.271s                                
sys     1m1.700s           

$ time dvc push  # second time, nothing to push
real    0m56.521s                                 
user    0m45.159s                                
sys     0m7.637s

$ time dvc pull  # nothing to pull
real    0m57.129s
user    0m48.626s
sys 0m8.521s

# Checkout took the majority of the time

$ rm -rf .dvc/cache && rm -rf data && time dvc pull
real    4m7.259s
user    3m24.639s
sys     1m30.983s

# at the start of checkout pbar hangs at 0% for a while


$ echo update >> data/update && time dvc add data
real    1m35.354s
user    1m12.278s
sys 0m22.342s

# The time is the same as for the initial add

@Suor
Contributor

Suor commented Sep 24, 2019

Summary:

  • things are bad
  • the fast "directory not changed" check either doesn't work or is insufficient
  • many no-op operations take a long time
  • there are numerous UI failures:
    • hang-ups at the start and/or at the end
    • hang-ups in the middle, e.g. the pbar is done and nothing happens
    • hang-ups on pbar start

@Suor
Contributor

Suor commented Sep 27, 2019

So, benchmark runs for directories:

0.6.0 (master)

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 81.01 | 0.9 | 2.19 | 37.08 |
| add-2 | 9.4 | 2.08 | 1.92 | 3.86 |
| add-3 | 5.83 | 4.14 | 1.68 | 0 |
| commit-noop | 5.73 | 4.23 | 1.49 | 0 |
| checkout-noop | 5.95 | 0.55 | 1.5 | 0.4 |
| checkout-full | 52.17 | 0.57 | 2.35 | 1.42 |
| push | 45.32 | 1.03 | 1.8 | 0.13 |
| push-noop | 46.88 | 0.97 | 2.17 | 0.87 |
| pull-noop | 10.23 | 0.96 | 1.52 | 0.46 |
| pull | 162.65 | 0.5 | 2.19 | 44.28 |
| add-modified | 62.64 | 1.76 | 2.32 | 56.29 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 222.77 | 1.57 | 2.02 | 141.7 |
| add-2 | 75.97 | 20.71 | 2.46 | 39.42 |
| add-3 | 42.7 | 40.43 | 2.27 | |
| commit-noop | 40.55 | 39.06 | 1.49 | |
| checkout-noop | 43.42 | 1.83 | 2.03 | 4.31 |
| checkout-full | 124.17 | 1.62 | 2.72 | 13.22 |
| push | 232.11 | 4.33 | 2.45 | 0.25 |
| push-noop | 145.57 | 4.6 | 1.86 | 0.76 |
| pull-noop | 85.2 | 4.49 | 1.85 | 4.41 |
| pull | 457.27 | 0.46 | 1.95 | 98.57 |
| add-modified | 204.89 | 22.69 | 2.96 | 158 |

0.40.0

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 98.16 | 7.38 | 3.37 | 51.67 |
| add-2 | 6.03 | 3.52 | 2.52 | 0 |
| add-3 | 5.83 | 3.54 | 2.29 | 0 |
| commit-noop | 5.65 | 3.48 | 2.17 | 0 |
| checkout-noop | 3.26 | 1.12 | 2.14 | |
| checkout-full | 49.81 | 46.04 | 3.77 | 0 |
| push | 34.84 | 1.2 | 3.53 | 1.29 |
| push-noop | 45.24 | 1.14 | 2.71 | 40.88 |
| pull-noop | 46.08 | 1.12 | 2.34 | 38.63 |
| pull | 100.2 | 1.08 | 3.47 | 56.4 |
| add-modified | 141.81 | 1.15 | 2.8 | 57.16 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 243.38 | 9.58 | 3.44 | 144.25 |
| add-2 | 35.13 | 31.96 | 3.16 | |
| add-3 | 36.76 | 33.06 | 3.69 | 0 |
| commit-noop | 31.33 | 28.44 | 2.89 | 0 |
| checkout-noop | 5.52 | 2.66 | 2.86 | 0 |
| checkout-full | 98.9 | 88.35 | 10.55 | |
| push | 97.57 | 1.89 | 9.83 | 13.7 |
| push-noop | 135.25 | 1.61 | 2.78 | 116.21 |
| pull-noop | 131.7 | 1.56 | 3.26 | 82.1 |
| pull | 198.3 | 1.28 | 6.75 | 109.82 |
| add-modified | 365.22 | 4.48 | 3.2 | 157.99 |

0.58.1 (before checkout changes)

N = 10000, size = 1m

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 81.99 | 1.23 | 3.56 | 36.41 |
| add-2 | 9.81 | 2.91 | 2.63 | 3.13 |
| add-3 | 6.96 | 4.21 | 2.74 | 0 |
| commit-noop | 6.67 | 4.16 | 2.5 | |
| checkout-noop | 3.88 | 1.16 | 2.52 | 0.19 |
| checkout-full | 52.8 | 1.11 | 3.82 | 1.31 |
| push | 46.7 | 1.67 | 3.22 | 0.14 |
| push-noop | 47.87 | 1.55 | 3.36 | 0.77 |
| pull-noop | 7.78 | 1.45 | 2.55 | 1.79 |
| pull | 155.81 | 1.14 | 3.75 | 43.44 |
| add-modified | 63.79 | 2.47 | 3.85 | 55.4 |

N = 100000, size = 100k

| op | total | in | out | sleep |
| --- | --- | --- | --- | --- |
| add | 220.51 | 1.71 | 3.33 | 137.64 |
| add-2 | 67.87 | 21.03 | 2.71 | 32.8 |
| add-3 | 40.55 | 37.8 | 2.75 | 0 |
| commit-noop | 37.08 | 34.04 | 3.04 | 0 |
| checkout-noop | 7.27 | 2.57 | 2.49 | 2.21 |
| checkout-full | 124.94 | 2.32 | 4.32 | 13.85 |
| push | 223.59 | 5.14 | 3.54 | 0.27 |
| push-noop | 147.53 | 5.01 | 3.54 | 0.58 |
| pull-noop | 48.35 | 4.92 | 2.86 | 18.59 |
| pull | 440.08 | 1.15 | 3.21 | 96.96 |
| add-modified | 201.93 | 21.99 | 3.57 | 154.42 |

@Suor
Contributor

Suor commented Sep 27, 2019

And totals only in bar charts:

N = 10000, size = 1m

[bar chart: bench_dir N10k s1m]

N = 100000, size = 100k

[bar chart: bench-dir-N100k-s100k]

@Suor
Contributor

Suor commented Sep 27, 2019

Some takeaways:

  • the checkout change slows things down significantly
  • pull/push degraded significantly over time (probably with the switch from listings to batch exists; this is a local remote, though, so take it with a grain of salt)
  • multithreaded md5s help less than one might expect (see the sketch below)

I saved all the output with timestamps, so it can be analyzed later where we have sleeps and slow ins and outs.

Another thing: this was tested with cache type copy only.
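
On the multithreaded md5 point, here is a minimal sketch of what thread-pool hashing looks like (not dvc's code). hashlib releases the GIL for large buffers, but with tens of thousands of tiny files the per-file open/stat overhead dominates, which may be part of why threads help less than one would expect here:

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def md5sum(path, chunk=1024 * 1024):
    # Chunked md5 of a single file.
    h = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            data = fobj.read(chunk)
            if not data:
                break
            h.update(data)
    return path, h.hexdigest()

def hash_dir(directory, workers=8):
    # Hash all files in a directory with a thread pool.
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5sum, paths))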

@efiop
Contributor

efiop commented Aug 25, 2020

@Suor do you still have your test scripts? It would be interesting to see where things stand right now, since we've introduced a lot of optimizations in 1.0. Though dvc-bench is probably enough.

@efiop
Contributor

efiop commented Aug 25, 2020

Ok, closing for now as stale. We've introduced lots of push/pull/fetch/status/add optimizations for directories since the ticket was opened.

@efiop efiop closed this as completed Aug 25, 2020
@Suor
Contributor

Suor commented Aug 26, 2020

I guess the new benchmarks are also run against the old code, so we can compare.

@kazimpal87

The dvc docs mention in various places that it's discouraged to push/pull zip files, but I have found that transferring a single large compressed directory is almost an order of magnitude faster than the uncompressed directory containing many small files. This is the case even when the compression ratio is pretty much 1, i.e. it's not about the total file size, but rather the overhead of transferring files individually. What is the advised approach in this case? Is storing zip files so bad?

@efiop
Contributor

efiop commented Jan 13, 2021

@kazimpal87 If your directory is fairly static, you can use an archive, no problem, but be aware that dvc won't be able to deduplicate different versions of it. If you use plain dirs, dvc will only transfer the changes between versions, so updating the dataset will be faster. We are currently working on a new approach to transferring and storing files and directories that will handle large directories faster (#829), but that is likely to come out after the 2.0 release.
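
As a toy illustration of the deduplication point (not dvc's actual code): with content-addressed storage, only objects whose hash the remote doesn't already have need to be transferred, so changing one file in a plain directory means pushing one small object, while changing one file inside a zip changes the hash of the entire archive and forces a full re-upload.

import hashlib
import os

def objects(directory):
    # Map content hash -> file path, the way a content-addressed cache keys its data.
    out = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        with open(path, 'rb') as fobj:
            out[hashlib.md5(fobj.read()).hexdigest()] = path
    return out

def to_upload(local_dir, remote_hashes):
    # Only objects missing from the remote need to be pushed.
    return {h: p for h, p in objects(local_dir).items() if h not in remote_hashes}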

@diggerdu

diggerdu commented Dec 2, 2021

@Suor What do the add-2 and add-3 in the table refer to?
