using CSV became extremely slow #324

ufechner7 · 2018-10-13T11:32:19Z

With the latest version of CSV, the load time on Julia 1.0.1 and Julia 0.7 has increased about tenfold:

  | | |_| | | | (_| |  |  Version 1.0.1 (2018-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time using CSV
 26.338247 seconds (43.67 M allocations: 2.136 GiB, 5.79% gc time)

Same machine:

  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> @time using CSV
  2.881148 seconds (2.54 M allocations: 146.385 MiB, 11.33% gc time)

It used to be faster on 0.7 than on 0.6.

Any idea?


ufechner7 · 2018-10-13T12:20:14Z

Version 2.5 loads fast on 0.6 (1.4 s), version 3.1 loads slowly on 0.7 (4 s), and version 4.1 loads extremely slowly (8.8 s). I also had a version from master about two months ago that loaded in 0.8 s.

nalimilan · 2018-10-13T16:09:45Z

The codebase has changed a lot in the last few months, so it's probably just due to that.

nalimilan · 2018-10-13T18:21:28Z

Actually that's intentional: CSV reads a dummy file in __init__ so that subsequent calls are faster.

CSV.jl/src/CSV.jl

Lines 244 to 250 in 8e6a5bc

    
    function __init__()
        # read a dummy file to "precompile" a bunch of methods; until the `precompile` function
        # can properly store machine code, this is our best option
        CSV.File(joinpath(@__DIR__, "../test/testfiles/test_utf8.csv"), allowmissing=:auto) |> DataFrame
        Threads.resize_nthreads!(VALUE_BUFFERS)
        return
    end

I'm not really sure it's a real win. In particular, it means that packages which depend on CSV will pay the compilation price even if they don't call it in a given session.
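To illustrate the trade-off, here is a minimal sketch that contrasts a warm-up run from `__init__` with an opt-in warm-up function. The module names `EagerDemo` and `LazyDemo`, the `work` function, and the `WARMED_UP` flag are hypothetical stand-ins for illustration only; none of them are part of CSV.jl.

```julia
# Eager variant: the warm-up runs in __init__, so every load of the module
# pays for it, whether or not `work` is ever called afterwards.
module EagerDemo
    const WARMED_UP = Ref(false)
    work(xs) = sum(abs2, xs)   # hypothetical stand-in for "read a file"
    function __init__()
        work(rand(10))         # forces compilation of work for Vector{Float64}
        WARMED_UP[] = true
        return
    end
end

# Opt-in variant: loading the module does no extra work; callers that want
# a fast first call can request the warm-up explicitly.
module LazyDemo
    const WARMED_UP = Ref(false)
    work(xs) = sum(abs2, xs)
    function warmup()
        work(rand(10))
        WARMED_UP[] = true
        return
    end
end

println("EagerDemo warmed up at load: ", EagerDemo.WARMED_UP[])
println("LazyDemo warmed up at load:  ", LazyDemo.WARMED_UP[])
```

With the opt-in variant, a downstream package that merely depends on the module pays nothing at load time, and a script that does want the warm-up can call `LazyDemo.warmup()` itself.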

ufechner7 · 2018-10-13T22:03:56Z

We mainly do flight testing of wind drones. After each flight (they are usually short, about 10 min) we analyse the CSV log file, which is not small because we log about 120 values at 200 Hz. With Julia 0.6, loading a log file and preprocessing it takes about 20 s on my laptop. I will create a more realistic test that loads CSV and then the data, to see how much slower the current version of CSV is compared to the 0.2.5 version that I used with Julia 0.6.

nalimilan · 2018-10-14T11:17:53Z

Then it shouldn't make a difference for you whether compilation happens when loading the package or when reading a file for the first time. In the end the total time should be similar. You can easily check that by removing the call I linked to above.
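One way to see where the compilation cost lands is to time the first and second call of a function separately. The sketch below uses a hypothetical `work` function in place of an actual CSV read, so it only illustrates the measurement approach and says nothing about CSV.jl's real numbers.

```julia
# Hypothetical workload standing in for "read and process a file".
work(xs) = sum(x -> x^2 + 1, xs)

data = rand(1_000)

# The first call includes JIT compilation of `work` for this argument type.
t_first = @elapsed work(data)

# The second call reuses the compiled code, so it measures run time only.
t_second = @elapsed work(data)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Applied to the case discussed here, the same pattern would be to time `using CSV` and the first file read separately and compare their sum with and without the `__init__` warm-up.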

ufechner7 · 2018-10-14T18:22:22Z

Doing benchmarking with real data, total time for loading the needed packages, loading the CSV data and analysing it:

julia 0.6 + CSV 2.5 32s
julia 0.7 + CSV 3.0 34s
julia 0.7 + CSV 3.1 34s
julia 0.7 + CSV 4.0 ERROR: LoadError: ArgumentError: Package CSV does not have Parsers in its dependencies
julia 0.7 + CSV 4.1 ERROR: LoadError: KeyError: key :elevation not found

So I would also like to benchmark the current CSV version, but it is not compatible with my code. Need to check why.

quinnj · 2018-10-15T23:31:36Z

The CSV.jl-induced regression is at least fixed on master now.

607: stop running functions in init r=CarloLucibello a=KristofferC

Zygote currently does the same thing as CSV used to do (JuliaData/CSV.jl#324) which is to run some representative functions in `__init__` to make the first call look faster. In reality, this just shifts the latency in the first call to package load time. The problem is that Zygote can be loaded in a Julia session without necessarily getting called. In those scenarios, users have to pay the compilation cost anyway which makes it more costly to have Zygote as a dependency.

Co-authored-by: Kristoffer <kcarlsson89@gmail.com>

ufechner7 changed the title ~~using CSV become extremly slow~~ using CSV became extremly slow Oct 13, 2018

nalimilan mentioned this issue Oct 15, 2018

don't force compiling on init #325

Merged

quinnj closed this as completed Oct 15, 2018

KristofferC mentioned this issue Apr 21, 2020

stop running functions in init FluxML/Zygote.jl#607

Merged
