using CSV became extremly slow #324

Closed
ufechner7 opened this issue Oct 13, 2018 · 7 comments

Comments

ufechner7 commented Oct 13, 2018

With the latest version of CSV, the load time on Julia 1.0.1 and Julia 0.7 has increased roughly tenfold:

  | | |_| | | | (_| |  |  Version 1.0.1 (2018-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time using CSV
 26.338247 seconds (43.67 M allocations: 2.136 GiB, 5.79% gc time)

Same machine:

  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> @time using CSV
  2.881148 seconds (2.54 M allocations: 146.385 MiB, 11.33% gc time)

It used to be faster on 0.7 than on 0.6.

Any idea?

ufechner7 commented Oct 13, 2018

Version 2.5 loads quickly on 0.6 (1.4 s), version 3.1 loads slowly on 0.7 (4 s), and version 4.1 loads extremely slowly (8.8 s). I also had a version from master about two months ago that loaded in 0.8 s.

ufechner7 changed the title from "using CSV become extremly slow" to "using CSV became extremly slow" Oct 13, 2018

@nalimilan
Member

The codebase has changed a lot in the last few months, so it's probably just due to that.

@nalimilan
Member

Actually that's intentional: CSV reads a dummy file in __init__ so that subsequent calls are faster.

CSV.jl/src/CSV.jl

Lines 244 to 250 in 8e6a5bc

function __init__()
    # read a dummy file to "precompile" a bunch of methods; until the `precompile` function
    # can properly store machine code, this is our best option
    CSV.File(joinpath(@__DIR__, "../test/testfiles/test_utf8.csv"), allowmissing=:auto) |> DataFrame
    Threads.resize_nthreads!(VALUE_BUFFERS)
    return
end

I'm not really sure it's a real win. In particular, it means that packages which depend on CSV will pay the compilation price even if they don't call it in a given session.
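
The trade-off described here can be illustrated with a minimal sketch. The two modules below are hypothetical toys (the names `EagerWarmup`, `LazyWarmup`, `WARMUP_RUNS`, and `parse_row` are made up for illustration and are not part of CSV.jl); `parse_row` merely stands in for real parsing work. The eager module does its warm-up in `__init__`, so the cost is paid on load even if nothing is ever parsed, while the lazy module defers compilation to the first real call.

```julia
# Hypothetical toy modules; `parse_row` stands in for real parsing work.
module EagerWarmup
    const WARMUP_RUNS = Ref(0)
    parse_row(line::AbstractString) = split(line, ',')
    function __init__()
        # Runs every time the module is loaded, whether or not the caller
        # ever parses anything in this session.
        parse_row("a,b,c")
        WARMUP_RUNS[] += 1
        return
    end
end

module LazyWarmup
    const WARMUP_RUNS = Ref(0)
    parse_row(line::AbstractString) = split(line, ',')
    # No __init__ here: compilation happens on the first real call instead.
end

println("eager warm-up runs at load: ", EagerWarmup.WARMUP_RUNS[])
println("lazy warm-up runs at load:  ", LazyWarmup.WARMUP_RUNS[])
```

A package that only depends on the eager module still triggers the warm-up when it is loaded, which is the cost for downstream packages raised above.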

@ufechner7
Author

We mainly do flight testing of wind drones. After each flight (they are usually short, about 10 min) we analyse the CSV log file, which is not small because we log about 120 values at 200 Hz. With Julia 0.6, loading a log file and preprocessing it takes about 20 s on my laptop. I will create a more realistic test for loading CSV and loading the data to see how much slower the current version of CSV is compared to the 0.2.5 version that I used with Julia 0.6.

@nalimilan
Member

Then it shouldn't make a difference for you whether compilation happens when loading the package or when reading a file for the first time. In the end the total time should be similar. You can easily check that by removing the call I linked to above.
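
One rough way to see where compilation time lands is to time the first and second call of the same function in a fresh session. The sketch below uses only Base Julia; `parse_numbers` is a hypothetical stand-in for "reading a file for the first time", not a CSV.jl function, and the actual timings will vary by machine and Julia version.

```julia
# Hypothetical workload standing in for the first read of a log file.
parse_numbers(line::AbstractString) = [parse(Float64, s) for s in split(line, ',')]

# One row with 120 comma-separated values, similar in width to the logs described earlier.
line = join(1:120, ',')

first_call  = @elapsed parse_numbers(line)   # typically includes JIT compilation
second_call = @elapsed parse_numbers(line)   # method is already compiled

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

If the warm-up is moved into `__init__`, the time measured for the first call would be expected to show up in `@time using ...` instead, so the sum of load time and first-call time is the number worth comparing.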

@ufechner7
Author

Benchmarking with real data. Total time for loading the needed packages, loading the CSV data, and analysing it:

  • Julia 0.6 + CSV 2.5: 32 s
  • Julia 0.7 + CSV 3.0: 34 s
  • Julia 0.7 + CSV 3.1: 34 s
  • Julia 0.7 + CSV 4.0: ERROR: LoadError: ArgumentError: Package CSV does not have Parsers in its dependencies
  • Julia 0.7 + CSV 4.1: ERROR: LoadError: KeyError: key :elevation not found

So I would also like to benchmark the current CSV version, but it is not compatible with my code. I need to check why.

quinnj commented Oct 15, 2018

The CSV.jl-induced regression is at least fixed on master now.

@quinnj quinnj closed this as completed Oct 15, 2018
bors bot added a commit to FluxML/Zygote.jl that referenced this issue Apr 24, 2020
607: stop running functions in init r=CarloLucibello a=KristofferC

Zygote currently does the same thing as CSV used to do (JuliaData/CSV.jl#324) which is to run some representative functions in `__init__` to make the first call look faster. In reality, this just shifts the latency in the first call to package load time. The problem is that Zygote can be loaded in a Julia session without necessarily getting called. In those scenarios, users have to pay the compilation cost anyway which makes it more costly to have Zygote as a dependency. 



Co-authored-by: Kristoffer <kcarlsson89@gmail.com>