readdlm not working with white spaces #1

Open
cdelv opened this issue Apr 3, 2022 · 7 comments

Comments

@cdelv

cdelv commented Apr 3, 2022

readdlm ignores all whitespace by default when no delimiter is specified. However, when one wants to specify the data type to be read, it seems obligatory to specify the delimiter too:

readdlm(source, delim::AbstractChar, T::Type, eol::AbstractChar; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=false, comment_char='#')

Then, in the following case,

readdlm(file, ' ', Float64, comments=true)

the function doesn't ignore the leading whitespace because the delimiter is ' ', a single space. The program then crashes on input such as

 2 3
1 3

There should be a flag to ignore all consecutive characters that match the delimiter, or a way to specify only the type, like this

readdlm(file, type=Float64, comments=true)

However, this leaves the problem that if the delimiter is not whitespace, the issue persists.

@PallHaraldsson

Well it shouldn't crash... but at least there's a workaround (especially useful for larger files):
https://github.com/JuliaData/CSV.jl

@StefanKarpinski
Member

Can you provide an example file and invocation that exhibits this crash?

@cdelv
Author

cdelv commented Apr 5, 2022

@StefanKarpinski Sure,

The file content is

$ cat test.txt 
 1 2
 3 4

Note that the first character of each line is a space. Using this invocation

using DelimitedFiles

file="test.txt"
data=readdlm(file, ' ', Float64)

I get the following error

at row 1, column 1 : ErrorException("file entry \"\" cannot be converted to Float64")

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] dlm_fill(::DataType, ::Array{Array{Int64,1},1}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:514
 [3] readdlm_string(::String, ::Char, ::Type{T} where T, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:470
 [4] readdlm_auto(::String, ::Char, ::Type{T} where T, ::Char, ::Bool; opts::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:244
 [5] readdlm_auto at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:233 [inlined]
 [6] #readdlm#6 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
 [7] readdlm at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
 [8] #readdlm#2 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86 [inlined]
 [9] readdlm(::String, ::Char, ::Type{T} where T) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86
 [10] top-level scope at In[4]:2
 [11] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091

This invocation produces something I didn't expect

data=readdlm(file, ' ')
2×3 Array{Any,2}:
 ""  1  2
 ""  3  4

And the one that works is

data=readdlm(file)
2×2 Array{Float64,2}:
 1.0  2.0
 3.0  4.0

This is a minimal working example, but it seems to happen in more complicated cases too.
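
For whitespace-separated input with a fixed element type, the DelimitedFiles documentation also lists a method that takes the type without a delimiter, `readdlm(source, T::Type; options...)`, where columns are assumed to be separated by one or more whitespace characters. A minimal sketch, assuming the same two-line file as above (the temporary path is only for illustration):

```julia
using DelimitedFiles

# Recreate the example file; note the leading space on each line.
path = tempname()
write(path, " 1 2\n 3 4\n")

# No delimiter argument: runs of whitespace are treated as one separator,
# while the element type is still fixed to Float64.
data = readdlm(path, Float64)
```

This does not help when the delimiter is a non-whitespace character that may repeat, which is the remaining case raised in the original report.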

@StefanKarpinski
Member

That's not what we would call a "crash", it's an error message indicating that the file doesn't have valid formatting. The error message isn't great (it reads an empty field before the leading space and then cannot convert that to float), but readdlm is also effectively deprecated and the CSV package should be used.
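
The empty first field can be illustrated with `Base.split` as an analogy (this is not readdlm's actual parser, only a sketch of the same effect): splitting on exactly one space keeps the empty string that precedes the leading space, and an empty string has no Float64 interpretation.

```julia
line = " 1 2"

# Splitting on exactly one space keeps the empty field before the leading space.
fields = split(line, ' ')

# An empty string cannot be parsed as a number; tryparse returns nothing.
first_value = tryparse(Float64, fields[1])

# Dropping empty fields leaves only the two numeric entries.
nonempty = split(line, ' '; keepempty=false)
numbers = parse.(Float64, nonempty)
```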

@JeffBezanson
Contributor

I guess it needs an option to skip empty fields?

@ViralBShah transferred this issue from JuliaLang/julia on Apr 8, 2022

@ronisbr
Member

ronisbr commented Apr 14, 2023

One option that would be very helpful is to skip empty fields, as @JeffBezanson said, or, which I think is better, to treat multiple consecutive delimiter chars as one single separator. There are some space index files that have this kind of format:

  1997   8   2450457.0  73.8  75.8  78.7  78.7  74.0  73.0  67.9  68.8  1B11
  1997   9   2450458.0  73.7  75.6  79.9  78.7  74.7  72.9  66.9  68.6  1B11
  1997  10   2450459.0  75.4  75.4  80.3  78.7  76.3  72.8  70.5  68.3  1B11

DelimitedFiles.jl does not seem to be capable of parsing it correctly. However, if there were an option like skip_multiple_delims or something similar, it could! If you accept this proposal, I can submit a PR!
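
The "multiple delimiters as one separator" behavior can be sketched outside DelimitedFiles.jl with `Base.split` and `keepempty=false`. The helper name `read_collapsed_rows` below is hypothetical and is not an existing or proposed API of the package; it only illustrates the idea on the three sample lines quoted above.

```julia
# Hypothetical helper, not part of DelimitedFiles.jl: split each line on `delim`,
# treating consecutive delimiters as one separator by dropping empty fields.
function read_collapsed_rows(io::IO, delim::AbstractChar=' ')
    rows = Vector{Vector{String}}()
    for line in eachline(io)
        fields = split(line, delim; keepempty=false)
        # Skip lines that contain nothing but delimiters.
        isempty(fields) || push!(rows, String.(fields))
    end
    return rows
end

# The three sample lines from the comment above.
sample = join([
    "  1997   8   2450457.0  73.8  75.8  78.7  78.7  74.0  73.0  67.9  68.8  1B11",
    "  1997   9   2450458.0  73.7  75.6  79.9  78.7  74.7  72.9  66.9  68.6  1B11",
    "  1997  10   2450459.0  75.4  75.4  80.3  78.7  76.3  72.8  70.5  68.3  1B11",
], '\n')

rows = read_collapsed_rows(IOBuffer(sample))

# Columns can then be converted individually; the last column stays a string.
first_column = [parse(Int, r[1]) for r in rows]
third_column = [parse(Float64, r[3]) for r in rows]
```

Keeping the rows as strings and converting per column avoids forcing a single element type on a file whose last column (1B11) is not numeric.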

@PallHaraldsson

PallHaraldsson commented Apr 26, 2023

I didn't check, but if this works with CSV.jl and/or the (maybe less known) DLMReader.jl, then maybe it's not worth implementing this here (and instead document both, or at least whichever of them handles this case)? They at least load fast now:

julia> @time using CSV
  0.698591 seconds (691.91 k allocations: 45.800 MiB, 14.11% gc time, 2.74% compilation time)

julia> @time using DLMReader
┌ Warning: Julia started with single thread, to enable multithreaded functionalities in InMemoryDatasets.jl start Julia with multiple threads.
└ @ InMemoryDatasets ~/.julia/packages/InMemoryDatasets/60HVD/src/InMemoryDatasets.jl:205
  1.905116 seconds (2.06 M allocations: 131.414 MiB, 8.09% gc time, 12.93% compilation time: 88% of which was recompilation)


Slightly faster with:

$ julia -t auto

julia> @time using DLMReader
  1.685144 seconds (1.90 M allocations: 120.533 MiB, 7.65% gc time, 1.26% compilation time)

CSV.jl took slightly longer with auto though. Maybe a fluke.
