Cannot round-trip a file (read, write, read) in some circumstances #1140

Open
TimG1964 opened this issue Sep 2, 2024 · 5 comments
@TimG1964

TimG1964 commented Sep 2, 2024

Refer to this discussion on the JuliaLang Discourse:

Can you file an issue against CSV.jl on GitHub? There’s probably a bug when the cut point to attribute parts of the file to tasks is in a particular position.
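
To illustrate why a cut point can matter, here is a rough sketch in plain Julia. It is not CSV.jl's actual chunking logic; the helper `inside_quotes` and the sample data are hypothetical. It only shows that a newline chosen as a chunk boundary may sit inside a quoted field, in which case a task starting there would misread the row structure.

```julia
# Hypothetical helper: true when byte offset `pos` lies inside a
# double-quoted field, judged by the parity of quote bytes before it.
# Escaped quotes ("") come in pairs, so they do not change the parity.
function inside_quotes(data::AbstractString, pos::Integer)
    bytes = codeunits(data)
    return isodd(count(==(UInt8('"')), view(bytes, 1:pos-1)))
end

# Made-up sample: the Description of record 1 spans two lines.
data = "id,Description\n1,\"first line\nsecond line\"\n2,plain\n"

# Every newline byte is a candidate cut point for a naive splitter.
newlines = findall(==(UInt8('\n')), codeunits(data))

# Flag the candidates that fall inside a quoted field.
flags = [inside_quotes(data, p) for p in newlines]
```

Here only the second newline is flagged, because it is the one embedded in the quoted Description.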

The error described there is

┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
ERROR: LoadError: TaskFailedException

    nested task error: CSV.Error("thread = 2 fatal error, encountered an invalidly quoted field while parsing around row = 175539, col = 3: \"\"I will undertake a research trip hosted by Michele Bryd-McPhee curator of ‘Ladies of Hip-Hop Festival’ in New York City in March and July 2018 with 3 fundamental areas of enquiry; \n\", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself")
    Stacktrace:
     [1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:590
     [2] parsevalue!(::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:798
     [3] parserow
       @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:640 [inlined]
     [4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, ::Type{Tuple{}})
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:550
     [5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:360
     [6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
       @ CSV C:\Users\TGebbels\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384
Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base .\task.jl:448
  [2] macro expansion
    @ .\task.jl:480 [inlined]
  [3] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:240
  [4] File
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:227 [inlined]
  [5] #File#32
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:223 [inlined]
  [6] CSV.File(source::String)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:162
  [7] read(source::String, sink::Type; copycols::Bool, kwargs::@Kwargs{})
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:117
  [8] read(source::String, sink::Type)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:113
  [9] top-level scope
    @ c:\Users\TGebbels...\Documents\DCMS Database\CompareCsv.jl:361
 [10] include(fname::String)
    @ Base.MainInclude .\client.jl:489
 [11] run(debug_session::VSCodeDebugger.DebugAdapter.DebugSession, error_handler::VSCodeDebugger.var"#3#4"{String})
    @ VSCodeDebugger.DebugAdapter c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\DebugAdapter\src\packagedef.jl:126
 [12] startdebugger()
    @ VSCodeDebugger c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\VSCodeDebugger\src\VSCodeDebugger.jl:45
 [13] top-level scope
    @ c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\debugger\run_debugger.jl:12
 [14] include(mod::Module, _path::String)
    @ Base .\Base.jl:495
 [15] exec_options(opts::Base.JLOptions)
    @ Base .\client.jl:318
 [16] _start()
    @ Base .\client.jl:552
in expression starting at c:\Users\TGebbels\...\Documents\DCMS Database\CompareCsv.jl:361
@nalimilan
Member

@quinnj What's interesting is that the error doesn't happen when passing ntasks=1 to CSV.read.
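
For reference, a minimal sketch of that workaround, assuming CSV.jl and DataFrames.jl are installed. The small temporary file is a made-up stand-in for the real export; only the `ntasks=1` keyword is the point.

```julia
using CSV, DataFrames

# Made-up stand-in for the real file: one quoted field contains a newline.
path = tempname() * ".csv"
write(path, "id,Description\n1,\"line one\nline two\"\n2,plain\n")

# ntasks=1 asks CSV.jl to parse the whole file in a single task,
# so no chunk boundaries are involved.
df = CSV.read(path, DataFrame; ntasks=1)
```

Single-task parsing will be slower on a large file, so this is a way to narrow down the problem rather than a fix.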

@TimG1964
Author

TimG1964 commented Sep 3, 2024

Is this the same as #1139?

@nalimilan
Member

Possibly, but hard to tell without having seen the files and/or identified the root cause.

@TimG1964
Author

TimG1964 commented Sep 3, 2024

Files are public, from the UK Department for Culture, Media and Sport, here, or via an HTTP.get call to https://nationallottery.dcms.gov.uk/api/v1/grants/csv-export/. The file is typically just over 300MB, but growing. Updates are relatively frequent as new grant records are added.
At least one field, Description, is a quoted text field that sometimes contains newlines and can be quite lengthy. Unlike the file in #1139, though, only a small proportion of the 700,000 records contain newlines. This may be why the problem is intermittent and depends on sort order.
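
A small sketch of the read, write, read round trip on data shaped like this, assuming CSV.jl and DataFrames.jl. The column names and values are invented, and a file this small will probably not be split across tasks, so it is unlikely to reproduce the failure. It only shows what a successful round trip looks like.

```julia
using CSV, DataFrames

# Invented data: one Description contains a newline, one contains a comma.
original = DataFrame(
    Identifier = ["A-1", "A-2", "A-3"],
    Description = ["single line", "first line\nsecond line", "commas, inside"],
)

path1 = tempname() * ".csv"
path2 = tempname() * ".csv"

# Write, read, write again, read again.
CSV.write(path1, original)
first_read = CSV.read(path1, DataFrame)
CSV.write(path2, first_read)
second_read = CSV.read(path2, DataFrame)

# The round trip succeeds when both reads hold the same data.
roundtrip_ok = isequal(first_read, second_read)
```

Running the same comparison against the full export, once with default settings and once with `ntasks=1`, would show whether the difference comes from multithreaded parsing.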

@nalimilan
Member

Ah sorry I hadn't noticed that #1139 includes code to generate the file.
