Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WeakRefString should default to off #132

Closed
omus opened this issue Dec 7, 2017 · 29 comments
Closed

WeakRefString should default to off #132

omus opened this issue Dec 7, 2017 · 29 comments

Comments

@omus
Copy link
Member

omus commented Dec 7, 2017

I think that CSV should default to using a regular String rather than using WeakRefString as there are still lots of cases where users can be surprised. For example:

julia> using CSV

julia> io = IOBuffer("name,position,value\nabc,1,2.0\ndef,4,16.0")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=40, maxsize=Inf, ptr=1, mark=-1)

julia> data = CSV.read(io, categorical=false)
2×3 DataFrames.DataFrame
│ Row │ name │ position │ value │
├─────┼──────┼──────────┼───────┤
│ 1   │ abc  │ 12.0   │
│ 2   │ def  │ 416.0  │

julia> data = data[data[:position] .> 1, :]
1×3 DataFrames.DataFrame
│ Row │ name │ position │ value │
├─────┼──────┼──────────┼───────┤
│ 1   │ def  │ 416.0  │

julia> gc()

julia> x = [randstring(3) for i in 1:10000];

julia> data
1×3 DataFrames.DataFrame
│ Row │ name     │ position │ value │
├─────┼──────────┼──────────┼───────┤
│ 1\x03\0\0416.0

These kind of errors can lead to very subtle problems which (for example a failing DataFrame join) which can be very hard to track down. I think we should make using WeakRefString an default to off and allow users to specify the option if they want it.

@spurll
Copy link
Contributor

spurll commented Dec 7, 2017

I'm sure this is due to DataFrames' getindex not copying over the reference to the original memory chunk. (There was a similar issue with DataTables/NullableArrays.) At the very least, there should be a simple way to opt out of WeakRefStrings (there used to be, but it seems to have disappeared). But ultimately I would also prefer if it were not "on" by default.

@quinnj
Copy link
Member

quinnj commented Dec 7, 2017

CSV.read(file; weakrefstrings=false)

@spurll
Copy link
Contributor

spurll commented Dec 7, 2017

Perhaps I'm missing something, but it looks like the weakrefstrings kwarg has disappeared on CSV.jl 0.2. It's no longer in the docstring for CSV.read, and it doesn't appear to work:

julia> CSV.read(IOBuffer("a,b,c\nthis,that"); weakrefstrings=false)
ERROR: MethodError: no method matching CSV.Source(::Base.AbstractIOBuffer{Array{UInt8,1}}; weakrefstrings=false)
Closest candidates are:
  CSV.Source(::Union{AbstractString, IO}; delim, quotechar, escapechar, null, header, datarow, types, nullable, dateformat, decimal, truestring, falsestring, categorical, footerskip, rows_for_type_detect, rows, use_mmap) at /Users/gem/.julia/v0.6/CSV/src/Source.jl:24 got unsupported keyword argument "weakrefstrings"

@nalimilan
Copy link
Member

I think the bug you highlight no longer happens with JuliaData/WeakRefStrings.jl#17. Can you confirm?

But I agree we should fix these issues very soon, they are really problematic and will make people suspicious about WeakRefString.

@omus
Copy link
Member Author

omus commented Dec 8, 2017

I think the bug you highlight no longer happens with JuliaData/WeakRefStrings.jl#17. Can you confirm?

I can confirm that the bug no longer occurs with that PR.

@omus
Copy link
Member Author

omus commented Dec 8, 2017

WeakRefString should default to off

To be clear on what I mean here is I think WeakRefString should be used internally when processing the CSV data but when returning the final result to the end user I think we should default to returning String. Having users opt-in to using WeakRefString means they are making an informed decision.

@spurll
Copy link
Contributor

spurll commented Dec 8, 2017

I concur; weak references need to be treated carefully, and the end user might not be expecting them to pop up here (there is also added uncertainty because we can sometimes end up with categoricals instead).

@randyzwitch
Copy link

What can we do to move this forward? This issue is holding up RDatasets, which is holding up several other packages from updating to DataFrames 0.11.

@quinnj
Copy link
Member

quinnj commented Jan 2, 2018

Planning on diving in this week.

@randyzwitch
Copy link

Thanks...if you need any help testing or whatever, feel free to ping me

@quinnj
Copy link
Member

quinnj commented Jan 3, 2018

Unfortunately running into some 0.7 weird IR issues in the string parrsing codepaths, so that slowed me down, but the issue's been reported, so hopefully I'll have more to post soon.

@quinnj
Copy link
Member

quinnj commented Jan 6, 2018

#143 fixes all known compat issues w/ 0.6/0.7. A new tag for WeakRefStrings was just merged which included a few changes to make WeakRefStrings even safe. In fact, users should never even see a WeakRefString themselves, since indexing a WeakRefStringArray returns a String. Basically, you have to reach into the WeakRefStrings internals in order to get direct access to a WeakRefString.

My hope here is that WeakRefStrings can be used how I always intended, a performance optimization for parsing and allowing users to save that parsing time if the column's values end up not being needed (very common), while still being safe for when they are used (i.e. always convert to String on actual usage).

I'm certainly sympathetic to the safety issues as it's annoying to have things break; but I think we're making good efforts to ensure the appropriate methods are defined for WeakRefStringArray to ensure it always keeps the right memory in reference. Please let me know if you see any issues.

So to recap:

  • we have the weakrefstrings=true keyword argument, which defaults to true; setting to false or streaming to a Data.Sink that doesn't support weakrefstings will always ensure the user gets Vector{String}s in their output
  • WeakRefStringArray is the type returned in Sinks that support streaming WeakRefStrings, which hold a reference to the underlying memory for each WeakRefString element in the array. Upon indexing or performing any operations on the array, the WeakRefStrings are materialized as proper Julia Strings (with the memory copying).

@quinnj quinnj closed this as completed Jan 6, 2018
@spurll
Copy link
Contributor

spurll commented Jan 6, 2018

Sounds good to me. Thanks.

@MichaeLeroy
Copy link

MichaeLeroy commented May 16, 2019

Has this issue reemerged? I recently updated to get to

(v1.1) pkg> st CSV
    Status `C:\Users\xyz\.julia\environments\v1.1\Project.toml`
  [336ed68f] CSV v0.5.2
  [324d7699] CategoricalArrays v0.5.2
  [a93c6f00] DataFrames v0.18.2

This change had caused problems because I had used plain df = CSV.read("foo.csv") moves to get what I had taken to be (row) mutable DataFrames. I've adjusted by using copycols=true or CSV.FIle(..) |> DataFrame when I need mutability. But I still face WeakRefStrings:

julia> io = IOBuffer("name,position,value\nabc,1,2.0\n,4,16.0")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=37, 
maxsize=Inf, ptr=1, mark=-1)

julia> df = CSV.File(io) |> DataFrame
2×3 DataFrame
│ Row │ name    │ position │ value   │
│     │ String⍰ │ Int64    │ Float64 │
├─────┼─────────┼──────────┼─────────┤
│ 1   │ abc     │ 1        │ 2.0     │
│ 2   │ missing │ 4        │ 16.0    │

julia> df[:name]
2-element WeakRefStrings.StringArray{Union{Missing, String},1}:
 "abc"
 missing

This is a painful type to deal with as it breaks DataFrame methods that are important to me. Sure I can convert. my troubles away, but it seems wrong to have to add such boilerplate to code. Is there something that I'm missing?

@nalimilan
Copy link
Member

This is a painful type to deal with as it breaks DataFrame methods that are important to me.

What methods exactly?

@quinnj
Copy link
Member

quinnj commented May 16, 2019

Yes, can you clarify what DataFrame methods break? You're actually not getting WeakRefStrings, but just a WeakRefStrings.StringVector, when you index individual strings, they will always be String.

@MichaeLeroy
Copy link

As I mentioned some of my work entails row mutation of dataframes. For example, I might CSV.read a raw dataset into a dataframe and clean it up with ! methods such as filter! and deleterows!. The toy df that started this thread is not a particularly rich example to work with, but I can contrive a failure with it:

julia> io = IOBuffer("name,position,value\nabc,1,2.0\n,4,16.0")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=37, 
maxsize=Inf, ptr=1, mark=-1)

julia> df = CSV.File(io) |> DataFrame
2×3 DataFrame
│ Row │ name    │ position │ value   │
│     │ String⍰ │ Int64    │ Float64 │
├─────┼─────────┼──────────┼─────────┤
│ 1   │ abc     │ 1        │ 2.0     │
│ 2   │ missing │ 4        │ 16.0    │

julia> filter!(x->!ismissing(x[:name]), df)
ERROR: MethodError: no method matching deleteat!(::WeakRefStrings.StringArray{Union{Missing, 
String},1}, ::Array{Int64,1})
Closest candidates are:
  deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:1213
  deleteat!(::Array{T,1} where T, ::Any) at array.jl:1212
  deleteat!(::BitArray{1}, ::Any) at bitarray.jl:940
  ...

The boilerplate that I alluded to does fix things but looks awfully silly because it is a broadcast conversion of the columns to their own eltype.

julia> io = IOBuffer("name,position,value\nabc,1,2.0\n,4,16.0")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=37, 
maxsize=Inf, ptr=1, mark=-1)

julia> df = CSV.File(io) |> DataFrame
2×3 DataFrame
│ Row │ name    │ position │ value   │
│     │ String⍰ │ Int64    │ Float64 │
├─────┼─────────┼──────────┼─────────┤
│ 1   │ abc     │ 1        │ 2.0     │
│ 2   │ missing │ 4        │ 16.0    │

julia> df = mapcols(x->convert.(eltype(x), x), df)
2×3 DataFrame
│ Row │ name    │ position │ value   │
│     │ String⍰ │ Int64    │ Float64 │
├─────┼─────────┼──────────┼─────────┤
│ 1   │ abc     │ 1        │ 2.0     │
│ 2   │ missing │ 4        │ 16.0    │

julia> df[:name]
2-element Array{Union{Missing, String},1}:
 "abc"
 missing

julia> filter!(x->!ismissing(x[:name]), df)
1×3 DataFrame
│ Row │ name    │ position │ value   │
│     │ String⍰ │ Int64    │ Float64 │
├─────┼─────────┼──────────┼─────────┤
│ 1   │ abc     │ 1        │ 2.0     │

@quinnj
Copy link
Member

quinnj commented May 17, 2019

Ok, this should be fixed with the merging of JuliaData/WeakRefStrings.jl#61. I just tagged a new release of WeakRefStrings, so once that is tagged in the General registry, you can do ] up WeakRefStrings and everything should work.

@MichaeLeroy
Copy link

Thanks. It's interesting, I see the WeakRefStrings version 0.6.1 in the General registry but this package stays at version 0.5.8 when I update. I have no more time today for this, but once I get the new version of this package installed and can do some data processing with it, I'll verify that this fix is sound.

@MichaeLeroy
Copy link

Once I was able to get WeakRefStrings to the current version, my filter! and deleterows! on DataFrames did work. In my environment WeakRefStrings had been pinned to version 0.5.8 by a TextParse compatibility requirements. I'll raise that compatbility issue with TextParse, if no one has done so yet.

@jzazo
Copy link

jzazo commented Sep 2, 2019

I am not sure if this is the same issue, but when I save a large DataFrame with JLD2, and then try to reload it, it gives me the following error:
Warning: type WeakRefStrings.StringArray{String,1} does not exist in workspace; reconstructing

If I write using WeakRefStrings before using FileIO, the error disappears. It seems that the CSV.read(file, copycols=true) is using WeakRefStrings by default, and when storing, this is what it is saved. Why is that? Is it intended?

@quinnj
Copy link
Member

quinnj commented Sep 3, 2019

Yes, it is intended. The default for string columns is to return a WeakRefStrings.StringArray, due to the compact and efficient storage vs. a Vector{String}.

@Hiaubeng
Copy link

Hiaubeng commented Oct 9, 2021

The CSV package gave me a huge breakdown last week. I had a .csv file encoded in 'BIG5HKSCS', and I was able to re-encoded it as UTF-8 by using Pandas in Python.

Then I became interested in using Julia, so I tried the CSV package.

fh = CSV.File(open(file_path, enc"BIG5HKSCS")) works fine, without any warning. However, when I translate the fh into DataFrame as df = DataFrame(fh),

many numerical columns are re-encoded as String17 in the DataFrame. It was lucky for me to google this post mentioning this strange package called 'WeakRefString'.

I totally with the OP that the 'WeakRefString' option should be turned off by default. End users like me will opt to go back to Python-Pandas for loading in CSV files when encountering the above-mentioned 'surprise'!

@nalimilan
Copy link
Member

@not0or1 What do you mean by "many numerical columns are re-encoded as String17"? You mean they are only composed of numbers but they ended being parsed as strings? If so that's a different problem and worth filing a new issue. Anyway, I don't see why String17 would be the problem here.

@Hiaubeng
Copy link

Hiaubeng commented Oct 9, 2021

@not0or1 What do you mean by "many numerical columns are re-encoded as String17"? You mean they are only composed of numbers but they ended being parsed as strings? If so that's a different problem and worth filing a new issue. Anyway, I don't see why String17 would be the problem here.

They are numbers like 3498634992.00. Pandas encoded them as Float64.

Btw, weakrefstrings = false gave me an error message.

MethodError: no method matching read(::String; weakrefstrings=false)
Closest candidates are:
read(::AbstractString, ::Type{String}, ::Encoding) at /Users/username/.julia/packages/StringEncodings/hHXRr/src/StringEncodings.jl:438 got unsupported keyword argument "weakrefstrings"
read(::AbstractString, ::Type{T}) where T at io.jl:434 got unsupported keyword argument "weakrefstrings"
read(::AbstractString, ::Integer, ::Encoding) at /Users/simon/.julia/packages/StringEncodings/hHXRr/src/StringEncodings.jl:434 got unsupported keyword argument "weakrefstrings"
...

Stacktrace:
[1] top-level scope
@ In[9]:2
[2] eval
@ ./boot.jl:360 [inlined]
[3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1116

@Hiaubeng
Copy link

Hiaubeng commented Oct 9, 2021

Anyway, I don't see why String17 would be the problem here.

I couldn't do arithmetic division or multiplication with two String17-encoded data. At this point, I will opt to go back to Python-Pandas.

@nalimilan
Copy link
Member

So the problem isn't String17 nor WeakRefStrings, it's that numbers are not detected as such. There's probably a non-numeric character in these columns (or something that is detected as such due to encoding issues). Maybe missing values marked using a string that CSV.jl doesn't detect by default?

@quinnj
Copy link
Member

quinnj commented Oct 9, 2021

@not0or1, a couple of things to make the discussion more productive:

  • if something isn't working like you'd expect, open a new issue, with a link to the file you're trying to read (either shared publicly from somewhere like google drive, or a link to wherever the data is coming from), and the code you're using to read the file, and potentially a snapshot or copy/paste of the output where you're seeing an error or unexpected types
  • String17 isn't a type that exists anywhere; there's String15 or String31, which operate like a regular String and shouldn't have any noticeable difference as such; obviously, any string type is not a "numeric type" and won't support things like addition/multiplication/etc.
  • With any new package/library you use, I'd recommend reading up on the documentation, which explains things like string/vector types you can expect, how to specify custom missingstring values, and that weakrefstrings isn't a valid keyword argument
  • As @nalimilan mentioned, you probably have a custom "missing value" representation like NA that isn't automatically treated as missing by CSV.jl, but can easily be by just passing missingstring="NA".

@ssingh13-rms
Copy link

ssingh13-rms commented Feb 7, 2022

using stringtype=String should avoid this but I support the suggestion to keep it off by default

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants