WeakRefString should default to off #132
Comments
I'm sure this is due to DataFrames' |
|
Perhaps I'm missing something, but it looks like the
|
I think the bug you highlight no longer happens with JuliaData/WeakRefStrings.jl#17. Can you confirm? But I agree we should fix these issues very soon, they are really problematic and will make people suspicious about |
I can confirm that the bug no longer occurs with that PR. |
To be clear, what I mean here is I think |
I concur; weak references need to be treated carefully, and the end user might not be expecting them to pop up here (there is also added uncertainty because we can sometimes end up with categoricals instead). |
What can we do to move this forward? This issue is holding up RDatasets, which is holding up several other packages from updating to DataFrames 0.11. |
Planning on diving in this week. |
Thanks...if you need any help testing or whatever, feel free to ping me |
Unfortunately I'm running into some weird 0.7 IR issues in the string parsing codepaths, which slowed me down, but the issue's been reported, so hopefully I'll have more to post soon. |
#143 fixes all known compat issues w/ 0.6/0.7. A new tag for WeakRefStrings was just merged which included a few changes to make WeakRefStrings even safer. In fact, users should never even see a WeakRefString themselves, since indexing a WeakRefStringArray returns a String. Basically, you have to reach into the WeakRefStrings internals in order to get direct access to a WeakRefString. My hope here is that WeakRefStrings can be used how I always intended: as a performance optimization for parsing that lets users save the parsing time if the column's values end up not being needed (very common), while still being safe when they are used (i.e. always convert to String on actual usage). I'm certainly sympathetic to the safety issues, as it's annoying to have things break, but I think we're making good efforts to ensure the appropriate methods are defined for WeakRefStringArray so that it always keeps the right memory referenced. Please let me know if you see any issues. So to recap:
|
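The idea described above (keep the parsed bytes around, and only build a real String when an element is actually indexed) can be illustrated with a small Base-only sketch. This is not the WeakRefStrings.jl implementation; the type `LazyStringColumn` and its fields are hypothetical names made up for illustration.

```julia
# Hypothetical illustration only; NOT how WeakRefStrings.jl is implemented.
# The column owns the parsed buffer, so the bytes stay alive as long as
# the column does, and indexing always hands back an independent String.
struct LazyStringColumn <: AbstractVector{String}
    buffer::Vector{UInt8}           # raw bytes of the parsed file
    ranges::Vector{UnitRange{Int}}  # byte range of each cell in `buffer`
end

Base.size(col::LazyStringColumn) = (length(col.ranges),)
Base.IndexStyle(::Type{LazyStringColumn}) = IndexLinear()

# Slicing the buffer copies the bytes, so the returned String does not
# alias the buffer and stays valid even if the buffer changes later.
Base.getindex(col::LazyStringColumn, i::Int) = String(col.buffer[col.ranges[i]])

buffer = Vector{UInt8}("alpha,beta,gamma")
col = LazyStringColumn(buffer, [1:5, 7:10, 12:16])

s = col[2]           # materialized as a plain String on access
fill!(buffer, 0x00)  # overwrite the buffer to show that `s` is independent
```

Cells that are never indexed never pay for a String allocation, which is the parsing-time saving mentioned above; the cost is one copy per access.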
Sounds good to me. Thanks. |
Has this issue reemerged? I recently updated to get to
This change had caused problems because I had used plain
This is a painful type to deal with as it breaks DataFrame methods that are important to me. Sure I can |
What methods exactly? |
Yes, can you clarify what DataFrame methods break? You're actually not getting WeakRefStrings, but just a |
As I mentioned some of my work entails row mutation of dataframes. For example, I might
The boilerplate that I alluded to does fix things but looks awfully silly because it is a broadcast conversion of the columns to their own
|
Ok, this should be fixed with the merging of JuliaData/WeakRefStrings.jl#61. I just tagged a new release of WeakRefStrings, so once that is tagged in the General registry, you can do |
Thanks. It's interesting, I see the WeakRefStrings version 0.6.1 in the General registry but this package stays at version 0.5.8 when I update. I have no more time today for this, but once I get the new version of this package installed and can do some data processing with it, I'll verify that this fix is sound. |
Once I was able to get WeakRefStrings to the current version, my |
I am not sure if this is the same issue, but when I save a large DataFrame with JLD2, and then try to reload it, it gives me the following error: If I write |
Yes, it is intended. The default for string columns is to return a |
The CSV package gave me a huge headache last week. I had a .csv file encoded in 'BIG5HKSCS', and I was able to re-encode it as UTF-8 by using Pandas in Python. Then I became interested in using Julia, so I tried the CSV package. fh = CSV.File(open(file_path, enc"BIG5HKSCS")) works fine, without any warning. However, when I convert fh into a DataFrame with df = DataFrame(fh), many numerical columns come out as String17 in the DataFrame. I was lucky to find this post by googling; it mentions this strange package called 'WeakRefString'. I totally agree with the OP that the 'WeakRefString' option should be turned off by default. End users like me will opt to go back to Python-Pandas for loading CSV files when encountering the above-mentioned 'surprise'! |
@not0or1 What do you mean by "many numerical columns are re-encoded as String17"? You mean they are only composed of numbers but they ended up being parsed as strings? If so that's a different problem and worth filing a new issue. Anyway, I don't see why |
They are numbers like 3498634992.00. Pandas encoded them as Float64. Btw, weakrefstrings = false gave me an error message. MethodError: no method matching read(::String; weakrefstrings=false) Stacktrace: |
I couldn't do arithmetic division or multiplication with two String17-encoded values. At this point, I will opt to go back to Python-Pandas. |
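For readers who hit the same symptom (a column that looks numeric but was parsed as strings, so arithmetic fails), here is a minimal Base-only sketch of converting such values to Float64 after the fact. The sample data is made up, and plain String values stand in for String17; the same calls are expected to work for any AbstractString element type.

```julia
# Made-up sample standing in for a column that was parsed as strings.
raw = ["3498634992.00", "12.5", "n/a"]

# tryparse returns `nothing` for text that is not a number; map those
# cells to `missing` so that one bad cell does not abort the conversion.
parsed = [something(tryparse(Float64, s), missing) for s in raw]

# Arithmetic now works on the numeric entries.
ratio = parsed[1] / parsed[2]
```

A column usually falls back to strings because at least one cell is not a valid number, so inspecting the cells that became `missing` is a good first step. Recent CSV.jl releases also appear to accept a `stringtype` keyword (for example `stringtype=String`) in place of the older `weakrefstrings` option; check the documentation for your installed version before relying on it.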
So the problem isn't |
@not0or1, a couple of things to make the discussion more productive:
|
using |
I think that CSV should default to using a regular String rather than using WeakRefString, as there are still lots of cases where users can be surprised. For example:
These kinds of errors can lead to very subtle problems (for example, a failing DataFrame join) which can be very hard to track down. I think we should make WeakRefString default to off and allow users to specify the option if they want it.