Add schemas to datasets #7
Conversation
src/synthesized_datasets/_dtypes.py (Outdated)
DType.DOUBLE: st.DoubleType(),
DType.INTEGER: st.IntegerType(),
DType.LONG: st.LongType(),
DType.NULLABLE_LONG: st.FloatType(),
should this be st.LongType() ?
Spark can't read columns that contain [1.0, , 2.0] because of the decimal points :(
I'm not sure I follow?
Maybe an example will make this clearer.
Say you have a table with the following columns:
# some csv file
colA,colB,colC
0,1,1.0
1,,
3,2,2.0
0,1,1.0
Both colA and colB can be read into Spark as integers (one being non-null and the other being nullable), but Spark can't read colC as an integer column. Well... it does, but it says everything is null, i.e.,
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 0| 1| NaN|
| 1| | |
| 3| 2| NaN|
| 0| 1| NaN|
+----+----+----+
So whilst pandas can handle the decimal points, Spark cannot. Mapping NULLABLE_LONG to FloatType simply means Spark creates the following dataframe, which I think is good enough for now! (I think we may ultimately want to rewrite the file.)
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 0| 1| 1.0|
| 1| | |
| 3| 2| 2.0|
| 0| 1| 1.0|
+----+----+----+
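For concreteness, a minimal PySpark sketch of the behaviour described above (the file name, session setup, and schema variable names are illustrative, not taken from the repo):

from pyspark.sql import SparkSession
import pyspark.sql.types as st

spark = SparkSession.builder.getOrCreate()

# Reading colC (values like '1.0') as a long yields nulls,
# while reading it as a float keeps the values.
long_schema = st.StructType([
    st.StructField("colA", st.LongType(), nullable=False),
    st.StructField("colB", st.LongType(), nullable=True),
    st.StructField("colC", st.LongType(), nullable=True),
])
float_schema = st.StructType([
    st.StructField("colA", st.LongType(), nullable=False),
    st.StructField("colB", st.LongType(), nullable=True),
    st.StructField("colC", st.FloatType(), nullable=True),
])

spark.read.csv("some.csv", header=True, schema=long_schema).show()   # colC comes back all null
spark.read.csv("some.csv", header=True, schema=float_schema).show()  # colC comes back as 1.0, 2.0, ...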
As discussed, I'll set this to map to long instead of float (+ update the yaml config).
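A sketch of what that change might look like (the enum here is an illustrative stand-in for the repo's DType, and the dict name is assumed):

import enum
import pyspark.sql.types as st

class DType(enum.Enum):  # stand-in for the repo's DType enum; member values illustrative
    DOUBLE = "double"
    INTEGER = "integer"
    LONG = "long"
    NULLABLE_LONG = "nullable_long"

SPARK_TYPES = {  # dict name assumed
    DType.DOUBLE: st.DoubleType(),
    DType.INTEGER: st.IntegerType(),
    DType.LONG: st.LongType(),
    DType.NULLABLE_LONG: st.LongType(),  # was st.FloatType()
}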
Fixed!
Tested loading all the datasets as well.
Nice thanks!
This PR adds dtypes to all of the datasets.
A few datasets were also edited to improve their quality:
- removed Unnamed: 0 columns
- updated values: '12.00.00' -> '12:00:00'
- replaced '?' with ''
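For reference, a hedged pandas sketch of the kind of cleanup described above (the file name and the exact replacements are illustrative):

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical path to one of the edited datasets

# Drop the accidental index column left behind by an earlier to_csv() call.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# Fix malformed values, e.g. '12.00.00' -> '12:00:00', and replace '?' placeholders with ''.
df = df.replace({"12.00.00": "12:00:00", "?": ""})

df.to_csv("dataset.csv", index=False)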