Add schemas to datasets #7

simonhkswan · 2024-02-13T09:43:27Z

This PR adds dtypes to all of the datasets.

A few datasets were also edited to improve their quality.

removed the Unnamed: 0 column from:

healthcare
fire-peril
homesite-quotes
price-paid-household
vehicle-insurance

updated values:

air-quality: the Time column was updated to represent timestamps ('12.00.00' -> '12:00:00')
household_power: float columns had '?' replaced with ''
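The two value updates above can be sketched in plain Python. This is only an illustration of the kind of cleaning described, not the script used in the PR: the sample rows, the `CO`, `Voltage`, and `Global_intensity` column names, and the helper functions are all hypothetical.

```python
import csv
import io


def clean_time(value: str) -> str:
    """Turn a dotted time such as '12.00.00' into '12:00:00'."""
    return value.replace(".", ":")


def clean_missing(value: str) -> str:
    """Replace the '?' placeholder with an empty field."""
    return "" if value.strip() == "?" else value


def clean_csv(text, cleaners):
    """Apply per-column cleaner functions to CSV text and return the rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: cleaners.get(col, lambda v: v)(val) for col, val in row.items()}
        for row in reader
    ]


# Hypothetical two-row samples; real files have many more columns.
air_quality = "Time,CO\n12.00.00,2.6\n13.00.00,2.0\n"
household_power = "Voltage,Global_intensity\n234.84,18.4\n?,?\n"

# Only the Time column is rewritten, so decimal points elsewhere are untouched.
air_rows = clean_csv(air_quality, {"Time": clean_time})
power_rows = clean_csv(
    household_power,
    {"Voltage": clean_missing, "Global_intensity": clean_missing},
)
```

Leaving the missing values as empty fields means a CSV reader with a float schema can treat them as nulls instead of failing on the literal '?'.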

src/synthesized_datasets/_datasets.py

tomcarter23 · 2024-02-13T14:13:37Z

src/synthesized_datasets/_dtypes.py

+    DType.DOUBLE: st.DoubleType(),
+    DType.INTEGER: st.IntegerType(),
+    DType.LONG: st.LongType(),
+    DType.NULLABLE_LONG: st.FloatType(),


should this be st.LongType() ?

Spark can't read columns that contain [1.0, , 2.0] because of the decimal points :(

I'm not sure I follow?

Maybe with an example I can make more sense:

Say you have a table with the following columns:

    # some csv file
    colA,colB,colC
    0,1,1.0
    1,,
    3,2,2.0
    0,1,1.0

Both colA and colB can be read into Spark as integers (one being non-null and the other nullable). But Spark can't read colC as an integer column. Well... it does, but it says everything is null, i.e.:

    +----+----+----+
    |colA|colB|colC|
    +----+----+----+
    |   0|   1| NaN|
    |   1|    |    |
    |   3|   2| NaN|
    |   0|   1| NaN|
    +----+----+----+

So whilst pandas can handle the decimal points, Spark cannot. Mapping NULLABLE_LONG to FloatType simply means Spark creates the following dataframe, which I think is good enough for now! (I think we may want to ultimately rewrite the file.)

    +----+----+----+
    |colA|colB|colC|
    +----+----+----+
    |   0|   1| 1.0|
    |   1|    |    |
    |   3|   2| 2.0|
    |   0|   1| 1.0|
    +----+----+----+
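The behaviour described above can be mimicked in plain Python as a rough analogy. This does not call Spark: the `strict_int`, `lenient_float`, and `float_text_to_long` helpers are hypothetical stand-ins for reading a column with an integer schema, reading it with a float schema, and converting float-formatted text back to whole numbers.

```python
import csv
import io

# The example file from the discussion above.
CSV_TEXT = "colA,colB,colC\n0,1,1.0\n1,,\n3,2,2.0\n0,1,1.0\n"


def strict_int(text):
    """Parse integer-formatted text; anything else (including '') becomes None."""
    try:
        return int(text)
    except ValueError:
        return None


def lenient_float(text):
    """Parse float-formatted text; empty or invalid text becomes None."""
    try:
        return float(text)
    except ValueError:
        return None


def float_text_to_long(text):
    """Parse as float first, then narrow to int, keeping missing values as None."""
    value = lenient_float(text)
    return None if value is None else int(value)


def read_column(name, parser):
    """Read one column of CSV_TEXT, applying the parser to every field."""
    reader = csv.DictReader(io.StringIO(CSV_TEXT))
    return [parser(row[name]) for row in reader]


col_b_as_int = read_column("colB", strict_int)          # nullable ints parse fine
col_c_as_int = read_column("colC", strict_int)          # '1.0' is rejected everywhere
col_c_as_float = read_column("colC", lenient_float)     # values survive as floats
col_c_as_long = read_column("colC", float_text_to_long) # one way to recover longs
```

The last column hints at the alternative to a float mapping: if the file itself stores `1` instead of `1.0`, the strict integer parse succeeds and no float detour is needed.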

As discussed, I'll set this to map to long instead of float (+ update the yaml config).

src/synthesized_datasets/_dtypes.py

src/synthesized_datasets/__init__.py

src/synthesized_datasets/_dtypes.py

src/synthesized_datasets/_datasets.py

simonhkswan · 2024-02-13T14:57:32Z

src/synthesized_datasets/_dtypes.py

+    DType.DOUBLE: st.DoubleType(),
+    DType.INTEGER: st.IntegerType(),
+    DType.LONG: st.LongType(),
+    DType.NULLABLE_LONG: st.FloatType(),


Spark can't read columns that contain [1.0, , 2.0] because of the decimal points :(

src/synthesized_datasets/_dtypes.py

simonhkswan · 2024-02-16T11:28:20Z

Tested loading all the datasets as well:

====================== 110 passed, 4 skipped, 126 warnings in 52.84s =======================

tomcarter23

Nice thanks!

Add schemas to datasets and parse from yaml file

fd0ad48

simonhkswan self-assigned this Feb 13, 2024

simonhkswan added 3 commits February 13, 2024 09:48

removed Unnamed: 0 from the csv files

a000187

add spark support for schemas

54af543

add pyspark dtypes

6458e45

tomcarter23 reviewed Feb 13, 2024

src/synthesized_datasets/_datasets.py

tomcarter23 reviewed Feb 13, 2024

src/synthesized_datasets/_dtypes.py (outdated)

simonhkswan commented Feb 13, 2024

tomcarter23 reviewed Feb 13, 2024

src/synthesized_datasets/_dtypes.py

simonhkswan added 6 commits February 13, 2024 17:52

add noaa_100gb_dtypes_set

f82dbcd

update gitignore

c203a11

add pandas type map and dateformats to datasets.yaml

ab07e14

Update _datasets.py

806d67a

Update _dtypes.py

757ceea

Add tests for each dataset

c90dbb4

tomcarter23 approved these changes Feb 16, 2024

simonhkswan merged commit 8fd344f into master Feb 16, 2024

simonhkswan deleted the add_schema branch February 16, 2024 11:51
