don't infer schema on tmp tables #83

hunterowens · 2019-09-25T20:52:32Z

if using stage_first, tmp table schema is inferred from the dataframe.

Rather, the tmp table schema should be copied from the table in which you are going to upsert from.

The text was updated successfully, but these errors were encountered:

thekaveman · 2019-09-26T00:24:13Z

It's funny because the initial reason for staging was precisely because incoming data didn't match the final destination table schema. The tmp tables were "dump the raw incoming data here" targets, with transformations applied at the database level to convert to expected types.

Consider all the timestamp fields for example: the incoming type int epoch milliseconds doesn't match the final destination type timestamptz (assuming you're using Postgres and making that conversion vs. storing the raw value). It doesn't matter if pandas decides to infer a tmp timestamp column as int or float since we perform the cast/conversion explicitly when copying from tmp to final.

Maybe another option here is to ditch staging entirely, and enforce the schema via pandas/Dataframe semantics vs. the INSERT INTO... SELECT...FROM dance. My pandas-fu is lacking in how to deal with e.g. our enumerated types.

ian-r-rose · 2019-09-30T18:28:54Z

The specific problem we were seeing is with the associated_trips field. Since it is null much of the time, Pandas seems to be defaulting to nullable-double, which is not correct, and not castable to UUID.

Perhaps a way to satisfy both constraints here is to force all the columns of the temp table to be a TEXT type (since they are coming from JSON text, typically). I think that should be castable to everything we care about, right?

ian-r-rose · 2019-09-30T19:09:11Z

With regards to Pandas -- it's type system is somewhat less expressive than PostgreSQL, though it has recently gained the ability to extend the type system via extension arrays. I don't think that is likely worth our time, though.

An aside, extension arrays have just landed in GeoPandas, which is very exciting!

thekaveman · 2019-09-30T20:56:09Z

@ian-r-rose null associated_trips is why we originally introduced some of that preprocessing around adding missing nullable columns to the temp dataframe, and dtype-casting the associated_trips column to object. See #28 and the fix in #29. You're still getting errors related to this though?

ian-r-rose · 2019-09-30T21:03:15Z

Yes, the error looks very similar to #28 (though with uuid instead of uuid[], natch). The fix from #29 should no longer be relevant, right?

thekaveman · 2019-09-30T22:57:01Z

The fix from #29 evolved but is still relevant, see https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L234 and specifically for this discussion https://github.com/CityofSantaMonica/mds-provider/blob/master/mds/db/db.py#L255

ian-r-rose · 2019-10-02T16:18:09Z

Is there a reason that you have cast it to 'object' and not str (at least in the case of MDS > 0.3.0)?

thekaveman · 2019-10-14T20:11:13Z

@ian-r-rose I think that is probably a left-over from MDS 0.2.x days when this was an array field. Would a cast to str here fix your issue you think?

hunterowens mentioned this issue Sep 25, 2019

fix ETLs for scooter companies CityOfLosAngeles/aqueduct#78

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

don't infer schema on tmp tables #83

don't infer schema on tmp tables #83

hunterowens commented Sep 25, 2019

thekaveman commented Sep 26, 2019

ian-r-rose commented Sep 30, 2019

ian-r-rose commented Sep 30, 2019

thekaveman commented Sep 30, 2019

ian-r-rose commented Sep 30, 2019

thekaveman commented Sep 30, 2019 •

edited

Loading

ian-r-rose commented Oct 2, 2019

thekaveman commented Oct 14, 2019

don't infer schema on tmp tables #83

don't infer schema on tmp tables #83

Comments

hunterowens commented Sep 25, 2019

thekaveman commented Sep 26, 2019

ian-r-rose commented Sep 30, 2019

ian-r-rose commented Sep 30, 2019

thekaveman commented Sep 30, 2019

ian-r-rose commented Sep 30, 2019

thekaveman commented Sep 30, 2019 • edited Loading

ian-r-rose commented Oct 2, 2019

thekaveman commented Oct 14, 2019

thekaveman commented Sep 30, 2019 •

edited

Loading