Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't infer schema on tmp tables #83

Open
hunterowens opened this issue Sep 25, 2019 · 8 comments
Open

don't infer schema on tmp tables #83

hunterowens opened this issue Sep 25, 2019 · 8 comments

Comments

@hunterowens
Copy link
Collaborator

if using stage_first, tmp table schema is inferred from the dataframe.

Rather, the tmp table schema should be copied from the table in which you are going to upsert from.

@thekaveman
Copy link
Contributor

It's funny because the initial reason for staging was precisely because incoming data didn't match the final destination table schema. The tmp tables were "dump the raw incoming data here" targets, with transformations applied at the database level to convert to expected types.

Consider all the timestamp fields for example: the incoming type int epoch milliseconds doesn't match the final destination type timestamptz (assuming you're using Postgres and making that conversion vs. storing the raw value). It doesn't matter if pandas decides to infer a tmp timestamp column as int or float since we perform the cast/conversion explicitly when copying from tmp to final.

Maybe another option here is to ditch staging entirely, and enforce the schema via pandas/Dataframe semantics vs. the INSERT INTO... SELECT...FROM dance. My pandas-fu is lacking in how to deal with e.g. our enumerated types.

@ian-r-rose
Copy link
Contributor

The specific problem we were seeing is with the associated_trips field. Since it is null much of the time, Pandas seems to be defaulting to nullable-double, which is not correct, and not castable to UUID.

Perhaps a way to satisfy both constraints here is to force all the columns of the temp table to be a TEXT type (since they are coming from JSON text, typically). I think that should be castable to everything we care about, right?

@ian-r-rose
Copy link
Contributor

With regards to Pandas -- it's type system is somewhat less expressive than PostgreSQL, though it has recently gained the ability to extend the type system via extension arrays. I don't think that is likely worth our time, though.

An aside, extension arrays have just landed in GeoPandas, which is very exciting!

@thekaveman
Copy link
Contributor

@ian-r-rose null associated_trips is why we originally introduced some of that preprocessing around adding missing nullable columns to the temp dataframe, and dtype-casting the associated_trips column to object. See #28 and the fix in #29. You're still getting errors related to this though?

@ian-r-rose
Copy link
Contributor

Yes, the error looks very similar to #28 (though with uuid instead of uuid[], natch). The fix from #29 should no longer be relevant, right?

@thekaveman
Copy link
Contributor

thekaveman commented Sep 30, 2019

@ian-r-rose
Copy link
Contributor

Is there a reason that you have cast it to 'object' and not str (at least in the case of MDS > 0.3.0)?

@thekaveman
Copy link
Contributor

@ian-r-rose I think that is probably a left-over from MDS 0.2.x days when this was an array field. Would a cast to str here fix your issue you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants