read_sql_query type detection when missing data #14314

Closed
stockersky opened this issue Sep 28, 2016 · 12 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@stockersky

stockersky commented Sep 28, 2016

I chose Pandas for an ETL project and encountered data type detection problems.

I already posted on StackOverflow, and the responses pointed out that missing values can cause data type detection errors in Pandas.
Here is the post containing a code example and the behaviour:

http://stackoverflow.com/questions/39298989/python-pandas-dtypes-detection-from-sql

Briefly, it appears that while querying a database, if a tuple has missing fields, the whole column's type is affected: dates are not correctly interpreted, and integers turn into floats.

I understand that type detection can be tricky when working with flat CSV files.
However, since Pandas works on top of a whole database layer (SQLAlchemy, cx_Oracle, DB API), it should have access to metadata describing the column types when working with a database.

Am I missing something? Is this a bug? Or a feature not yet implemented?
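[Editor's note] A minimal, database-free sketch of the behaviour described above: a single missing value changes the inferred dtype of the whole column.

```python
import pandas as pd

# An all-integer column is inferred as int64...
print(pd.Series([1, 2, 3]).dtype)     # int64

# ...but one missing value forces the whole column to float64,
# because pandas has no missing-value representation for integers.
print(pd.Series([1, 2, None]).dtype)  # float64
```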

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-238.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 21.2.1
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Sincerely,

Guillaume

@jreback
Contributor

jreback commented Sep 28, 2016

your link is broken. pls post a reproducible example along with pd.show_versions().

@jreback jreback added the IO SQL to_sql, read_sql, read_sql_query label Sep 28, 2016
@stockersky
Author

Link is fixed and version information is appended to my first post.

@jorisvandenbossche
Member

read_sql_query just gets result sets back, without any column type information. If you use the read_sql_table function, it does use the column type information, through SQLAlchemy.

However, I would expect pandas to be able to handle the missing values, because something like this works:

In [6]: records = [(1, datetime.datetime(2012,1,1,9)), (2, None)]

In [7]: pd.DataFrame.from_records(records)
Out[7]: 
   0                   1
0  1 2012-01-01 09:00:00
1  2                 NaT

In [8]: pd.DataFrame.from_records(records).dtypes
Out[8]: 
0             int64
1    datetime64[ns]
dtype: object

Would have to dig further to see why you are getting object dtype when there are missing values.
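[Editor's note] As a practical workaround (a sketch, not part of the thread's resolution): when read_sql_query leaves a column as object dtype, the conversion can be done explicitly with pd.to_datetime, which maps None to NaT.

```python
import pandas as pd

# Simulate a result-set column that pandas left as object dtype.
s = pd.Series([None, None, None, "2012-01-01 09:00:00"])
print(s.dtype)                  # object

# Explicit conversion: None becomes NaT, dtype becomes datetime64[ns].
converted = pd.to_datetime(s)
print(converted.dtype)          # datetime64[ns]
```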

@TomAugspurger
Contributor

TomAugspurger commented Sep 29, 2016

Better example from #14319 (not datetime, but the same general issue of read_sql_query not getting type info from the database):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute("create table test (a float)")
for _ in range(5):
    conn.execute("insert into test values (NULL)")

df = pd.read_sql_query("select * from test", engine, coerce_float=True)

In [10]: df.a
Out[10]:
0    None
1    None
2    None
3    None
4    None
Name: a, dtype: object
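[Editor's note] The same outcome can be reproduced without a database (a sketch): a column containing only None values stays object dtype, and an explicit cast is needed to get the intended float column.

```python
import numpy as np
import pandas as pd

# A column of only missing values is left as object dtype.
df = pd.DataFrame({"a": [None] * 5})
print(df["a"].dtype)                # object

# Explicit cast: None becomes NaN, dtype becomes float64.
df["a"] = df["a"].astype(np.float64)
print(df["a"].dtype)                # float64
```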

@stockersky
Author

stockersky commented Sep 29, 2016

Here is an example of integers being detected as floats while querying an Oracle database.

Oracle clearly describes the 'THA_ID', 'TYA_ID' and 'DOA_ID' fields as integers:
Nom                           NULL     Type          
----------------------------- -------- ------------- 
ACP_ID                        NOT NULL NUMBER(10)    
PAR_ID                        NOT NULL NUMBER(10)    
THA_ID                                 NUMBER(10)    
TYA_ID                                 NUMBER(10) 
DOA_ID                                 NUMBER(10)

requete = 'select * from ACTION_PARTENAIRE where rownum < 100'
df = pd.read_sql_query(requete, engine)

df.dtypes

acp_id                                    int64
par_id                                    int64
tha_id                                  float64
tya_id                                  float64
doa_id                                  float64


Those fields don't have a NOT NULL constraint...
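[Editor's note] A sketch of what later pandas versions offer for exactly this case (not available at the time of this thread, which used pandas 0.18): since pandas 0.24 there is a nullable integer extension dtype, "Int64", that keeps values integral while still representing missing data.

```python
import pandas as pd

# A nullable NUMBER(10) column read back as float64 because of NULLs:
s = pd.Series([1.0, 2.0, None])
print(s.dtype)             # float64

# The nullable integer extension dtype (pandas >= 0.24) keeps the
# values as integers and shows missing entries as <NA>.
nullable = s.astype("Int64")
print(nullable.dtype)      # Int64
```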

@jorisvandenbossche
Member

@stockersky The dtype in pandas does not depend on the column types in the SQL database when using read_sql_query. The fact that you get float64 dtype probably means that those columns actually do contain missing values. And since pandas cannot represent missing values in an integer column, you get a float column.

@jorisvandenbossche
Member

@stockersky Here is an example where I show that a missing value in a datetime column does not necessarily lead to a change in the dtype of the resulting dataframe: http://nbviewer.jupyter.org/gist/jorisvandenbossche/ef55675d25296d741726a20adf85211f
Can you give a bit more information about your case so we can see where it goes wrong? Does a simple example like the one above also fail for you?

@stockersky
Author

stockersky commented Sep 30, 2016

@jorisvandenbossche Thanks for spending time on this issue.

About the missing value in a datetime column: I have run some more tests and found an interesting behaviour:

  • if only the first two records have a missing value in that field, pandas detects the datetime type correctly,
  • if the first three records have a missing value in that field, pandas fails to detect the datetime type.

example :

this succeeds (type is np.datetime64):

    doi_date_decision
0   None
1   None
2   2013-09-11 00:00:00
3   2013-09-11 00:00:00

this fails (type is object):

    doi_date_decision
0   None
1   None
2   None
3   2013-09-11 00:00:00

It seems like pandas picks up a sample of the result (the first three records?) to infer types...

@stockersky
Author

stockersky commented Dec 8, 2016

I can confirm with an example: if there are more than two missing values at the beginning of a dataframe column, pandas' type detection fails.

Here is a gist, based on the one proposed by @jorisvandenbossche, showing this behaviour:
http://nbviewer.jupyter.org/gist/Stockersky/555760946ebf7fe04c879d6083f84de6

With only two missing values, type detection works. But with one more missing value, it fails...

@jorisvandenbossche
Member

@stockersky That observation indeed seems correct. You see the same thing without any SQL code involved:

In [10]: records1 = [(1, None), (2, None), (3, datetime.datetime(2012,1,1,9))]

In [11]: pd.DataFrame.from_records(records1)
Out[11]: 
   0                   1
0  1                 NaT
1  2                 NaT
2  3 2012-01-01 09:00:00

In [12]: pd.DataFrame.from_records(records1).dtypes
Out[12]: 
0             int64
1    datetime64[ns]
dtype: object

In [13]: records2 = [(1, None), (2, None), (3, None), (4, datetime.datetime(2012,1,1,9))]

In [14]: pd.DataFrame.from_records(records2)
Out[14]: 
   0                    1
0  1                 None
1  2                 None
2  3                 None
3  4  2012-01-01 09:00:00

In [15]: pd.DataFrame.from_records(records2).dtypes
Out[15]: 
0     int64
1    object
dtype: object

In [16]: pd.DataFrame(records2).dtypes
Out[16]: 
0     int64
1    object
dtype: object

@jorisvandenbossche
Member

The underlying code does indeed only check the first 3 items:

sample = v[:min(3, len(v))]

@jreback Are there ways to work around this? Or should we settle for the fact that the user can do the conversion to datetime manually after getting the result?
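[Editor's note] The manual conversion mentioned above can be sketched as follows (the column names are made up for illustration):

```python
import datetime
import pandas as pd

records = [(1, None), (2, None), (3, None), (4, datetime.datetime(2012, 1, 1, 9))]
df = pd.DataFrame.from_records(records, columns=["id", "when"])
# On pandas 0.18 the "when" column comes back as object dtype,
# because the first three sampled values are all None.

# Explicit conversion after the fact; None values become NaT.
df["when"] = pd.to_datetime(df["when"])
print(df.dtypes)   # "when" is now datetime64[ns]
```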

@TomAugspurger
Contributor

Closing in favor of #10285
