read_sql_query type detection when missing data #14314

Closed
stockersky opened this issue Sep 28, 2016 · 12 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@stockersky

stockersky commented Sep 28, 2016

I chose Pandas for an ETL project and encountered data type detection problems.

I already posted on StackOverflow, and the responses pointed out that missing values can cause data type detection errors in Pandas.
Here is the post containing a code example and the behaviour:

http://stackoverflow.com/questions/39298989/python-pandas-dtypes-detection-from-sql

Briefly, it appears that while querying a database, if a tuple has missing fields, the whole column's type is affected: dates are not correctly interpreted, and integers turn into floats.

I understand that type detection can be tricky when working with flat CSV files.
However, since Pandas works on top of a whole database layer (SQLAlchemy, cx_Oracle, DB API), it should have access to metadata describing the column types when working with a database.

Am I missing something? Is this a bug? Or a feature not yet implemented?
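[Editor's note] A minimal, database-free sketch of the behaviour described above: a single missing value changes the inferred dtype of the whole column.

```python
import pandas as pd

# An all-integer column is inferred as int64...
print(pd.Series([1, 2, 3]).dtype)     # int64

# ...but one missing value forces the whole column to float64,
# because pandas has no missing-value representation for integers.
print(pd.Series([1, 2, None]).dtype)  # float64
```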

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-238.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 21.2.1
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Sincerely,

Guillaume

@jreback
Contributor

jreback commented Sep 28, 2016

your link is broken. pls post a reproducible example along with pd.show_versions().

@jreback jreback added the IO SQL to_sql, read_sql, read_sql_query label Sep 28, 2016
@stockersky
Author

Link is fixed and version information is appended to my first post.

@jorisvandenbossche
Member

read_sql_query just gets result sets back, without any column type information. If you use the read_sql_table function, it does use the column type information, through SQLAlchemy.

However, I would expect pandas to be able to handle the missing values, because something like this works:

In [6]: records = [(1, datetime.datetime(2012,1,1,9)), (2, None)]

In [7]: pd.DataFrame.from_records(records)
Out[7]: 
   0                   1
0  1 2012-01-01 09:00:00
1  2                 NaT

In [8]: pd.DataFrame.from_records(records).dtypes
Out[8]: 
0             int64
1    datetime64[ns]
dtype: object

Would have to dig further to see why you are getting object dtype when there are missing values.
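[Editor's note] As a practical workaround (a sketch, not part of the thread's resolution): when read_sql_query leaves a column as object dtype, the conversion can be done explicitly with pd.to_datetime, which maps None to NaT.

```python
import pandas as pd

# Simulate a result-set column that pandas left as object dtype.
s = pd.Series([None, None, None, "2012-01-01 09:00:00"])
print(s.dtype)                  # object

# Explicit conversion: None becomes NaT, dtype becomes datetime64[ns].
converted = pd.to_datetime(s)
print(converted.dtype)          # datetime64[ns]
```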

@TomAugspurger
Contributor

TomAugspurger commented Sep 29, 2016

Better example from #14319 (not datetime, but the same general issue of read_sql_query not getting type info from the database):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute("create table test (a float)")
for _ in range(5):
    conn.execute("insert into test values (NULL)")

df = pd.read_sql_query("select * from test", engine, coerce_float=True)

In [10]: df.a
Out[10]:
0    None
1    None
2    None
3    None
4    None
Name: a, dtype: object
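[Editor's note] The same outcome can be reproduced without a database (a sketch): a column containing only None values stays object dtype, and an explicit cast is needed to get the intended float column.

```python
import numpy as np
import pandas as pd

# A column of only missing values is left as object dtype.
df = pd.DataFrame({"a": [None] * 5})
print(df["a"].dtype)                # object

# Explicit cast: None becomes NaN, dtype becomes float64.
df["a"] = df["a"].astype(np.float64)
print(df["a"].dtype)                # float64
```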

@stockersky
Author

stockersky commented Sep 29, 2016

Here is an example of integers being detected as floats while querying an Oracle database.

Oracle clearly describes the 'THA_ID', 'TYA_ID' and 'DOA_ID' fields as integers:
Nom                           NULL     Type          
----------------------------- -------- ------------- 
ACP_ID                        NOT NULL NUMBER(10)    
PAR_ID                        NOT NULL NUMBER(10)    
THA_ID                                 NUMBER(10)    
TYA_ID                                 NUMBER(10) 
DOA_ID                                 NUMBER(10)

requete = 'select * from ACTION_PARTENAIRE where rownum < 100'
df = pd.read_sql_query(requete, engine)

df.dtypes

acp_id                                    int64
par_id                                    int64
tha_id                                  float64
tya_id                                  float64
doa_id                                  float64


Those fields don't have a NOT NULL constraint...
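[Editor's note] A sketch of what later pandas versions offer for exactly this case (not available at the time of this thread, which used pandas 0.18): since pandas 0.24 there is a nullable integer extension dtype, "Int64", that keeps values integral while still representing missing data.

```python
import pandas as pd

# A nullable NUMBER(10) column read back as float64 because of NULLs:
s = pd.Series([1.0, 2.0, None])
print(s.dtype)             # float64

# The nullable integer extension dtype (pandas >= 0.24) keeps the
# values as integers and shows missing entries as <NA>.
nullable = s.astype("Int64")
print(nullable.dtype)      # Int64
```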

@jorisvandenbossche
Member

@stockersky The dtype in pandas does not depend on the column types in the SQL database when using read_sql_query. The fact that you get float64 dtype probably means that those columns actually do contain missing values. And since pandas cannot represent missing values in an integer column, you get a float column.

@jorisvandenbossche
Member

@stockersky Here is an example where I show that a missing value in a datetime column does not necessarily lead to a change in the dtype of the resulting dataframe: http://nbviewer.jupyter.org/gist/jorisvandenbossche/ef55675d25296d741726a20adf85211f
Can you give a bit more information about your case so we can see where it goes wrong? Does a simple example like the one above also fail for you?

@stockersky
Author

stockersky commented Sep 30, 2016

@jorisvandenbossche Thanks for spending time on this issue.

About the missing value in a datetime column: I have run some more tests and found an interesting behaviour:

  • if only the first two records have a missing value in that field, pandas detects the datetime type correctly,
  • if the first three records have a missing value in that field, pandas fails to detect the datetime type.

example :

this succeeds (type is np.datetime64):

    doi_date_decision
0   None
1   None
2   2013-09-11 00:00:00
3   2013-09-11 00:00:00

this fails (type is object):

    doi_date_decision
0   None
1   None
2   None
3   2013-09-11 00:00:00

It seems like pandas picks up a sample of the result (the first three records?) to infer types...

@stockersky
Author

stockersky commented Dec 8, 2016

I can confirm with an example: if there are more than two missing values at the beginning of a dataframe column, pandas' type detection fails.

Here is a gist, based on the one proposed by @jorisvandenbossche, showing this behaviour:
http://nbviewer.jupyter.org/gist/Stockersky/555760946ebf7fe04c879d6083f84de6

With only two missing values, type detection works. But with one more missing value, it fails...

@jorisvandenbossche
Member

@stockersky That observation indeed seems correct. You see the same thing without any SQL code involved:

In [10]: records1 = [(1, None), (2, None), (3, datetime.datetime(2012,1,1,9))]

In [11]: pd.DataFrame.from_records(records1)
Out[11]: 
   0                   1
0  1                 NaT
1  2                 NaT
2  3 2012-01-01 09:00:00

In [12]: pd.DataFrame.from_records(records1).dtypes
Out[12]: 
0             int64
1    datetime64[ns]
dtype: object

In [13]: records2 = [(1, None), (2, None), (3, None), (4, datetime.datetime(2012,1,1,9))]

In [14]: pd.DataFrame.from_records(records2)
Out[14]: 
   0                    1
0  1                 None
1  2                 None
2  3                 None
3  4  2012-01-01 09:00:00

In [15]: pd.DataFrame.from_records(records2).dtypes
Out[15]: 
0     int64
1    object
dtype: object

In [16]: pd.DataFrame(records2).dtypes
Out[16]: 
0     int64
1    object
dtype: object

@jorisvandenbossche
Member

The underlying code does indeed only check the first 3 items:

sample = v[:min(3, len(v))]

@jreback Are there ways to work around this? Or should we settle for the fact that the user can do the conversion to datetime manually after getting the result?
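[Editor's note] The manual conversion mentioned above can be sketched as follows (the column names are made up for illustration):

```python
import datetime
import pandas as pd

records = [(1, None), (2, None), (3, None), (4, datetime.datetime(2012, 1, 1, 9))]
df = pd.DataFrame.from_records(records, columns=["id", "when"])
# On pandas 0.18 the "when" column comes back as object dtype,
# because the first three sampled values are all None.

# Explicit conversion after the fact; None values become NaT.
df["when"] = pd.to_datetime(df["when"])
print(df.dtypes)   # "when" is now datetime64[ns]
```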

@TomAugspurger
Contributor

Closing in favor of #10285
