
Pandas import error #17

Closed
Salehflaconi opened this issue Mar 3, 2019 · 16 comments

Comments

@Salehflaconi

Salehflaconi commented Mar 3, 2019

I am using the script below to move data from a local file into Exasol. The error appears in the import part.

import boto3
import botocore
import pyexasol
import pandas as pd
import exa_cred as exa
import Bucket_info as Bk_info
import datetime
import base64

#reading credential info and bucket info from local file
exuser = exa.cred['exuser']
expwd = exa.cred['expwd']
exdsn = exa.cred['exdsn']
schema=exa.cred['schema']

C = pyexasol.connect(dsn=exdsn, user=exuser, password=expwd,schema=schema)

#Bucket information
Bucket = Bk_info.Bucket['Bucket_name']
Key = "file.csv"
outPutName = "file.csv"

#Handling exception of not found and downloading the file locally
s3 = boto3.resource('s3')
try:
    s3.Bucket(Bucket).download_file(Key, outPutName)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

#reading the data with pandas
print(str(datetime.datetime.now()) + " - Start script ...\n")

df = pd.read_csv(outPutName, delimiter=';', skipinitialspace=True, engine='python', quotechar='"')

print("Testing that pandas is working")
print(df.head(10))

#Import part of pandas DataFrame into Exasol table
print(df.iloc[[0,1], [0,1,2]])
C.import_from_pandas(df.iloc[[0], [0,1,2]], 'file')

The import error is:

Traceback (most recent call last):
File "C:/...../script.py", line 55, in <module>
C.import_from_pandas(df.iloc[[0], [0,1,2]], 'script')
File "C:....\lib\site-packages\pyexasol\connection.py", line 213, in import_from_pandas
return self.import_from_callback(cb.import_from_pandas, src, table, callback_params)
File "C:\lib\site-packages\pyexasol\connection.py", line 300, in import_from_callback
raise sql_thread.exc
File "C:...\site-packages\pyexasol\connection.py", line 290, in import_from_callback
sql_thread.join()
File "C:...\site-packages\pyexasol\http_transport.py", line 53, in join
raise self.exc
File "C:...\site-packages\pyexasol\http_transport.py", line 37, in run
self.run_sql()
File "C:...\site-packages\pyexasol\http_transport.py", line 165, in run_sql
self.connection.execute(query)
File "C:...\site-packages\pyexasol\connection.py", line 140, in execute
self.last_stmt = self.cls_statement(self, query, query_params)
File "C:...\site-packages\pyexasol\statement.py", line 47, in __init__
self._execute()
File "C:.....\statement.py", line 141, in _execute
'sqlText': self.query,
File "C:...\site-packages\pyexasol\connection.py", line 442, in req
raise cls_err(self, req['sqlText'], ret['exception']['sqlCode'], ret['exception']['text'])
pyexasol.exceptions.ExaQueryError:
(
'] (Session: xxxx)
dsn => xxxx
user => xxxx
schema => xxxx
code => xxxx
session_id => xxxx
query => xxxx
AT 'Server_IP:port' FILE '000.csv'
)

Process finished with exit code 1

The following versions are used:
Exasol = 6.0.8
Python = 3.7.2
rsa = 3.4.2
pandas = 0.24.1
websocket-client = 0.55.0
pyexasol = 0.5.2
OS = Windows

@littleK0i (Collaborator)

littleK0i commented Mar 3, 2019

It is a bit weird that the specific error message is not visible. Did you cut it out by any chance? It should be somewhere in this block:

'] (Session: xxxx)
dsn => xxxx
user => xxxx
schema => xxxx
code => xxxx
session_id => xxxx
query => xxxx
AT 'Server_IP:port' FILE '000.csv'
)

Also, you do not really need pandas to import this data. Just try calling .import_from_file if it is a proper CSV file already.

Maybe you do not even need HTTP transport at all. It is possible to import data directly from the HTTP address of a file in a bucket.
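The two alternatives above might look roughly like the sketch below. The table name, bucket URL, and file name are made-up placeholders, not taken from this thread:

```python
# Variant 2: have Exasol pull the file itself over HTTP(S), so the
# client needs no HTTP transport at all. Placeholders throughout.
import_sql = (
    "IMPORT INTO my_table FROM CSV "
    "AT 'https://my-bucket.s3.amazonaws.com' FILE 'file.csv'"
)

# With an open pyexasol connection C, variant 1 (local CSV, no pandas)
# would be:
#   C.import_from_file('file.csv', 'my_table')
# and variant 2:
#   C.execute(import_sql)
print(import_sql)
```

Variant 2 also avoids downloading the file to the client machine first.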

@Salehflaconi (Author)

No, I didn't cut the error, which is weird, as I don't know how to google it.

I am using pandas because I will transform some data in the file; that's why I need to download the file first and then do some transformations on the data.

@littleK0i (Collaborator)

littleK0i commented Mar 3, 2019

Hmm, ok. Let's try to find the actual error in EXA_DBA_AUDIT_SQL.

If the auditing flag is enabled, try to find the failing IMPORT query in this system view. It provides an ERROR_TEXT column. Let's see what it says.
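A query along those lines might look like this sketch; the column names (SUCCESS, COMMAND_NAME, ERROR_TEXT, START_TIME) are assumed from the Exasol system-view documentation, not confirmed in this thread:

```python
# Sketch: fetch the error text of recent failed IMPORT statements
# from the auditing view (requires auditing to be enabled and
# sufficient privileges to read EXA_DBA_AUDIT_SQL).
audit_query = """
    SELECT SESSION_ID, START_TIME, ERROR_TEXT, SQL_TEXT
    FROM EXA_DBA_AUDIT_SQL
    WHERE SUCCESS = FALSE
      AND COMMAND_NAME = 'IMPORT'
    ORDER BY START_TIME DESC
"""
# With an open pyexasol connection C:
#   for row in C.execute(audit_query):
#       print(row)
print(audit_query)
```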

@littleK0i (Collaborator)

If that does not help, could you also try enabling the debug=True connection option and post the last request and response here?

Let's see if this error is truly anonymous. Maybe you discovered a completely new type of execute response which has not been seen before.

@littleK0i (Collaborator)

Do you have a firewall on your local laptop which blocks connections on custom ports? Could you try disabling all firewalls?

@Salehflaconi (Author)

I spent yesterday working on it.

The problem was mismatched datatypes between the default DataFrame and the data itself. I don't know why the error message was weird!

Thank you for your help.

@littleK0i (Collaborator)

littleK0i commented Mar 5, 2019

@Salehflaconi, could you attach minimal reproducible code (1 row of data in pandas and the table structure) or a request log produced by the connection option debug=True?

The lack of an error message is not normal in such a trivial case. I would be happy to investigate it.

Thank you.

@Salehflaconi (Author)

Of course, I will be happy to do that with you, but would you please help me enable debugging, as I haven't done it before? I also enabled auditing, but it didn't show any error.

How I solved part of the errors: I changed the datatypes of each column of the dataframe:

#Changing the data types to be able to import
df[0]= df[0].astype(str)
df[1]= df[1].astype(str)
df[2]= pd.to_numeric(df[2])
df[3] = pd.to_datetime(df[3]).dt.date
df[4] = pd.to_datetime(df[4]).dt.date
df[5] = df[5].astype(int)
df[6] = df[6].astype(float)
df[7] = df[7].astype(float)
df[8] = df[8].astype(float)
df[9] = df[9].astype(float)
df[10] = df[10].astype(float)
df[11] = df[11].astype(float)
df[12] = df[12].astype(float)
df[13] = df[13].astype(float)
df[14] = df[14].astype(float)
df[15] = df[15].astype(float)
df[16] = df[16].astype(str)
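The per-column assignments above can be collapsed into a single astype() call with a column-to-dtype mapping. The tiny three-column frame below is a made-up stand-in for the real file:

```python
import pandas as pd

# Made-up sample standing in for the real CSV columns
df = pd.DataFrame({0: ['a', 'b'], 1: ['1', '2'], 2: ['3.5', '4.5']})

# One call with a mapping instead of one assignment per column
df = df.astype({0: str, 1: int, 2: float})
print(df.dtypes)
```

Date columns still need pd.to_datetime separately, since astype does not parse date strings reliably.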

And now I am facing a new issue, which is handling a base64-encoded column, which I converted with

df[16] = df[16].astype(str)

A sample of my data is

b'

I set the delimiter as

df = pd.read_csv(outPutName, delimiter=';', skipinitialspace=True, engine='python', quotechar='"', header=None)

but the delimiter now is not ", and my string, being XML, contains a lot of special characters.

And the error is:

Traceback (most recent call last):
File "C:/saleh/Programs/Git/bi-python/Download_Bonsai_from_S3_to_DWH/Bonsai.py", line 93, in <module>
C.import_from_pandas(df.iloc[[1], [16]], 'BONSAI_APK')
File "......\pyexasol\connection.py", line 213, in import_from_pandas
return self.import_from_callback(cb.import_from_pandas, src, table, callback_params)
File "......\pyexasol\connection.py", line 300, in import_from_callback
raise sql_thread.exc
File "......\pyexasol\connection.py", line 290, in import_from_callback
sql_thread.join()
File "......\pyexasol\http_transport.py", line 53, in join
raise self.exc
File "......\pyexasol\http_transport.py", line 37, in run
self.run_sql()
File "......\pyexasol\http_transport.py", line 165, in run_sql
self.connection.execute(query)
File "......\pyexasol\connection.py", line 140, in execute
self.last_stmt = self.cls_statement(self, query, query_params)
File "......\pyexasol\statement.py", line 47, in __init__
self._execute()
File "......\pyexasol\statement.py", line 141, in _execute
'sqlText': self.query,
File "......\pyexasol\connection.py", line 442, in req
raise cls_err(self, req['sqlText'], ret['exception']['sqlCode'], ret['exception']['text'])
pyexasol.exceptions.ExaQueryError:
(
message => ETL-2105: Error while parsing row=0 (starting from 0) [CSV Parser found at byte 993 (starting with 0 at the beginning of the row) of 993 a single field delimiter in enclosed field or not correct enclosed field in file '000.csv'. Please check for unescaped field delimiters in data fields (they have to be escaped) and correct enclosing of this field] (Session: xxxxx)
dsn => xxxx
user => xxxx
schema => xxxx
code => xxxx
session_id => xxxx
query => xxxx
AT 'IP:port' FILE '000.csv'
)
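The ETL-2105 message above means a field contains the ';' delimiter (or an unescaped quote) without proper enclosure. Writing the frame with full quoting shows what the parser expects; the one-row XML frame below is a made-up example, not the author's data:

```python
import csv
import io

import pandas as pd

# Made-up row: an XML string containing both the delimiter and quotes
df = pd.DataFrame({0: ['<xml attr="a;b"/>']})

buf = io.StringIO()
# Enclose every field in quotes and double any embedded quotes, so a
# strict CSV parser can tell field delimiters from data
df.to_csv(buf, sep=';', index=False, header=False,
          quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True)
print(buf.getvalue())
```

pyexasol's import_from_pandas handles this quoting itself; the sketch only illustrates why an unquoted ';' inside a field breaks the server-side parse.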

@littleK0i (Collaborator)

Thank you.

In order to enable debugging, just add an extra connection option debug=True to pyexasol.connect():

Example:

C = pyexasol.connect(dsn='...', user='...', password='...', debug=True)

@cyroxx

cyroxx commented Mar 13, 2019

I had the same error. For me, downgrading to pandas 0.23.4 helped. So maybe there is an issue with pyexasol+pandas 0.24.x?

@Salehflaconi (Author)

@cyroxx, I will share my solution with you today.
@wildraid, there is another bug which I will also share with you today.

I created a workaround to complete my task, which I just finished!

@Salehflaconi (Author)

Salehflaconi commented Mar 14, 2019

The problem I faced was converting the data types from the pandas DataFrame and loading it into Exasol, as I couldn't add a new column to the DataFrame. It was a little bit strange.

That's why I added temp columns with random values at the beginning of my script and deleted them at the end, and it works!

I don't know the reason, but I will send the debug output to you soon.

This is part of my script:

#############################################################################################
# Author : saleh ahmed
# Date of creation : 11-03-2019
# Task description : Downloading the file from s3 and loading it into Exasol
#############################################################################################

import boto3
import botocore
import pyexasol
import exa_cred as exa
import Bucket_info as Bk_info
import pandas as pd
import base64
import numpy as np
import logging
import sys


# Settings for logging
logger = logging.getLogger('logger')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s [%(levelname)s] - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
ch.setFormatter(formatter)
logger.addHandler(ch)

logger.info("Start script ...")


#reading credential info and bucket info from local file
exuser = exa.cred['exuser']
expwd = exa.cred['expwd']
exdsn = exa.cred['exdsn']
schema=exa.cred['schema']

Bucket_name =Bk_info.Bucket['Bucket_name']
C = pyexasol.connect(dsn=exdsn, user=exuser, password=expwd,schema=schema)
logger.info(" Connected to the Exasol ")

#debug=True is to track the errors

#Bucket information
Bucket = Bucket_name
Key = "s3_TABLE.csv"
outPutName = "s3_TABLE.csv"

#Handling exception of not found and downloading the file locally
session = boto3.Session(
    aws_access_key_id='',
    aws_secret_access_key='',
)
s3 = session.resource('s3')
try:
    s3.Bucket(Bucket).download_file(Key, outPutName)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

#reading the data with pandas
df = pd.read_csv(outPutName, delimiter=';' , header=None,lineterminator='\n')
logger.info(" Pandas dataframe has been read ")

#CREATING A TEMP COLUMN TO WORK AROUND AN UNKNOWN ISSUE WHICH IS RELATED TO THE CONNECTION
sLength = len(df[1])
df[17] = pd.Series(np.random.randn(sLength), index=df.index)


# #checking DataTypes
# before = df.dtypes
# print(before)
# print(df.head(10))


#Changing the data types to be able to import it
df[0]= df[0].astype(str)
df[1]= df[1].astype(str)
df[2]= pd.to_numeric(df[2])
df[3] = pd.to_datetime(df[3]).dt.date
df[4] = pd.to_datetime(df[4]).dt.date
df[5] = df[5].astype(int)
df[6] = df[6].astype(float)
df[7] = df[7].astype(float)
df[8] = df[8].astype(float)
df[9] = df[9].astype(float)
df[10] = df[10].astype(float)
df[11] = df[11].astype(float)
df[12] = df[12].astype(float)
df[13] = df[13].astype(float)
df[14] = df[14].astype(float)
df[15] = df[15].astype(float)
df[16] = df[16].astype(str)
df[17] = df[17].astype(object)
logger.info(" Datatypes conversion has been successfully Completed    ")

#checking DataTypes
# after = df.dtypes
# print(after)
# print(df.head(10))


#Checking if the data is null or not and then decode it
#Replace the new decoded cell with the old value
for key, value in df[16].iteritems():
    if value == 'nan':
        # missing values become empty strings
        df.loc[key, 16] = ""
    else:
        # decode the base64 value, then turn the resulting bytes into a UTF-8 string
        df.loc[key, 16] = base64.b64decode(value)
        df.loc[key, 16] = df.loc[key, 16].decode("utf-8")

logger.info(" Decoding A COLUMN  ")
#print (df.to_csv(sep=';', index=False, header=False,quotechar='"'))
#df.to_csv('....csv',sep=';', index=False, header=False,quotechar='"')


#Creating a column for an unknown issue; I have to do that just to skip the problematic part.
#I will drop the column at the end of the process.

# cleanup tmp table
C.execute("TRUNCATE TABLE STAGE_FLAT_FILE_TMP.s3_TABLE;")
logger.info(" cleaning up the Stage Table   ")


sql_create_column = """ ADDING THE TEMP COLUMN PHYSICALLY TO THE TABLE """
C.execute(sql_create_column)

#Import from pandas DataFrame into Exasol table
C.import_from_pandas(df, 's3_TABLE')

#Adding the merge for both the new rows and the updated rows
MERGE_NEW_AND_UPDATED_ROWS = """MERGE INTO TABLE ..... """
C.execute(MERGE_NEW_AND_UPDATED_ROWS)

#Dropping the useless column
sql_drop_column= """ ALTER TABLE  "STAGE_FLAT_FILE_TMP"."s3_TABLE" DROP COLUMN TEMP_COLUMN """
C.execute(sql_drop_column)

logger.info(" Data has been inserted into STAGE_FLAT_FILE.s3_TABLE  ")

C.close()
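The row-by-row decoding loop in the script above can also be written as a single vectorized pass. This is a sketch on a made-up two-value series, not the author's data:

```python
import base64

import pandas as pd

# Made-up sample: one base64 value ('hello' encoded) and one missing value
s = pd.Series(['aGVsbG8=', None])

# Replace missing values with '' so they skip decoding, then decode
# each remaining value from base64 bytes to a UTF-8 string
decoded = s.fillna('').apply(
    lambda v: base64.b64decode(v).decode('utf-8') if v else '')
print(decoded.tolist())
```

Invalid base64 in real data would still raise here, just as in the original loop.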

@wildraid I couldn't send the first debug log, as I solved the issue and don't know how to reproduce the bug again.

@littleK0i (Collaborator)

littleK0i commented Mar 14, 2019

@Salehflaconi, @cyroxx, maybe it is related to this bug in pandas: pandas-dev/pandas#25048

Could you try versions 0.24.0, 0.24.1 and 0.24.2? Maybe it was fixed in the latest version.

I do not have easy access to a Windows machine at the moment.

@littleK0i (Collaborator)

@cyroxx, please try updating pyexasol to the latest version, 0.5.5. This should solve the problem for you.

@littleK0i (Collaborator)

littleK0i commented Mar 16, 2019

@Salehflaconi, I suspect you have a genuine mismatch in the number of columns between the DataFrame and the target table in Exasol. Try using the df.info() function to check the real number of columns before calling .import_from_pandas. I think you'll find extra columns, which probably come from the index or result from incorrect parsing of the source file.
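One common way such phantom columns appear: a trailing delimiter at the end of each line makes pandas parse one extra, all-NaN column. This sketch uses made-up inline data:

```python
import io

import pandas as pd

# Each row ends with a stray ';', so pandas sees 4 fields, not 3
raw = io.StringIO("a;b;1;\nx;y;2;\n")
df = pd.read_csv(raw, delimiter=';', header=None)

df.info()           # shows the real column count and dtypes
print(df.shape[1])  # 4: one extra, empty column from the trailing ';'
```

An import into a 3-column table would then fail on the column-count mismatch.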

@littleK0i (Collaborator)

@Salehflaconi , @cyroxx , hi guys.

Is this issue still relevant?
Thank you.
