Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with sqlplot command for column names in a duckdb SQL database that contain spaces in the name #409

Closed
jaanli opened this issue Apr 17, 2023 · 7 comments · Fixed by #424
Assignees

Comments

@jaanli
Copy link

jaanli commented Apr 17, 2023

Hi all! This is an amazing tool I am using to teach: https://github.com/onefact/datathinking.org-codespace/blob/main/notebooks/in-class-notebooks/230417-debugging-duckdb-and-cleaning-data-with-dbt-and-reading-and-standardizing-json.ipynb

(Scroll to bottom for example boxplot)

However, I am unable to plot the columns that contain spaces in column names, such as %sqlplot boxplot --table service_requests --column 'Taxi Pick Up Location'.

Is there a way to do this or do we need to rename all the columns? (tedious, error-prone)

Tried looking this up in the docs but couldn't find it: https://jupysql.ploomber.io/en/latest/plot.html

Using this to teach a 150+ course at UPenn and University of Tartu called datathinking.org so it would be nice in case this is supported!

cc @mictadiello

@tonykploomber tonykploomber self-assigned this Apr 17, 2023
@tonykploomber
Copy link

Looking at this

Would you able to provide me the file: 311_Service_Requests_from_2010_to_Present.csv?

@edublancas
Copy link

@tonykploomber: I think we can use any csv dataset and just rename any column and add spaces to debug

@jaanli
Copy link
Author

jaanli commented Apr 19, 2023

Here's the dataset @tonykploomber @edublancas - https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

Direct download:

https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD

Is that helpful? Please let me know if you have any other questions to replicate!

For loading the data, here is the specific code we wrote with the class:

%%sql
SELECT *
FROM read_csv('/Users/me/Downloads/311_Service_Requests_from_2010_to_Present.csv',
    header=True,
    delim=',',
    quote='"',
    columns={'Unique Key': 'BIGINT', 'Created Date': 'VARCHAR', 'Closed Date': 'VARCHAR', 'Agency': 'VARCHAR', 'Agency Name': 'VARCHAR', 'Complaint Type': 'VARCHAR', 'Descriptor': 'VARCHAR', 'Location Type': 'VARCHAR', 'Incident Zip': 'VARCHAR', 'Incident Address': 'VARCHAR', 'Street Name': 'VARCHAR', 'Cross Street 1': 'VARCHAR', 'Cross Street 2': 'VARCHAR', 'Intersection Street 1': 'VARCHAR', 'Intersection Street 2': 'VARCHAR', 'Address Type': 'VARCHAR', 'City': 'VARCHAR', 'Landmark': 'VARCHAR', 'Facility Type': 'VARCHAR', 'Status': 'VARCHAR', 'Due Date': 'VARCHAR', 'Resolution Description': 'VARCHAR', 'Resolution Action Updated Date': 'VARCHAR', 'Community Board': 'VARCHAR', 'BBL': 'VARCHAR', 'Borough': 'VARCHAR', 'X Coordinate (State Plane)': 'VARCHAR', 'Y Coordinate (State Plane)': 'VARCHAR', 'Open Data Channel Type': 'VARCHAR', 'Park Facility Name': 'VARCHAR', 'Park Borough': 'VARCHAR', 'Vehicle Type': 'VARCHAR', 'Taxi Company Borough': 'VARCHAR', 'Taxi Pick Up Location': 'VARCHAR', 'Bridge Highway Name': 'VARCHAR', 'Bridge Highway Direction': 'VARCHAR', 'Road Ramp': 'VARCHAR', 'Bridge Highway Segment': 'VARCHAR', 'Latitude': 'DOUBLE', 'Longitude': 'DOUBLE', 'Location': 'VARCHAR'}) 
LIMIT 1000000;

Test to plot with jupysql:

%sqlplot boxplot --table service_requests --column longitude

Result:

image

Test that breaks:

%sqlplot histogram --table service_requests --column 'Incident Zip'

Result:

---------------------------------------------------------------------------
BinderException                           Traceback (most recent call last)
File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1899     if not evt_handled:
-> 1900         self.dialect.do_execute(
   1901             cursor, statement, parameters, context
   1902         )
   1904 if self._has_events or self.engine._has_events:

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/duckdb_engine/__init__.py:142, in ConnectionWrapper.execute(self, statement, parameters, context)
    141     else:
--> 142         self.__c.execute(statement, parameters)
    143 except RuntimeError as e:

BinderException: Binder Error: Referenced column "'Incident Zip'" not found in FROM clause!
Candidate bindings: "service_requests.Incident Zip"
LINE 3:     MIN("'Incident Zip'"),
                ^

The above exception was the direct cause of the following exception:

ProgrammingError                          Traceback (most recent call last)
Cell In[5], line 1
----> 1 get_ipython().run_line_magic('sqlplot', "histogram --table service_requests --column 'Incident Zip'")

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2369, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2367     kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2368 with self.builtin_trap:
-> 2369     result = fn(*args, **kwargs)
   2371 # The code below prevents the output from being displayed
   2372 # when using magics with decodator @output_can_be_silenced
   2373 # when the last Python token in the expression is a ';'.
   2374 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/ploomber_core/exceptions.py:95, in modify_exceptions.<locals>.wrapper(*args, **kwargs)
     92 @wraps(fn)
     93 def wrapper(*args, **kwargs):
     94     try:
---> 95         return fn(*args, **kwargs)
     96     except (ValueError, TypeError) as e:
     97         if COMMUNITY not in e.args[0]:

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sql/magic_plot.py:79, in SqlPlotMagic.execute(self, line, cell, local_ns)
     71     return plot.boxplot(
     72         table=cmd.args.table,
     73         column=column,
   (...)
     76         conn=None,
     77     )
     78 elif cmd.args.line[0] in {"hist", "histogram"}:
---> 79     return plot.histogram(
     80         table=cmd.args.table,
     81         column=column,
     82         bins=cmd.args.bins,
     83         with_=cmd.args.with_,
     84         conn=None,
     85     )
     86 else:
     87     raise ValueError(
     88         f"Unknown plot {cmd.args.line[0]!r}. Must be: 'histogram' or 'boxplot'"
     89     )

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/ploomber_core/dependencies.py:41, in requires.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     33 @wraps(f)
     34 def wrapper(*args, **kwargs):
     35     check_installed(
     36         pkgs=pkgs,
     37         name=name or f.__name__,
     38         extra_msg=extra_msg,
     39         pip_names=pip_names,
     40     )
---> 41     return f(*args, **kwargs)

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/ploomber_core/telemetry/telemetry.py:742, in Telemetry.log_call.<locals>._log_call.<locals>.wrapper(*args, **kwargs)
    740         result = func(*injected_args, **kwargs)
    741     else:
--> 742         result = func(_payload, *args, **kwargs)
    743 else:
    744     result = func(*args, **kwargs)

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sql/plot.py:331, in histogram(payload, table, column, bins, with_, conn)
    329 payload["connection_info"] = sql.connection.Connection._get_curr_connection_info()
    330 if isinstance(column, str):
--> 331     bin_, height = _histogram(table, column, bins, with_=with_, conn=conn)
    332     ax.bar(bin_, height, align="center", width=bin_[-1] - bin_[-2])
    333     ax.set_title(f"{column!r} from {table!r}")

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/ploomber_core/exceptions.py:95, in modify_exceptions.<locals>.wrapper(*args, **kwargs)
     92 @wraps(fn)
     93 def wrapper(*args, **kwargs):
     94     try:
---> 95         return fn(*args, **kwargs)
     96     except (ValueError, TypeError) as e:
     97         if COMMUNITY not in e.args[0]:

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sql/plot.py:361, in _histogram(table, column, bins, with_, conn)
    358     conn = sql.connection.Connection.current.session
    360 # FIXME: we're computing all the with elements twice
--> 361 min_, max_ = _min_max(conn, table, column, with_=with_)
    362 range_ = max_ - min_
    363 bin_size = range_ / bins

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sql/plot.py:284, in _min_max(con, table, column, with_)
    281 if with_:
    282     query = str(store.render(query, with_=with_))
--> 284 min_, max_ = con.execute(query).fetchone()
    285 return min_, max_

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1365, in Connection.execute(self, statement, *multiparams, **params)
   1356 if isinstance(statement, util.string_types):
   1357     util.warn_deprecated_20(
   1358         "Passing a string to Connection.execute() is "
   1359         "deprecated and will be removed in version 2.0.  Use the "
   (...)
   1362         "driver-level SQL string."
   1363     )
-> 1365     return self._exec_driver_sql(
   1366         statement,
   1367         multiparams,
   1368         params,
   1369         _EMPTY_EXECUTION_OPTS,
   1370         future=False,
   1371     )
   1373 try:
   1374     meth = statement._execute_on_connection

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1669, in Connection._exec_driver_sql(self, statement, multiparams, params, execution_options, future)
   1659         (
   1660             statement,
   1661             distilled_params,
   (...)
   1665             statement, distilled_parameters, execution_options
   1666         )
   1668 dialect = self.dialect
-> 1669 ret = self._execute_context(
   1670     dialect,
   1671     dialect.execution_ctx_cls._init_statement,
   1672     statement,
   1673     distilled_parameters,
   1674     execution_options,
   1675     statement,
   1676     distilled_parameters,
   1677 )
   1679 if not future:
   1680     if self._has_events or self.engine._has_events:

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1943, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1940             branched.close()
   1942 except BaseException as e:
-> 1943     self._handle_dbapi_exception(
   1944         e, statement, parameters, cursor, context
   1945     )
   1947 return result

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:2124, in Connection._handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   2122     util.raise_(newraise, with_traceback=exc_info[2], from_=e)
   2123 elif should_wrap:
-> 2124     util.raise_(
   2125         sqlalchemy_exception, with_traceback=exc_info[2], from_=e
   2126     )
   2127 else:
   2128     util.raise_(exc_info[1], with_traceback=exc_info[2])

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/util/compat.py:211, in raise_(***failed resolving arguments***)
    208     exception.__cause__ = replace_context
    210 try:
--> 211     raise exception
    212 finally:
    213     # credit to
    214     # https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
    215     # as the __traceback__ object creates a cycle
    216     del exception, replace_context, from_, with_traceback

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/base.py:1900, in Connection._execute_context(self, dialect, constructor, statement, parameters, execution_options, *args, **kw)
   1898                 break
   1899     if not evt_handled:
-> 1900         self.dialect.do_execute(
   1901             cursor, statement, parameters, context
   1902         )
   1904 if self._has_events or self.engine._has_events:
   1905     self.dispatch.after_cursor_execute(
   1906         self,
   1907         cursor,
   (...)
   1911         context.executemany,
   1912     )

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/sqlalchemy/engine/default.py:736, in DefaultDialect.do_execute(self, cursor, statement, parameters, context)
    735 def do_execute(self, cursor, statement, parameters, context=None):
--> 736     cursor.execute(statement, parameters)

File /opt/homebrew/anaconda3/envs/datathinking.org/lib/python3.11/site-packages/duckdb_engine/__init__.py:142, in ConnectionWrapper.execute(self, statement, parameters, context)
    140         self.__c.execute(statement)
    141     else:
--> 142         self.__c.execute(statement, parameters)
    143 except RuntimeError as e:
    144     if e.args[0].startswith("Not implemented Error"):

ProgrammingError: (duckdb.BinderException) Binder Error: Referenced column "'Incident Zip'" not found in FROM clause!
Candidate bindings: "service_requests.Incident Zip"
LINE 3:     MIN("'Incident Zip'"),
                ^
[SQL: 
SELECT
    MIN("'Incident Zip'"),
    MAX("'Incident Zip'")
FROM "service_requests"]
(Background on this error at: https://sqlalche.me/e/14/f405)

Full notebook: https://github.com/onefact/datathinking.org-codespace/blob/main/notebooks/in-class-notebooks/230417-debugging-duckdb-and-cleaning-data-with-dbt-and-reading-and-standardizing-json.ipynb

@idomic
Copy link

idomic commented Apr 19, 2023

My bet it's in the space Incident Zip :)

@edublancas
Copy link

I see the problem:

When executing:

%sqlplot boxplot --table table_name --column column_name

We take the column_name and wrap it into double quotes (or backticks, depending on the database vendor); however, if the column name has spaces, users need to execute:

%sqlplot boxplot --table service_requests --column 'column name'

This causes the SQL query to double wrap 'column name'; breaking the query.

The solution is to sanitize the column parameter, and remove any existing quotes.

@idomic
Copy link

idomic commented Apr 21, 2023

Pending related windows issue: #425

@edublancas
Copy link

hey @jaanli: can you give the version in master a try and let us know if it works? if so, we can make a release!

pip install git+https://github.com/ploomber/jupysql

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants