Histogram binwidth #789

Merged
merged 6 commits into from
Aug 16, 2023
2 changes: 1 addition & 1 deletion CHANGELOG.md
Expand Up @@ -2,6 +2,7 @@

## 0.9.2dev

* [Feature] Add `--binwidth/-W` to ggplot histogram for specifying binwidth (#784)
* [Fix] Fix a bug that caused a cell with a CTE to fail if it referenced a table/view with the same name as an existing snippet (#753)

## 0.9.1 (2023-08-10)
Expand All @@ -12,7 +13,6 @@
* [Fix] Fix error when using SQL Server with pyodbc that caused queries to fail due to multiple open result sets
* [Fix] Improves performance when converting DuckDB results to `pandas.DataFrame`
* [Fix] Fixes a bug when converting a CTE stored with `--save` into a `pandas.DataFrame` via `.DataFrame()`
* [Doc] Add Redshift tutorial

## 0.9.0 (2023-08-01)

Expand Down
18 changes: 16 additions & 2 deletions doc/api/magic-plot.md
Expand Up @@ -5,7 +5,7 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.7
jupytext_version: 1.15.0
kernelspec:
display_name: Python 3 (ipykernel)
language: python
Expand Down Expand Up @@ -128,8 +128,14 @@ Shortcut: `%sqlplot hist`

`-B`/`--breaks` Custom bin intervals

`-W`/`--binwidth` Width of each bin

`-w`/`--with` Use a previously saved query as input data

```{note}
`-b/--bins`, `-B/--breaks`, and `-W/--binwidth` are mutually exclusive: specify at most one of them. If none is specified, the default value for `-b/--bins` is used.
```
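
The rule in the note can be sketched as a small standalone check. This is an illustration only, not the project's code: `check_single_bin_option` is a hypothetical name, and it tests `is not None`, whereas the `histogram` function changed in this PR filters the three arguments on truthiness.

```python
def check_single_bin_option(bins=None, breaks=None, binwidth=None):
    """Return the names of the bin options that were set.

    Raises ValueError when more than one of them is given.
    Hypothetical helper, written only to illustrate the rule.
    """
    options = {"bins": bins, "breaks": breaks, "binwidth": binwidth}
    specified = [name for name, value in options.items() if value is not None]
    if len(specified) > 1:
        raise ValueError(
            f"{', '.join(specified)} are specified. "
            "You can only specify one of them."
        )
    return specified
```

Calling it with a single option, for example `check_single_bin_option(binwidth=150)`, returns the list of options in use; passing two or more raises the error.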

+++

Histogram supports NULL values by skipping them. Now we can
Expand All @@ -155,12 +161,20 @@ When plotting a histogram, it divides a range with the number of bins - 1 to cal

### Specifying breaks

Breaks allow you to set custom intervals for a histogram. You can specify breaks by passing desired each end and break points separated by whitespace after `-B/--breaks`. Since those break points define a range of data points to plot, bar width, and number of bars in a histogram, make sure to pass more than 1 point that is strictly increasing and includes at least one data point. Note that using both `-b/--bins` and `-B/--breaks` isn't allowed.
Breaks allow you to set custom intervals for a histogram. You can specify breaks by passing the desired end points and break points, separated by whitespace, after `-B/--breaks`. Since those break points define the range of data points to plot, the bar widths, and the number of bars in a histogram, make sure to pass more than one point, in strictly increasing order, so that the resulting range includes at least one data point.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --breaks 3200 3400 3600 3800 4000 4200 4400 4600 4800
```

### Specifying binwidth

Binwidth allows you to set the width of bins in a histogram. To specify the binwidth, pass the desired width after `-W/--binwidth`. Since the binwidth determines how much detail of the distribution is visible, pass a positive numeric value suited to your data.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --binwidth 150
```
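
To see how a binwidth assigns values to bins, here is a minimal pure-Python sketch of the expression used in this PR's SQL template, `ceiling(x / bin_size - 0.5) * bin_size`, which rounds half down so that intervals are right-closed. The function names and the sample values are made up for illustration; the real computation runs in the database.

```python
import math
from collections import Counter


def bin_center(value, binwidth):
    """Label a value with its bin using round-half-down.

    Mirrors ceiling(x / bin_size - 0.5) * bin_size from the SQL template.
    """
    return math.ceil(value / binwidth - 0.5) * binwidth


def histogram_counts(values, binwidth):
    """Count values per bin, skipping None entries (NULLs)."""
    if binwidth <= 0:
        raise ValueError("binwidth must be a positive number")
    counts = Counter(bin_center(v, binwidth) for v in values if v is not None)
    return dict(sorted(counts.items()))


# Illustrative body-mass values in grams, including one missing value
sample = [3250, 3300, 3400, 3475, None, 3800]
print(histogram_counts(sample, 150))
```

With a binwidth of 150, the bin labelled 3300 covers the right-closed interval (3225, 3375]: a value of 3375 falls in it, while 3225 falls in the bin below.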

### Multiple columns

```{code-cell} ipython3
Expand Down
9 changes: 8 additions & 1 deletion src/sql/ggplot/geom/geom_histogram.py
Expand Up @@ -21,13 +21,19 @@ class geom_histogram(geom):

breaks : list
Divide bins with custom intervals

binwidth : int or float
Width of each bin
"""

def __init__(self, bins=None, fill=None, cmap=None, breaks=None, **kwargs):
def __init__(
self, bins=None, fill=None, cmap=None, breaks=None, binwidth=None, **kwargs
):
self.bins = bins
self.fill = fill
self.cmap = cmap
self.breaks = breaks
self.binwidth = binwidth
super().__init__(**kwargs)

@telemetry.log_call("ggplot-histogram")
Expand All @@ -45,5 +51,6 @@ def draw(self, gg, ax=None, facet=None):
facet=facet,
ax=ax or gg.axs[0],
breaks=self.breaks,
binwidth=self.binwidth,
)
return gg
15 changes: 11 additions & 4 deletions src/sql/magic_plot.py
Expand Up @@ -66,6 +66,12 @@ class SqlPlotMagic(Magics, Configurable):
nargs="+",
help="Histogram breaks",
)
@argument(
"-W",
"--binwidth",
type=float,
help="Histogram binwidth",
)
@modify_exceptions
def execute(self, line="", cell="", local_ns=None):
"""
Expand Down Expand Up @@ -110,13 +116,13 @@ def execute(self, line="", cell="", local_ns=None):
conn=None,
)
elif cmd.args.line[0] in {"hist", "histogram"}:
# to avoid passing bins default value when breaks are given by a user
# to avoid passing bins default value when breaks or binwidth is specified
bin_specified = " --bins " in line or " -b " in line
breaks_specified = " --breaks " in line or " -B " in line
if breaks_specified and not bin_specified:
binwidth_specified = " --binwidth " in line or " -W " in line
bins = cmd.args.bins
if not bin_specified and any([breaks_specified, binwidth_specified]):
bins = None
else:
bins = cmd.args.bins

return plot.histogram(
table=table,
Expand All @@ -125,6 +131,7 @@ def execute(self, line="", cell="", local_ns=None):
with_=with_,
conn=None,
breaks=cmd.args.breaks,
binwidth=cmd.args.binwidth,
)
elif cmd.args.line[0] in {"bar"}:
return plot.bar(
Expand Down
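
The default-handling logic in the `hist`/`histogram` branch above can be isolated into a small function for illustration. `resolve_bins` is a hypothetical name, and the value 50 used in the example calls is only a placeholder for whatever default the argument parser supplies.

```python
def resolve_bins(line, parsed_bins):
    """Decide which bins value to forward to the plotting function.

    The parser always fills in a default for bins, so the raw magic line is
    inspected to find out whether the user actually typed the flag.
    """
    bin_specified = " --bins " in line or " -b " in line
    breaks_specified = " --breaks " in line or " -B " in line
    binwidth_specified = " --binwidth " in line or " -W " in line
    if not bin_specified and (breaks_specified or binwidth_specified):
        # Drop the parser default so it does not clash with breaks/binwidth
        return None
    return parsed_bins


# 50 stands in for the parser's default number of bins (assumed value)
print(resolve_bins("histogram --table t --column x --binwidth 150", 50))
print(resolve_bins("histogram --table t --column x --bins 20", 20))
```

Without this step, the default number of bins would always be passed along and the mutual-exclusivity check in `histogram` would reject every call that uses `--breaks` or `--binwidth`.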
84 changes: 72 additions & 12 deletions src/sql/plot.py
Expand Up @@ -8,6 +8,7 @@

from sql import exceptions, display
from sql.stats import _summary_stats
from sql.util import pretty_print

try:
import matplotlib.pyplot as plt
Expand Down Expand Up @@ -268,7 +269,7 @@ def _are_numeric_values(*values):
return all([isinstance(value, (int, float)) for value in values])


def _get_bar_width(ax, bins, bin_size):
def _get_bar_width(ax, bins, bin_size, binwidth):
"""
Return a single bar width based on number of bins
or a list of bar widths if `breaks` is given.
Expand All @@ -286,13 +287,18 @@ def _get_bar_width(ax, bins, bin_size):
Calculated bin_size from the _histogram function
or from consecutive differences in `breaks`

binwidth : int or float or None
Specified binwidth from a user

Returns
-------
width : float
A single bar width
"""
if _are_numeric_values(bin_size) or isinstance(bin_size, list):
width = bin_size
elif _are_numeric_values(binwidth):
width = binwidth
else:
fig = plt.gcf()
bbox = ax.get_window_extent()
Expand All @@ -318,6 +324,7 @@ def histogram(
ax=None,
facet=None,
breaks=None,
binwidth=None,
):
"""Plot histogram

Expand Down Expand Up @@ -371,10 +378,34 @@ def histogram(
f"Breaks given : {breaks}. When using breaks, please ensure that "
"breaks are strictly increasing."
)
if bins:

if _are_numeric_values(binwidth):
if binwidth <= 0:
raise exceptions.ValueError(
"Both bins and breaks are specified. Must specify only one of them."
f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
"pass a positive value."
)
binwidth = float(binwidth)
elif binwidth is None:
pass
else:
raise exceptions.ValueError(
f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
"pass a numeric value."
)

specified_args = [
args
for args, specified in zip(
["bins", "breaks", "binwidth"], [bins, breaks, binwidth]
)
if specified
]
if len(specified_args) > 1:
raise exceptions.ValueError(
f"{pretty_print(specified_args)} are specified. "
"You can only specify one of them."
)

ax = ax or plt.gca()
payload["connection_info"] = conn._get_database_information()
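
The binwidth validation added in this hunk can be read as the following standalone sketch. `validate_binwidth` is a hypothetical name, it raises the built-in `ValueError` rather than the project's `exceptions.ValueError`, and it rejects booleans explicitly, which is an assumption on top of the diff (there, `isinstance(value, (int, float))` also accepts `True` and `False`).

```python
def validate_binwidth(binwidth):
    """Return binwidth as a float, or None if it was not given.

    Raises ValueError for non-numeric or non-positive values.
    """
    if binwidth is None:
        return None
    # bool is a subclass of int, so it is excluded explicitly (assumption)
    if isinstance(binwidth, bool) or not isinstance(binwidth, (int, float)):
        raise ValueError(
            f"Binwidth given : {binwidth}. Please pass a numeric value."
        )
    if binwidth <= 0:
        raise ValueError(
            f"Binwidth given : {binwidth}. Please pass a positive value."
        )
    return float(binwidth)
```

Normalizing to `float` up front means later code can divide by the binwidth without caring whether the caller passed an `int` or a `float`.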
Expand All @@ -393,9 +424,15 @@ def histogram(
raise ValueError("Column name has not been specified")

bin_, height, bin_size = _histogram(
table, column, bins, with_=with_, conn=conn, breaks=breaks
table,
column,
bins,
with_=with_,
conn=conn,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)
data = _histogram_stacked(
table,
column,
Expand All @@ -406,6 +443,7 @@ def histogram(
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
cmap = plt.get_cmap(cmap or "viridis")
norm = Normalize(vmin=0, vmax=len(data))
Expand Down Expand Up @@ -449,9 +487,16 @@ def histogram(
ax.legend(handles[::-1], labels[::-1])
elif isinstance(column, str):
bin_, height, bin_size = _histogram(
table, column, bins, with_=with_, conn=conn, facet=facet, breaks=breaks
table,
column,
bins,
with_=with_,
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)

ax.bar(
bin_,
Expand All @@ -472,9 +517,16 @@ def histogram(
)
for i, col in enumerate(column):
bin_, height, bin_size = _histogram(
table, col, bins, with_=with_, conn=conn, facet=facet, breaks=breaks
table,
col,
bins,
with_=with_,
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)

if isinstance(color, list):
color_ = color[i]
Expand Down Expand Up @@ -505,7 +557,9 @@ def histogram(


@modify_exceptions
def _histogram(table, column, bins, with_=None, conn=None, facet=None, breaks=None):
def _histogram(
table, column, bins, with_=None, conn=None, facet=None, breaks=None, binwidth=None
):
"""Compute bins and heights"""
if not conn:
conn = sql.connection.ConnectionManager.current
Expand Down Expand Up @@ -576,15 +630,18 @@ def _histogram(table, column, bins, with_=None, conn=None, facet=None, breaks=No
query = template.render(
table=table, column=column, filter_query=filter_query
)
elif not isinstance(bins, int):
elif not binwidth and not isinstance(bins, int):
raise ValueError(
f"bins are '{bins}'. Please specify a valid number of bins."
)
else:
# Use bins - 1 instead of bins and round half down instead of floor
# to mimic right-closed histogram intervals in R ggplot
range_ = max_ - min_
bin_size = range_ / (bins - 1)
if binwidth:
bin_size = binwidth
else:
bin_size = range_ / (bins - 1)
template_ = """
select
ceiling("{{column}}"/{{bin_size}} - 0.5)*{{bin_size}} as bin,
Expand Down Expand Up @@ -638,6 +695,7 @@ def _histogram_stacked(
conn=None,
facet=None,
breaks=None,
binwidth=None,
):
"""Compute the corresponding heights of each bin based on the category"""
if not conn:
Expand All @@ -654,6 +712,8 @@ def _histogram_stacked(
cases.append(case)
cases[0] = cases[0].replace(">", ">=", 1)
else:
if binwidth:
bin_size = binwidth
tolerance = bin_size / 1000 # Use to avoid floating point error
for bin in bins:
# Use round half down instead of floor to mimic
Expand Down
5 changes: 5 additions & 0 deletions src/tests/integration/test_generic_db_operations.py
Expand Up @@ -292,12 +292,17 @@ def test_telemetry_execute_command_has_connection_info(
"%sqlplot histogram --with plot_something_subset --table\
plot_something_subset --column x --breaks 0 2 3 4 5"
),
(
"%sqlplot histogram --with plot_something_subset --table\
plot_something_subset --column x --binwidth 1"
),
],
ids=[
"histogram",
"hist",
"histogram-bins",
"histogram-breaks",
"histogram-binwidth",
],
)
@pytest.mark.parametrize(
Expand Down