Histogram binwidth #789

Merged
merged 6 commits into from
Aug 16, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

## 0.10.0dev

* [Feature] Add `--binwidth/-W` to ggplot histogram for specifying binwidth (#784)
* [Feature] Add `%sqlcmd profile` support for DBAPI connections (#743)
* [Fix] Perform `ROLLBACK` when SQLAlchemy raises `PendingRollbackError`
* [Fix] Perform `ROLLBACK` when `psycopg2` raises `current transaction is aborted, commands ignored until end of transaction block`
Expand Down
22 changes: 19 additions & 3 deletions doc/api/magic-plot.md
Expand Up @@ -5,7 +5,7 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.7
jupytext_version: 1.15.0
kernelspec:
display_name: Python 3 (ipykernel)
language: python
Expand Down Expand Up @@ -128,8 +128,14 @@ Shortcut: `%sqlplot hist`

`-B`/`--breaks` Custom bin intervals

`-W`/`--binwidth` Width of each bin

`-w`/`--with` Use a previously saved query as input data

```{note}
`-b/--bins`, `-B/--breaks`, and `-W/--binwidth` are mutually exclusive: specify at most one of them. If none is specified, the default value of `-b/--bins` is used.
```

+++

Histogram supports NULL values by skipping them. Now we can
Expand All @@ -147,20 +153,30 @@ When plotting a histogram, it divides a range with the number of bins - 1 to cal

+++

### Number of bins
### Specifying bins

`-b/--bins` sets the number of bins in a histogram. It is useful when you are interested in the overall distribution.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --bins 100
```

### Specifying breaks

Breaks allow you to set custom intervals for a histogram. You can specify breaks by passing desired each end and break points separated by whitespace after `-B/--breaks`. Since those break points define a range of data points to plot, bar width, and number of bars in a histogram, make sure to pass more than 1 point that is strictly increasing and includes at least one data point. Note that using both `-b/--bins` and `-B/--breaks` isn't allowed.
Breaks allow you to set custom intervals for a histogram. This is useful when you want to view the distribution within a specific range. Specify breaks by passing the end points and the break points between them, separated by whitespace, after `-B/--breaks`. Because these points define the range of data to plot, the bar widths, and the number of bars in the histogram, pass at least two points in strictly increasing order, and make sure the range they cover includes at least one data point.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --breaks 3200 3400 3600 3800 4000 4200 4400 4600 4800
```

### Specifying binwidth

Binwidth allows you to set the width of the bins in a histogram. It is useful when you want to control the granularity of the histogram directly. To specify the binwidth, pass the desired width after `-W/--binwidth`. Because the binwidth determines how much detail of the distribution is visible, pass a positive numeric value suited to your data.

```{code-cell} ipython3
%sqlplot histogram --table penguins.csv --column body_mass_g --binwidth 150
```

### Multiple columns

```{code-cell} ipython3
Expand Down
9 changes: 8 additions & 1 deletion src/sql/ggplot/geom/geom_histogram.py
Expand Up @@ -21,13 +21,19 @@ class geom_histogram(geom):

breaks : list
Divide bins with custom intervals

binwidth : int or float
Width of each bin
"""

def __init__(self, bins=None, fill=None, cmap=None, breaks=None, **kwargs):
def __init__(
self, bins=None, fill=None, cmap=None, breaks=None, binwidth=None, **kwargs
):
self.bins = bins
self.fill = fill
self.cmap = cmap
self.breaks = breaks
self.binwidth = binwidth
super().__init__(**kwargs)

@telemetry.log_call("ggplot-histogram")
Expand All @@ -45,5 +51,6 @@ def draw(self, gg, ax=None, facet=None):
facet=facet,
ax=ax or gg.axs[0],
breaks=self.breaks,
binwidth=self.binwidth,
)
return gg
15 changes: 11 additions & 4 deletions src/sql/magic_plot.py
Expand Up @@ -66,6 +66,12 @@ class SqlPlotMagic(Magics, Configurable):
nargs="+",
help="Histogram breaks",
)
@argument(
"-W",
"--binwidth",
type=float,
help="Histogram binwidth",
)
@modify_exceptions
def execute(self, line="", cell="", local_ns=None):
"""
Expand Down Expand Up @@ -110,13 +116,13 @@ def execute(self, line="", cell="", local_ns=None):
conn=None,
)
elif cmd.args.line[0] in {"hist", "histogram"}:
# to avoid passing bins default value when breaks are given by a user
# to avoid passing bins default value when breaks or binwidth is specified
bin_specified = " --bins " in line or " -b " in line
breaks_specified = " --breaks " in line or " -B " in line
if breaks_specified and not bin_specified:
binwidth_specified = " --binwidth " in line or " -W " in line
bins = cmd.args.bins
if not bin_specified and any([breaks_specified, binwidth_specified]):
bins = None
else:
bins = cmd.args.bins

return plot.histogram(
table=table,
Expand All @@ -125,6 +131,7 @@ def execute(self, line="", cell="", local_ns=None):
with_=with_,
conn=None,
breaks=cmd.args.breaks,
binwidth=cmd.args.binwidth,
)
elif cmd.args.line[0] in {"bar"}:
return plot.bar(
Expand Down
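The flag detection in the `execute` hunk above can be sketched in isolation. This is a minimal sketch rather than the project's code: `resolve_bins` and its parameter names are hypothetical, and the value `50` in the usage lines is only a placeholder for whatever default the argument parser supplies. It illustrates why `bins` is reset to `None` when only `--breaks` or `--binwidth` appears on the line.

```python
def resolve_bins(line, parsed_bins):
    """Return the bins value to forward to the plotting function.

    `line` is the raw magic line; `parsed_bins` is what the argument parser
    produced (a default when the user did not pass -b/--bins).
    Hypothetical helper, for illustration only.
    """
    # Substring checks mirror the diff: a flag counts as specified only when
    # it appears with a space on each side in the raw line.
    bin_specified = " --bins " in line or " -b " in line
    breaks_specified = " --breaks " in line or " -B " in line
    binwidth_specified = " --binwidth " in line or " -W " in line

    # Drop the parser default so it does not collide with breaks/binwidth.
    if not bin_specified and (breaks_specified or binwidth_specified):
        return None
    return parsed_bins


# 50 is a placeholder default, not a value taken from the project.
only_binwidth = resolve_bins("histogram --table t --column x --binwidth 150", 50)
no_flags = resolve_bins("histogram --table t --column x", 50)
both = resolve_bins("histogram --table t --column x --bins 10 -W 150", 10)
```

In this sketch, passing both `--bins` and `-W` keeps the `bins` value, so the conflict is left for the later mutual-exclusion check to report. Because the test looks for a space on each side of the flag, a form such as `--binwidth=150` would not be detected by these substring checks.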
83 changes: 67 additions & 16 deletions src/sql/plot.py
Expand Up @@ -8,6 +8,8 @@

from sql import exceptions, display
from sql.stats import _summary_stats
from sql.util import _are_numeric_values, validate_mutually_exclusive_args
from sql.display import message

try:
import matplotlib.pyplot as plt
Expand Down Expand Up @@ -264,11 +266,7 @@ def _min_max(con, table, column, with_=None, use_backticks=False):
return min_, max_


def _are_numeric_values(*values):
return all([isinstance(value, (int, float)) for value in values])


def _get_bar_width(ax, bins, bin_size):
def _get_bar_width(ax, bins, bin_size, binwidth):
"""
Return a single bar width based on number of bins
or a list of bar widths if `breaks` is given.
Expand All @@ -286,13 +284,18 @@ def _get_bar_width(ax, bins, bin_size):
Calculated bin_size from the _histogram function
or from consecutive differences in `breaks`

binwidth : int or float or None
Specified binwidth from a user

Returns
-------
width : float
A single bar width
"""
if _are_numeric_values(bin_size) or isinstance(bin_size, list):
width = bin_size
elif _are_numeric_values(binwidth):
width = binwidth
else:
fig = plt.gcf()
bbox = ax.get_window_extent()
Expand All @@ -318,6 +321,7 @@ def histogram(
ax=None,
facet=None,
breaks=None,
binwidth=None,
):
"""Plot histogram

Expand Down Expand Up @@ -371,10 +375,23 @@ def histogram(
f"Breaks given : {breaks}. When using breaks, please ensure that "
"breaks are strictly increasing."
)
if bins:

if _are_numeric_values(binwidth):
if binwidth <= 0:
raise exceptions.ValueError(
"Both bins and breaks are specified. Must specify only one of them."
f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
"pass a positive value."
)
binwidth = float(binwidth)
elif binwidth is not None:
raise exceptions.ValueError(
f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
"pass a numeric value."
)

validate_mutually_exclusive_args(
["bins", "breaks", "binwidth"], [bins, breaks, binwidth]
)

ax = ax or plt.gca()
payload["connection_info"] = conn._get_database_information()
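The binwidth checks in this hunk can be exercised on their own with a small sketch. `check_binwidth` and `rejected` are hypothetical names, and the built-in `ValueError` stands in for the project's `exceptions.ValueError`; the control flow follows the diff (numeric check, positivity check, conversion to `float`).

```python
def check_binwidth(binwidth):
    """Validate a user-supplied binwidth and return it as a float (or None).

    Hypothetical stand-alone version of the checks shown in the diff.
    """
    if binwidth is None:
        # Not specified: nothing to validate.
        return None
    if not isinstance(binwidth, (int, float)):
        raise ValueError(
            f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
            "pass a numeric value."
        )
    if binwidth <= 0:
        raise ValueError(
            f"Binwidth given : {binwidth}. When using binwidth, please ensure to "
            "pass a positive value."
        )
    return float(binwidth)


def rejected(value):
    """Return True when check_binwidth raises ValueError for `value`."""
    try:
        check_binwidth(value)
    except ValueError:
        return True
    return False
```

With this sketch, `check_binwidth(150)` returns `150.0`, while `0`, `-1`, and the string `"150"` are all rejected.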
Expand All @@ -393,9 +410,15 @@ def histogram(
raise ValueError("Column name has not been specified")

bin_, height, bin_size = _histogram(
table, column, bins, with_=with_, conn=conn, breaks=breaks
table,
column,
bins,
with_=with_,
conn=conn,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)
data = _histogram_stacked(
table,
column,
Expand All @@ -406,6 +429,7 @@ def histogram(
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
cmap = plt.get_cmap(cmap or "viridis")
norm = Normalize(vmin=0, vmax=len(data))
Expand Down Expand Up @@ -449,9 +473,16 @@ def histogram(
ax.legend(handles[::-1], labels[::-1])
elif isinstance(column, str):
bin_, height, bin_size = _histogram(
table, column, bins, with_=with_, conn=conn, facet=facet, breaks=breaks
table,
column,
bins,
with_=with_,
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)

ax.bar(
bin_,
Expand All @@ -472,9 +503,16 @@ def histogram(
)
for i, col in enumerate(column):
bin_, height, bin_size = _histogram(
table, col, bins, with_=with_, conn=conn, facet=facet, breaks=breaks
table,
col,
bins,
with_=with_,
conn=conn,
facet=facet,
breaks=breaks,
binwidth=binwidth,
)
width = _get_bar_width(ax, bin_, bin_size)
width = _get_bar_width(ax, bin_, bin_size, binwidth)

if isinstance(color, list):
color_ = color[i]
Expand Down Expand Up @@ -505,7 +543,9 @@ def histogram(


@modify_exceptions
def _histogram(table, column, bins, with_=None, conn=None, facet=None, breaks=None):
def _histogram(
table, column, bins, with_=None, conn=None, facet=None, breaks=None, binwidth=None
):
"""Compute bins and heights"""
if not conn:
conn = sql.connection.ConnectionManager.current
Expand Down Expand Up @@ -576,15 +616,23 @@ def _histogram(table, column, bins, with_=None, conn=None, facet=None, breaks=No
query = template.render(
table=table, column=column, filter_query=filter_query
)
elif not isinstance(bins, int):
elif not binwidth and not isinstance(bins, int):
raise ValueError(
f"bins are '{bins}'. Please specify a valid number of bins."
)
else:
# Use bins - 1 instead of bins and round half down instead of floor
# to mimic right-closed histogram intervals in R ggplot
range_ = max_ - min_
bin_size = range_ / (bins - 1)
if binwidth:
bin_size = binwidth
if binwidth > range_:
message(
f"Specified binwidth {binwidth} is larger than "
f"the range {range_}. Please choose a smaller binwidth."
)
else:
bin_size = range_ / (bins - 1)
template_ = """
select
ceiling("{{column}}"/{{bin_size}} - 0.5)*{{bin_size}} as bin,
Expand Down Expand Up @@ -638,6 +686,7 @@ def _histogram_stacked(
conn=None,
facet=None,
breaks=None,
binwidth=None,
):
"""Compute the corresponding heights of each bin based on the category"""
if not conn:
Expand All @@ -654,6 +703,8 @@ def _histogram_stacked(
cases.append(case)
cases[0] = cases[0].replace(">", ">=", 1)
else:
if binwidth:
bin_size = binwidth
tolerance = bin_size / 1000 # Use to avoid floating point error
for bin in bins:
# Use round half down instead of floor to mimic
Expand Down
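The SQL expression `ceiling("{{column}}"/{{bin_size}} - 0.5)*{{bin_size}}` in `_histogram` rounds each value to the nearest multiple of `bin_size`, with ties rounded down. The sketch below reproduces that arithmetic in plain Python to make the effect of `binwidth` easier to see. The function names and the sample values are hypothetical and are not taken from the penguins dataset.

```python
import math


def bin_label(value, bin_size):
    """Map a value to its bin label: ceiling(value / bin_size - 0.5) * bin_size.

    This rounds to the nearest multiple of bin_size, with ties going down.
    """
    return math.ceil(value / bin_size - 0.5) * bin_size


def bin_size_from_bins(min_, max_, bins):
    """Bin size when a number of bins is given: the range divided by bins - 1."""
    return (max_ - min_) / (bins - 1)


# With a binwidth, bin_size is simply the binwidth itself.
binwidth = 150
labels = [bin_label(v, binwidth) for v in (3750, 3800, 3825, 3830)]
# 3825 sits exactly halfway between 3750 and 3900 and is assigned to 3750.
```

Here `labels` evaluates to `[3750, 3750, 3750, 3900]`. When the binwidth exceeds the data range, the diff only prints a message rather than raising, so the histogram is still computed with the oversized `bin_size`.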
24 changes: 24 additions & 0 deletions src/sql/util.py
Expand Up @@ -494,3 +494,27 @@ def get_default_configs(sql):
del default_configs["parent"]
del default_configs["config"]
return default_configs


def _are_numeric_values(*values):
return all([isinstance(value, (int, float)) for value in values])
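A short sketch of how this helper behaves on a few inputs may be useful when reading the validation code. The function body is copied from the diff; the sample inputs are illustrative. One point worth knowing is that `bool` is a subclass of `int` in Python, so `True` and `False` pass this check.

```python
def _are_numeric_values(*values):
    # Same body as in the diff: every value must be an int or a float.
    return all([isinstance(value, (int, float)) for value in values])


ints_and_floats = _are_numeric_values(1, 2.5)   # numbers pass
numeric_string = _are_numeric_values("150")     # strings do not, even if numeric
missing = _are_numeric_values(None)             # None does not
boolean = _are_numeric_values(True)             # bool is a subclass of int
no_arguments = _are_numeric_values()            # all() of an empty list is True
```

Whether booleans should count as numeric for a binwidth is a design question this sketch does not settle; it only shows what the check accepts as written.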


def validate_mutually_exclusive_args(arg_names, args):
"""
Raises ValueError if more than one of the given arguments is specified,
that is, has a truthy value.

Parameters
----------
arg_names : list
Names of the arguments, as strings
args : list
Values of the arguments, in the same order as arg_names
"""
specified_args = [arg_name for arg_name, arg in zip(arg_names, args) if arg]
if len(specified_args) > 1:
raise exceptions.ValueError(
f"{pretty_print(specified_args)} are specified. "
"You can only specify one of them."
)
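The mutual-exclusion check can be tried out with a self-contained sketch. It assumes the built-in `ValueError` in place of the project's `exceptions.ValueError` and a plain `", ".join(...)` in place of `pretty_print`, whose exact output format is not shown in this diff. `specified_names` and `check_exclusive` are hypothetical names.

```python
def specified_names(arg_names, args):
    """Return the names whose corresponding value is truthy."""
    return [name for name, arg in zip(arg_names, args) if arg]


def check_exclusive(arg_names, args):
    """Raise ValueError when more than one argument has a truthy value."""
    specified = specified_names(arg_names, args)
    if len(specified) > 1:
        # ", ".join is a stand-in for the project's pretty_print helper.
        raise ValueError(
            f"{', '.join(specified)} are specified. "
            "You can only specify one of them."
        )


names = ["bins", "breaks", "binwidth"]

# Only binwidth is set: no error.
check_exclusive(names, [None, None, 150.0])

# Both breaks and binwidth are set: the check raises.
try:
    check_exclusive(names, [None, [0, 1, 2], 150.0])
    conflict_message = None
except ValueError as error:
    conflict_message = str(error)
```

Because the filter relies on truthiness, values such as `0`, `None`, and an empty list all count as "not specified" in this sketch.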
5 changes: 5 additions & 0 deletions src/tests/integration/test_generic_db_operations.py
Expand Up @@ -292,12 +292,17 @@ def test_telemetry_execute_command_has_connection_info(
"%sqlplot histogram --with plot_something_subset --table\
plot_something_subset --column x --breaks 0 2 3 4 5"
),
(
"%sqlplot histogram --with plot_something_subset --table\
plot_something_subset --column x --binwidth 1"
),
],
ids=[
"histogram",
"hist",
"histogram-bins",
"histogram-breaks",
"histogram-binwidth",
],
)
@pytest.mark.parametrize(
Expand Down