Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram breaks #767

Merged
merged 7 commits into from
Aug 5, 2023
Merged

Histogram breaks #767

merged 7 commits into from
Aug 5, 2023

Conversation

bbeat2782
Copy link

@bbeat2782 bbeat2782 commented Jul 28, 2023

Describe your changes

Implemented --breaks/-B to ggplot histogram.

breaks demonstration.zip This notebook contains breaks demonstration and errors I implemented.

Issue number

Closes #719

Checklist before requesting a review


📚 Documentation preview 📚: https://jupysql--767.org.readthedocs.build/en/767/

@bbeat2782
Copy link
Author

breaks demonstration.zip

This is the implemented behavior of breaks. @edublancas Could you take a look at the demonstration and check whether it behaves as we want when you have time?

Since I have two other issues, this does not need an immediate check.

Copy link

@edublancas edublancas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! don't forget to add some reference images and test cases

doc/api/magic-plot.md Show resolved Hide resolved
Comment on lines 82 to 90
bin_specified = False
pattern = r"histogram.*( --bins | -b )"
if re.search(pattern, line):
bin_specified = True
if re.search(r"( --breaks | -B )", line):
raise exceptions.UsageError(
"Both bins and breaks are specified. Must specify only one of them."
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use cmd to check if both arguments exist.

e.g.,

if cmd.args.bins and cmd.args.breaks:
    raise UsageError(...)

let me know if that works, otherwise we should fix it since validating arguments with regular expressions is going to be a pain if we have more complex combinations

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the default value of bins (which is 50 for the magic command), cmd.args.bins always return an integer value. Thus, I used the following code instead.

bin_specified = " --bins " in line or " -b " in line
breaks_specified = " --breaks " in line or " -B " in line
if bin_specified and breaks_specified:
     raise exceptions.UsageError(...)

changelog

test cases and fixes

comments added
@bbeat2782 bbeat2782 marked this pull request as ready for review July 31, 2023 20:57
tonykploomber
tonykploomber previously approved these changes Aug 1, 2023
Copy link

@tonykploomber tonykploomber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, very comprehensive test case covered, GJ

@neelasha23
Copy link

neelasha23 commented Aug 2, 2023

Would it make sense to allow users to pass a range of values?

points = list(range(1000, 6000, 500))
%sqlplot histogram --table penguins.csv --column body_mass_g --breaks {{points}}

@edublancas @bbeat2782

A few more observations:

  1. I'm not able to understand this result: If we specify %sqlplot histogram --table penguins.csv --column body_mass_g --bins 100 the plot is as below (x value goes upto 6000+)
Screenshot 2023-08-02 at 12 57 13 PM

But when we specify breaks where last break is less than highest x value it doesn't show data beyond the last point:

Screenshot 2023-08-02 at 12 58 41 PM

Shouldn't there be one more bin after 5000 till the highest value of body_mass_g ?

  1. UsageError: All break points are lower than the min data point. : It might help the user if we specify what the minimum point is (similar for max data point).

  2. There would be only 1 bin if first point is less than lower datapoint and second point is higher than max value. We can provide some info to guide user when specifying points:

Screenshot 2023-08-02 at 1 12 24 PM

Copy link

@yafimvo yafimvo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Added some comments.

I think we should add some integration tests and make sure it works (or doesn't) on other databases (PostgreSQL, questdb, etc...)

src/sql/ggplot/geom/geom_histogram.py Show resolved Hide resolved
src/sql/magic_plot.py Show resolved Hide resolved
src/sql/plot.py Show resolved Hide resolved
if bins:
raise ValueError(
"Both bins and breaks are specified. Must specify only one of them."
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error repeats itself. Can we delete the error handler from here?

Also, please change ValueError to exceptions.ValueError

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted the error handler in magic_plot and changed ValueError to exceptions.ValueError

doc/api/magic-plot.md Show resolved Hide resolved
@bbeat2782
Copy link
Author

@neelasha23

allow users to pass a range of values

With this PR implementation, the following works.

points = " ".join(map(str, range(1000, 6000, 500)))
%sqlplot histogram --table penguins.csv --column body_mass_g --breaks {points}

Shouldn't there be one more bin after 5000 till the highest value of body_mass_g ?

  1. When specifying breaks, we are specifying a starting point, break points, and ending point. Thus, the data we query is limited to starting point <= data <= ending point. That is why the upper limit is 5000 here. This range limit is mimicking ggplot behavior in R. To avoid confusion, I added an explanation of breaks limiting range in magic-plot.md.

It might help the user if we specify what the minimum point is

  1. Changed the error message to UsageError: All break points are lower than the min data point of {min point}. (similar to the max data point)

There would be only 1 bin if first point is less than lower datapoint and second point is higher than max value. We can provide some info to guide user when specifying points:

  1. For this, do you think displaying a message will work better or giving a general info on the docs will work better?

@bbeat2782
Copy link
Author

@yafimvo

I think we should add some integration tests and make sure it works (or doesn't) on other databases (PostgreSQL, questdb, etc...)

Added test cases and images in test_questDB.py.

@edublancas
Copy link

@neelasha23 @yafimvo please review again

@neelasha23
Copy link

  1. For this, do you think displaying a message will work better or giving a general info on the docs will work better?

Docs should be fine

neelasha23
neelasha23 previously approved these changes Aug 3, 2023
Copy link

@neelasha23 neelasha23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link

@edublancas edublancas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, one thing is missing is adding one new test here:

you can just add an example that runs using --breaks, let's see if it passes. if you find compatibilities, ping me and we can decide whether we support it or not. my guess is that it'll work with the databases that we're testing there

@bbeat2782
Copy link
Author

one thing is missing is adding one new test here:

@edublancas I added histogram-breaks, and it passed all tests

@edublancas edublancas merged commit 424ff48 into ploomber:master Aug 5, 2023
22 checks passed
@edublancas
Copy link

awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement breaks in histogram's ggplot
5 participants