Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hist application does not use the user specified nbin parameter correctly. #3882

Closed
lavalaz opened this issue May 26, 2020 · 20 comments
Closed
Assignees
Labels
bug Something isn't working Products Issues which are impacting the products group

Comments

@lavalaz
Copy link

lavalaz commented May 26, 2020

ISIS version(s) affected: x.y.z
isis4.1.0 (and earlier)
Description

I carefully picked the min/max values and NBINS to put the bin boundaries at very specific DN values. But the application did not honor those user input limits and chose a min value larger than the user specified min and a max value less than the user specified max. Therefore the bin boundaries are not where the user intended.
How to reproduce

Pick any typical image with I/F values between 0 and 1.
hist from=[input.cub] minimum=0.0 maximum=1.0 nbin=10
This should but the bin boundaries at 0.0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5, 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, 0.9-1.0. Count them. Ten (10) bins. Hist produced 10 bins with "DN" of 0,0.11111,0.222222,0.333333,0.444444... ...0.88888,1. Ten bins alright, but not at 0.1 spacing. I guess I was wrong and 1/10=0.111111. My bad. So what DN's are in those bins? The output only specifies one DN value and I can't tell if its the max or the min of the bin. Since it starts with "0", clearly its the minimum value and the first bin has DN values of 0.0-0.111111. But since the last bin is "1", it must have DN values of 1 to ??? Oh, must be my mistake, the DN value must be the for the top of the bin so the last bin is for DN values of 0.88888-1.0. So the first bin has DNs of ??? to zero? But if I ask for 11 bins, I get bins of "0,0.1,0.2,0.3...1". 1/11=0.1, now I get it!
Possible Solution

Part 1: You need to give the values for minimum and maximum of the range of DNs that are going into each bin, not just a single DN value. You should also clearly document which side of the bin is a greater than/less than versus greater-than-or-equal-to/less-than-or-equal-to.
Part 2: You should also have a bin for "less-than-the-minimum" and "greater-than-the-maximum". So 12, clearly labeled, bins when the user says nbin=10.
Additional context

Just for historical context, I know this issue was raised about 15 years ago. The user in question, not myself, who had a degree in mathematics, was told that they did not understand the mathematics of histograms and this behavior was correct. That is why you are seeing extra snarkiness in this ticket.

@jessemapel
Copy link
Contributor

Just for historical context, I know this issue was raised about 15 years ago. The user in question, not myself, who had a degree in mathematics, was told that they did not understand the mathematics of histograms and this behavior was correct. That is why you are seeing extra snarkiness in this ticket.

The snarkiness is understandable, but it makes your bug report harder to read. The person your talking about doesn't work on ISIS anymore, so you're directing your snark towards people who are not related to it at all. I'd appreciate it if you removed the snarkiness from your bug report next time so that it's less likely to be misunderstood.

@lavalaz
Copy link
Author

lavalaz commented May 27, 2020

Yes, that is a fair request. I will try to do my best to not bring up things that happened a generation ago!

@jessemapel
Copy link
Contributor

I agree this math looks incorrect to me:

index = (int) floor((double)(nbins - 1) / (BinRangeEnd() - BinRangeStart() ) *
                            (data - BinRangeStart() ) + 0.5);

@kaitlyndlee kaitlyndlee self-assigned this Jun 8, 2020
@kaitlyndlee
Copy link
Contributor

kaitlyndlee commented Jun 9, 2020

@lavalaz Would it be possible for you to share a cube that would be good for testing, possibly the one you used in the original post?

@lavalaz
Copy link
Author

lavalaz commented Jun 9, 2020

Tied up in meetings, will try to get you one this evening. Will give you a link to one on /work somewhere...

@kaitlyndlee
Copy link
Contributor

kaitlyndlee commented Jun 10, 2020

@lavalaz You mentioned adding two extra bins to "catch" values outside of the range specified by the user. Would adding a parameter that allows users to select if they want the two extra bins be useful? Is there a use case for this functionality? It looks like the current functionality places values outside of the range in the first and last bins, e.g., bins 1 and 10 in the example.

In addition, would changing the CSV output to:

MinInclusive,MaxExclusive,Pixels,CumulativePixels,Percent,CumulativePercent

from:

DN,Pixels,CumulativePixels,Percent,CumulativePercent

meet the requirements of showing the bounds of each bin? Could this change negatively impact users? I am not familiar with hist or if it is used in any pipelines.

Lastly, some of these changes will affect other programs, e.g., cnethist, since they all use the same class in the background: Histogram. I assume we would need to update the output for all of these programs, as well. It looks like cnethist is the only other one that is used to generate histograms, is this correct?

@jessemapel If you have any opinions about this too.

@lavalaz
Copy link
Author

lavalaz commented Jun 10, 2020

You could make it a switch or just have them there always. There may be some users who complain about "extra" bins if they asked for a specific set of bins, so its not a bad idea but probably not essential.

@lavalaz
Copy link
Author

lavalaz commented Jun 11, 2020

After discussions and understanding more of how hist works, I am clarifying desired output:
Histogram with the number of bins the users asked for (no extra bins), no pixels below the requested min or above the max are included in the histogram.
Number of valid pixels below the min and number of valid pixels above max reported in the output. Total number of pixels and total valid pixels should both also be reported in the output, at it already is.

@kaitlyndlee
Copy link
Contributor

kaitlyndlee commented Jun 11, 2020

To clarify, I was wrong about pixels outside of the range being placed in the first and last bins. I missed a line of code and did some testing. However, we will still need to add how many pixels are outside of the range.

@jessemapel jessemapel added bug Something isn't working Products Issues which are impacting the products group labels Jun 12, 2020
@jessemapel
Copy link
Contributor

Because of how this change impacts the output format for hist, we're pushing the release of this bug fix back to August with the full feature release in 4.2

@lavalaz If you need this fix before then, let us know and we'll stage an internal build for you to use.

@blandoplanet
Copy link

The release candidate for 4.2 will be available in early July?

@jessemapel
Copy link
Contributor

@blandoplanet Yes

@lavalaz
Copy link
Author

lavalaz commented Jun 12, 2020

I don't need a special early version. Thank you!

@krlberry
Copy link
Contributor

PR needed to be temporarily reverted due to lots of failing tests. See PR #3922.

@jessemapel
Copy link
Contributor

Fixed by #4004

@rbeyer
Copy link
Contributor

rbeyer commented May 20, 2021

This change became available in ISIS 4.3.0, and changes the number of output columns in the text output file from hist, but I only just noticed it here half a year later.

For the record, this was a breaking change. It altered the default output of an ISIS program. This feature should have triggered 5.0.0 right then, it was not a non-breaking change. This is the kind of thing that the new SemVer approach is supposed to help us enforce, but it slipped through the cracks. I'll also note that this is the kind of thing that really is a breaking change hidden in a minor feature-release that can screw up a set of mission pipelines. It will affect HiRISE processing.

@audiefen
Copy link

Yes, we recently discovered this while testing ISIS 4.3.0. I agree that it is a breaking change. We are in the process of updating our pipelines to handle the changed output.

Changes like these are what add to the time it takes missions to test and update to newer versions of ISIS. The change seems small, but our pipelines can be complicated and the original developer might no longer be around. It takes us time to understand the original developers intent, the implications of making changes to the code and then to implement and test those changes. It is situations like this that impact which long term support release cadence would work best for mature missions. Changes take time and we are trying to keep up with smaller budgets, fewer people and an aging instrument.

@jessemapel
Copy link
Contributor

Good catch here. I think we need to come up with a better definition of what a breaking change is for ISIS. It should capture the library API, the app arguments, and the app outputs that people are writing pipelines against. I'm going to give this some thought over the weekend.

@jlaura
Copy link
Collaborator

jlaura commented May 21, 2021

From our previous work, I am pretty sure we are considering application output outside of the versioning schema. It looks like the change here is in a text file that is output by an application? If so, then isn't this change in the same spirit as #3227.

It sounds like we need to revisit the discussion about what an API change is and what the expectations are for parseable output in a different issue.

@rbeyer
Copy link
Contributor

rbeyer commented May 21, 2021

From our previous work, I am pretty sure we are considering application output outside of the versioning schema.

I thought the ISIS TC had come to the understanding that "the ISIS API" includes the arguments to userland programs and their output. In fact, there is this sentence in our 2020-02-11 meeting notes: "Currently the API is both the library API and the application API."

It sounds like we need to revisit the discussion about what an API change is and what the expectations are for parseable output in a different issue.

Sounds like we do if we aren't all on the same page.

If we're going to claim that adding a column of output to an output file isn't a breaking change (which is changing the "return signature" of that program), then this distinction between major and minor will appear to have no meaning for the users of the software. If we tell people that there will be no impact to their work when upgrading minor versions, but we are allowing changes like this, then we are lying to them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Products Issues which are impacting the products group
Projects
None yet
Development

No branches or pull requests

8 participants