problem with qsort_r #41
It seems to me that the simplest scheme (unless someone who maintains GSW-C wants to get involved and provide a POSIX-compliant solution...) may be to patch the C code shipped with GSW-R so that it no longer calls `qsort_r`.

Upside: this avoids what I called a "rolling fog" in a previous issue report (#24), by simply letting the CRAN maintainers worry about compatibility across the 12 build systems (and I think we must multiply that number by 2 or 3, to account for the versions of R on which things are expected to work). Doing that should prevent GSW-R from getting a poor reputation with CRAN through rapid resubmissions of problematic code.

Downside: it complicates the GSW-R documentation a bit, because it means we can no longer say we use the GSW-C source unaltered. There is a sort utility in R's own C API that seems right for this; see the sketch below.
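Here is a minimal sketch of what such a patch could look like. I am assuming the relevant entry point is `rsort_with_index()` from `R_ext/Utils.h`, which sorts a double vector and applies the same permutation to an integer index vector; whether that is exactly the function meant above is a guess on my part, and its tie-breaking behaviour would still need checking against the library requirement discussed below.

```c
#include <stdlib.h>
#include <R_ext/Utils.h>   /* rsort_with_index() -- part of R's C API */

/*
 * Fill ii[0..n-1] with the 0-based indices that order rdata ascending,
 * using R's own sorter instead of qsort_r.  Error handling is omitted
 * for brevity; this is a sketch only.
 */
static void order_indices(const double *rdata, int *ii, int n)
{
    /* Copy the data so the caller's array is left untouched; R's sorter
     * reorders x in place while permuting ii identically. */
    double *x = (double *) malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        x[i] = rdata[i];
        ii[i] = i;
    }
    rsort_with_index(x, ii, n);
    free(x);
}
```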
Dan,

You need to take note of the comment in the code:

```c
/*
** Note that the library functions using this utility
** depend on the fact that for replicate values in rdata
** the indexes are returned in descending sequence.
*/
```

which means that the sort routine you choose needs to be stable (i.e. it should preserve the order of items of equal value). The documentation for whichever routine you pick should tell you whether it is.
Excellent point, Glenn. Thanks!
In my understanding, the quicksort algorithm is generally not stable, in contrast to heapsort, for example; so either glibc `qsort_r` is not actually quicksort and is stable (unlike glibc `qsort`), or the C code is not consistent with the comment. https://www.gnu.org/software/libc/manual/html_node/Array-Sort-Function.html
```c
int
compare(const void *p1, const void *p2, void *rarray)
{
    double *rdata = rarray;
    if (rdata[*(int *)p1] < rdata[*(int *)p2])
        return (-1);
    if (rdata[*(int *)p1] > rdata[*(int *)p2])
        return (1);
    /*
    ** Note that the library functions using this utility
    ** depend on the fact that for replicate values in rdata
    ** the indexes are returned in descending sequence.
    */
    if (*(int *)p1 < *(int *)p2)
        return (1);
    return (0);
}
```
My recollection of heapsort's stability was wrong, but I think there is still an error in the code. The man pages on both OS X and Linux say that when two members compare as equal, their order in the sorted array is undefined; the sort is not guaranteed to be stable. That implies to me that the only way to make the result well-defined is to ensure that the compare function can never return 0. In other words, that last return statement should be `return (-1);` rather than `return (0);`.
Yes, you are right - the last return should be `return (-1);`. Note that this technique of using the index to break ties works because no two entries ever share an index, so the comparator genuinely never needs to return 0.
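For concreteness, a sketch of the comparator with that fix applied (still the `qsort_r`-style comparator quoted above; only the final return differs, so the function can never return 0):

```c
static int
compare(const void *p1, const void *p2, void *rarray)
{
    double *rdata = rarray;
    int i1 = *(const int *)p1;
    int i2 = *(const int *)p2;

    if (rdata[i1] < rdata[i2])
        return (-1);
    if (rdata[i1] > rdata[i2])
        return (1);
    /*
    ** Equal values: the larger index sorts first (descending index order,
    ** as the library requires).  Because two distinct elements never share
    ** an index, 0 is never returned, and the result no longer depends on
    ** whether the underlying sort is stable.
    */
    return (i1 < i2) ? 1 : -1;
}
```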
I wonder what @hylandg and @efiring (and others) think of the following. It could be a solution to (a) a complex system of ifdefs that could be brittle if GSW-C is ported to other machines and (b) the particular problem I have with the R case (as noted in the initiating comment of this issue).

My idea is to use the bog-standard `qsort`, sorting an array of structures that carry both the data value and the original index, so the comparator needs no extra context. I made a little test case, and am putting it below. The test data are burned into the C code, and were created by R (so I could check the results against R). I've not done much testing, but I get with this what I'd expect when I order this particular test case by hand.

Any thoughts, Glenn or Eric (or others)? It would be quite sweet if GSW-C could get rid of the `qsort_r` dependence.

Here's the code:

```c
#include <stdlib.h>
#include <stdio.h>

//#include "a.dat"
int nx = 10;                            // from a.R
double x[10] = {4,3,2,2,4,4,1,2,2,2};   // data, from a.R
int e[10] = {7,3,4,8,9,10,2,1,5,6};     // expected order, from a.R
// above from a.dat (the only point of this is to easily try different test cases)

typedef struct {
    double x;
    int i;
} XI;

int compareXI(const void *a, const void *b)
{
    XI *A = (XI*)a;
    XI *B = (XI*)b;
    if (A->x < B->x)
        return (-1);
    else if (A->x > B->x)
        return (1);
    else if (A->i < B->i)
        return (-1);
    else if (A->i > B->i)
        return (1);
    else {
        fprintf(stderr, "programming error in compareXI function\n");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    // Fill up structure with data and initial index (the latter in C notation)
    XI *xi = (XI*)malloc(nx*sizeof(XI));
    for (int i = 0; i < nx; i++) {
        xi[i].x = x[i];
        xi[i].i = i;
    }
    // Sort this structure, comparing values first and then comparing
    // indices, if values are identical.
    qsort(xi, nx, sizeof(XI), compareXI);
    printf("%20s %20s %20s %20s %10s\n", "x", "e", "xi.x", "1+xi.i", "OK?");
    // Show the results. We expect the fourth column to equal the second.
    for (int i = 0; i < nx; i++) {
        int OK = e[i] == xi[i].i + 1;
        printf("%20.1lf %20d %20.1lf %20d %10s\n", x[i], e[i], xi[i].x, 1+xi[i].i, OK ? "yes" : "no");
    }
    free(xi);
    return 0;
}
```

The output shows the fourth column equal to the second (and "yes" in the last column) for every row, which I think demonstrates correctness, albeit on just one little test.
NOTE: I have not looked into stitching this into the GSW-C code yet.
To explore the idea further, I forked GSW-C and switched the code to use this proposed method. The results are at https://github.com/dankelley/GSW-C. What I have compiles, and I could do a pull request if that seems sensible to others. I've not done that yet because others might think I'm being a bit pushy here. After all, I may be the only one in the TEOS-10 group who is disquieted by the `qsort_r` portability problem.
Dan, I think the general idea is good, but your first shot at the implementation has a few problems.
Thanks, Eric -- those three things are altered in the most recent commit in my fork at https://github.com/dankelley/GSW-C. Question: can you explain why the original code handles the sort-on-index in that way? It seems incorrect to me, if the goal is to get a list sorted by increasing value, using lower indices to break ties. (I've not looked at any of the embedding code, though. Maybe this gets used for something that is in reverse order from the start, or something.)
The comments in the code that calls this say that it doesn't matter; it just has to be consistent. I haven't worked out the algorithm in detail, so it's not clear to me whether the direction of consistent tie-breaking matters. The sorting is applied to the concatenation of the sorted x-from and the x-to values, so in cases where an x-from matches an x-to, the tie-breaking determines which comes first. Logically, which consistent choice is made shouldn't matter; if it did, then one would get different results if one inverted the order of all of the input arrays. So if you want to keep your original order, which looks nicer (greater-than always means return 1), then I'm confident that would be OK.
1. Sort-on-index. I reckon it may be best for my suggested change to be similar to the original, to avoid confusion. Therefore I propose to keep what I have in my fork, viz. an echo of the old code.
2. Thread safety. I don't know whether a switch from `qsort_r` to plain `qsort` raises any thread-safety concerns.

Perhaps @hylandg has some thoughts on these things, since I see from git-blame that he wrote the original code in Aug 2015 and (seems like a good month!) Aug 2016.
See https://access.redhat.com/solutions/172383. I think the answer is that if `qsort` is not thread-safe, then that is a bug in the particular libc implementation, so it's fine to use it in GSW-C. I wouldn't worry about it.
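To illustrate the thread-safety point (my own sketch, not code from the thread): the reason `qsort_r` exists is to pass the data array into the comparator without a file-scope variable. With the struct-of-(value, index) approach the comparator needs no external state at all, so plain `qsort` raises no reentrancy question in the first place.

```c
#include <stdio.h>
#include <stdlib.h>

/* Pattern A: plain qsort plus file-scope state.  The comparator reads
 * g_rdata, so two threads sorting against different data arrays at the
 * same time would race on it -- the problem qsort_r was invented to solve. */
static const double *g_rdata;
static int cmp_shared(const void *a, const void *b)
{
    double da = g_rdata[*(const int *)a], db = g_rdata[*(const int *)b];
    return (da > db) - (da < db);
}

/* Pattern B: each element carries its own value and index, so the
 * comparator needs no external state. */
typedef struct { double x; int i; } XI;
static int cmp_self(const void *a, const void *b)
{
    const XI *A = a, *B = b;
    if (A->x != B->x) return (A->x > B->x) - (A->x < B->x);
    return (A->i > B->i) - (A->i < B->i);   /* ties broken by index */
}

int main(void)
{
    double data[4] = {3.0, 1.0, 2.0, 2.0};
    int idx[4] = {0, 1, 2, 3};
    XI xi[4];
    for (int i = 0; i < 4; i++) { xi[i].x = data[i]; xi[i].i = i; }

    g_rdata = data;                           /* pattern A: shared state */
    qsort(idx, 4, sizeof idx[0], cmp_shared);
    qsort(xi, 4, sizeof xi[0], cmp_self);     /* pattern B: self-contained */

    /* Pattern A may order the tied values (indices 2 and 3) either way;
     * pattern B always gives 1 2 3 0. */
    for (int i = 0; i < 4; i++)
        printf("%d %d\n", idx[i], xi[i].i);
    return 0;
}
```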
I've spent some time going over the code and running through some detailed examples, and I've determined that the direction of tie-breaking doesn't matter, and in fact whether the sort is stable or not doesn't matter: the final result of the interpolation comes out the same either way. So I believe using plain `qsort`, as Dan proposes, will be fine.
I really want to move away from linear interpolation; it is just for "rough & ready cowboy/cowgirl" oceanographers. I have started doing this in the Matlab version, where the worst thing that is resorted to is a spline with tension. With a high tension value the results are similar to linear interpolation, but the derivative is continuous.
Paul, that sounds like a good move, but I am concerned it may leave other languages behind. For example, if I do the conversion from the latest Matlab v3.06 release to Fortran, how do I deal with Matlab-specific functions such as `pinv`?
I thought that `pinv` is included in the LAPACK library - or is that not the case?
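As far as I know, LAPACK has no `pinv` driver as such, but the Moore-Penrose pseudoinverse can be assembled from its SVD routine (`dgesdd`), or obtained implicitly through the least-squares drivers (`dgelsd`/`dgelss`). A minimal sketch in C, using the LAPACKE interface (the function name, layout choice, and tolerance are my own, following Matlab's default tolerance):

```c
#include <stdlib.h>
#include <float.h>
#include <lapacke.h>

/* Compute the pseudoinverse of an m x n row-major matrix a, writing the
 * n x m result into ainv.  Returns 0 on success.  Error handling and
 * cleanup are kept minimal; this is a sketch only. */
static int pinv(int m, int n, const double *a, double *ainv)
{
    int k = m < n ? m : n;
    double *acopy = malloc((size_t)m * n * sizeof *acopy);  /* dgesdd overwrites its input */
    double *s  = malloc((size_t)k * sizeof *s);
    double *u  = malloc((size_t)m * k * sizeof *u);
    double *vt = malloc((size_t)k * n * sizeof *vt);
    if (!acopy || !s || !u || !vt) return -1;
    for (int i = 0; i < m * n; i++) acopy[i] = a[i];

    /* A = U * S * V^T  (economy-size SVD) */
    lapack_int info = LAPACKE_dgesdd(LAPACK_ROW_MAJOR, 'S', m, n,
                                     acopy, n, s, u, k, vt, n);
    if (info != 0) return (int)info;

    /* Matlab-style tolerance: max(m,n) * eps * largest singular value */
    double tol = (m > n ? m : n) * DBL_EPSILON * s[0];

    /* A+ = V * S+ * U^T, dropping singular values at or below tol */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int r = 0; r < k; r++)
                if (s[r] > tol)
                    sum += vt[r * n + j] * (u[i * k + r] / s[r]);
            ainv[j * m + i] = sum;
        }

    free(acopy); free(s); free(u); free(vt);
    return 0;
}
```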
Paul, don't forget:
I take strong issue with your blanket condemnation of linear interpolation, and your disdain for those who use it. What are the situations where the conclusions reached from a calculation on a data set will be substantially improved by a fancier algorithm? And what are the situations where the difference it makes is negligible compared to the other uncertainties? How often will TEOS-10 be used in the former situation, and how often in the latter?
Linear interpolation is one of my pet hates. It caught me out once when I was still a student; it took me weeks to sort out, and then only by talking to the guys who used to get their data from Niskin bottles in the 50's, 60's and 70's. In modern data almost any interpolation is OK, but having a continuous derivative in the resulting data makes a method make far more sense than one without it.

We found that with modern data we can eliminate the very small-scale instabilities by adjusting salinity by less than the accuracy of the instrumentation, so it really comes down to how much you believe the raw data and those processing it. Looking over the processors' shoulders, I have noticed that the data given out are strongly influenced by each of the operators. One must also think about what they want to do with the data: if you are looking at mixing then you will not want to remove those instabilities, but if you are looking at global circulation or long-term signals then keeping short-time events that might last just 20 minutes is questionable.

We are now doing our interpolation on the SA-CT diagram. Using linear interpolation with SA-p and CT-p produced unrealistic SA-CT curves, as did the Reiniger & Ross (1968) method. In the Southern Ocean there are so few deep casts that we need to reconstruct the water column as best we can. I agree we do not know what was occurring between our measured bottles, but we can make an educated guess using what we know from modern data.
The last few comments on this thread have opened up the chance for a wide-ranging discussion, touching on aspects of data collection and data processing, as well as oceanography. (But please see my last paragraph for a suggestion for the future of the discussion...)

Regarding interpolation vs smoothing, it seems to me that part of the challenge in identifying best practices is the simple fact that measurement schemes have changed a great deal through the years. It is also fair to say that different applications demand different methodologies in both observation and analysis. It seems unlikely that things will settle down anytime soon, because whole new classes of observational technologies are arriving like waves on a shore. I think flexibility is the key word.

Based on discussions I've had with a variety of oceanographers about TEOS-10 over the years, I wonder if it's time to consider splitting TEOS-10 into two parts.
A set of us has discussed the idea of having a web conference call, perhaps annually, to discuss the status of the various software manifestations of TEOS-10. The above-stated scheme is one thing that I would likely bring up in such a discussion. I would argue that the TEOS-10/GSW code be reframed to deal only with PART 1, and that anything related to PART 2 should go into a whole new package/library. An advantage of that would be that PART 1 would vary slowly, obtain wide agreement, and be supported well by peer-reviewed literature, so that it could get periodic approval by the Intergovernmental Oceanographic Commission. (As far as I know, the IOC approval was only for the original polynomials ... maybe I'm wrong on that, though.)

I realize that this comment does not really belong in this individual issue thread. Indeed, it doesn't belong in the TEOS-10/GSW-R repository, because it's a general topic for discussion that goes beyond any of the secondary variants of the base TEOS-10/GSW algorithms. If Paul were to create a new repository on TEOS-10 for general discussion, that's where I should put this comment ... but only if this smaller list thinks it has enough merit to get a wider airing.
I updated the source:

- completely removed the god-awful, and unnecessarily complex, code from `gsw_linear_interp_sa_ct` / `gsw_util_interp1q_int`
- new code is simpler and easier to understand and maintain
- new code runs much faster (approx. 6x speedup)
- removed the now-redundant routine `gsw_util_sort_real` (which was causing issues in other language variants, e.g. TEOS-10/GSW-R#41)
I tried submitting the present version to CRAN, but got complaints about the use of `qsort_r`, which is not part of POSIX. The CRAN maintainers are quite stringent about such problems, and have asked me to address this "ASAP". (I was surprised that the complaint was not also about `qsort_s`, additionally, since that doesn't seem to be POSIX either.)
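For reference, a sketch of the incompatibility behind the complaint, with prototypes as given in the glibc and BSD/macOS man pages (`qsort_r` is in neither POSIX nor ISO C, so the major implementations diverged; the preprocessor guards are only there so the sketch compiles on either platform without pulling in `stdlib.h`):

```c
#include <stddef.h>   /* size_t */

/* GNU/Linux (glibc >= 2.8): the context pointer comes last, and the
 * comparator receives it as its THIRD argument. */
#if defined(__linux__)
void qsort_r(void *base, size_t nmemb, size_t size,
             int (*compar)(const void *, const void *, void *),
             void *arg);
#endif

/* BSD and macOS (historically): the context pointer ("thunk") comes before
 * the comparator, and the comparator receives it as its FIRST argument. */
#if defined(__APPLE__) || defined(__FreeBSD__)
void qsort_r(void *base, size_t nmemb, size_t size, void *thunk,
             int (*compar)(void *, const void *, const void *));
#endif
```

The context pointer changes position and the comparator's extra argument moves from last to first, so code written for one form will not compile against the other; `qsort_s`, mentioned above, differs yet again between C11 Annex K and Microsoft's C runtime. Hence the appeal of dropping `qsort_r` altogether.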