-
Notifications
You must be signed in to change notification settings - Fork 992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check for index in setkey #3582
Conversation
@saraswatmks Excellent! Have invited you to be project member (you need to accept the notification) so next time you can create a branch in the main project. |
I don't see |
@mattdowle thanks for your quick feedback. I've made the change. Now we check:
I am on macos. No matter what I do I can't run tests locally. I need to push everytime and wait for pipeline to throw some error. Could you guide me to some link so that I can test it locally before pushing ? I have already tried this and it hasn't helped. |
Codecov Report
@@ Coverage Diff @@
## master #3582 +/- ##
==========================================
+ Coverage 97.79% 97.79% +<.01%
==========================================
Files 66 66
Lines 12904 12909 +5
==========================================
+ Hits 12620 12625 +5
Misses 284 284
Continue to review full report at Codecov.
|
hi @saraswatmks could you share more details about what's not working? especially your locale info, that seems to be the most common source of headaches |
@saraswatmks |
R/setkey.R
Outdated
|
||
# get existing index name if any | ||
found_index <- NULL | ||
if(is.null(indices(x))){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be !is.null
here? iiuc that's why line 107 isn't covered (see codecov results in the conversation tab).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, my bad
@saraswatmks at which point |
@MichaelChirico I get this error when I run
|
@jangorecki Thanks for helping. When I run
|
Here's my sessionInfo
|
@saraswatmks install those two missing packages, or use |
@jangorecki Just to make sure I am doing the right way, what is the common way to do this ? For ex: I am doing the following:
Am I following the right way? Or there is something in Also, after I installed both the dependencies, I did
|
cd data.table
make build && make check |
to each his own but I cringe at the word I've been thinking it might be nice to have a short vignette or blog post or wiki page on "how to fix a data.table issue for a first time contributor" giving some tips and tricks on getting over the hump. you might want to check the file cc.R which has a lot of (densely packed) accumulated knowledge on debugging & package development |
PS so it doesn't get lost in the trees -- thanks for the PR! we really appreciate all community efforts at fixing&improving the package & hope the experience is more illuminating than frustrating 😁 |
R/setkey.R
Outdated
} | ||
|
||
# forder only if index is not present | ||
if(!identical(found_index, cols)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works if there is one index, as the test tests. But there can be multiple indexes (each index is a set of columns). It needs to find if any of the indexes match cols
and then use that one. The test needs expanding for cases of multiple indexes and cols
both existing and not existing when there are multiple (so it tests it picks up the correct one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it.
R/setkey.R
Outdated
} else { | ||
o <- forderv(x, cols, sort=TRUE, retGrp=FALSE) | ||
cat("using existing index for", found_index, "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs an if (verbose)
please otherwise this is always being printed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it.
R/setkey.R
Outdated
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n") | ||
|
||
# get existing index name if any | ||
found_index <- NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer =
in data.table please. I don't agree with the common and widely marketed advice to use <-
. Single =
is the same as C and many other languages. I sometimes hear, e.g. from Python folk, that R code using <-
looks "old"/"not modern" and I see their point. I've always used =
. I save <-
for use when passing function arguments to do an assign and pass in one go; e.g. write(DT, file=f<-tempfile()); ... do something with f ...
. I'm often swapping between C and R (which I think more people should do too since C is not as hard as some people want you to believe). When doing this, using =
to mean assign and ==
to mean equals in two languages consistently (R and C) is nice. And by the way, the people who laugh at R for using L
postfix to mean integer ... it comes from C (it's the same as C) and it makes a lot of sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for detailed insight. made the change.
@MichaelChirico Thanks to you guys for your kindness and patience. I've been wanting to contribute since a long time. |
@saraswatmks Thanks for contributing, and welcome!
I made some comments above to be addressed please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for making the previous changes.
More requests below. I hope they make sense and I tried to explain the rationale behind them for future reference. Almost there! I wrote quite a bit intending a series of comments in order, but they seem to have appeared in a different order after I submitted: just read them all quickly first before getting bogged down into any particular one.
Also, please add a news item, and please add yourself to the bottom of contributors list in DESCRIPTION. Your name will then appear on CRAN contributor field on the next update.
inst/tests/tests.Rraw
Outdated
aaa = c(1,1,2,2,2,1,1,2,2,2)) | ||
setindex(DT, a) | ||
test(1419.60, allIndicesValid(DT), TRUE) | ||
test(1419.61, setkey(DT, a, verbose=TRUE), output="using existing index for a") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check on output=
is necessary and good but not quite sufficient. It's possible that setkey() prints this message but then doesn't set the key properly. So there needs to be another test just after this one that makes sure the key has been changed. Unfortunately the test data chosen doesn't make this a good test because the input data is trivially sorted already. So my first thought is to change the input data so it's random e.g. DT = data.table(a=c(2,3,2,1,3,2,1), aaa=...) then add test 1419.62 to check that DT$a is c(1,1,2,2,2,3,3) afterwards. Or something like that. Then it is checking both that i) the setkey() has changed the physical order and ii) the output=
also checks that it changed the order for the correct reason (using the existing index).
R/setkey.R
Outdated
# get existing index name if any | ||
found_index = NULL | ||
if(!is.null(indices(x))) found_index <- names(attributes(attributes(x)$index)) | ||
new_possible_index = paste0("__", cols, collapse="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 3 lines would be simpler as just one line :
index = paste0(cols, collapse="__")
See the next 3 comments in combination below ...
R/setkey.R
Outdated
new_possible_index = paste0("__", cols, collapse="") | ||
|
||
# forder only if index is not present | ||
if(!any(new_possible_index == found_index)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then this if()
would be :
if (!any(index == indices(x))
If x doesn't have any indices, indices(x) returns NULL and any("anything"==NULL) is FALSE in R (which is nice).
The idea is just to save repeating variable names and use fewer lines of code so we have less to maintain in future. Not too few lines to the extent of making it hard to understand. But in this case, what I'm suggesting is simpler and worth doing and easier to read and check.
Using indices() is cleaner than fetching the attributes() directly, because we have an isolating interface. If we ever change the attribute structure or names in future, we only need to change the code inside indices() not try and find everywhere that reads the attributes directly.
R/setkey.R
Outdated
} else { | ||
o = forderv(x, cols, sort=TRUE, retGrp=FALSE) | ||
# find the matching index | ||
ix = found_index[which(found_index == new_possible_index)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then this line can be removed.
R/setkey.R
Outdated
cat("using existing index for", gsub("^__","", ix), "\n") | ||
o <- attr(attributes(x)$index, which=ix, exact = TRUE) | ||
} else { | ||
o <- attr(attributes(x)$index, which=ix, exact = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And these two which=ix
become which=index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattdowle I've made the changes. Correct me if I am wrong, I think we need ix
here because, let's take a scenario where a user set multiple indexes. So when we do indices(DT)
, we get a, b, __a__b
, if the user does setkey(DT, a), setkey(DT, b), setkey(DT, a, b)
for a table DT = data.table(a = c(...), b = c(...), c = c(...))
. In this case, we need to find the ix
index of which index is matched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - thanks again. Almost there!
Look at line 104 though. I commented before that line 104 could be removed. Line 104 is, in the end, just renaming index
as idx
, iiuc. This which=
is looking up the item by name anyway. So line 104 can be removed and then replace this which=idx
with which=index
.
Once that's done, perhaps rename the local variable index
to thiskey
or newkey
? That way it reads a little betters and avoids the same name "index" as the attribute name. For example, line 94 will probably read a little better as if (!any(indices(x)==newkey))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for your patience.
inst/tests/tests.Rraw
Outdated
setindex(DT, a) | ||
setindex(DT, aaa) | ||
test(1419.62, allIndicesValid(DT), TRUE) | ||
test(1419.63, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same request here as above for 1419.61. This test needs to check the result of this setkey() command has done the correct thing (e.g. changed the physical row order properly) as well as doing that for the right reason (output=
).
Since setkey() returns DT invisibly, the easiest way is just to pass an appropriate y=data.table(...) into this test(). See other tests() in this file for examples for single calls to test() which use x, y and output= too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for previous. Still the results need checking in tests. It's ok - I'll do them. Since setkey()
is a core part of data.table we have to ensure any changes to it are nailed down.
inst/tests/tests.Rraw
Outdated
test(1419.62, setkey(DT, a, verbose=TRUE), output="using existing index for a") | ||
|
||
# check setkey incase of existing multiple indexes | ||
DT <- data.table(a = c(3,3,4,4,5,6,1,1,7,2,2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a
has 11 items here but aaa
and bbb
have 10 so there's a warning when running test.data.table() runs in dev and when users would run test.data.table(). It's a lot easier and faster to run locally first before pushing. See the cc.R
script in the root of the project : that's what I use in dev.
inst/tests/tests.Rraw
Outdated
setindex(DT, a) | ||
setindex(DT, aaa) | ||
test(1419.65, allIndicesValid(DT), TRUE) | ||
test(1419.66, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't check the result of setkey is correct. Please see previous comments where that was requested : #3582 (comment) and #3582 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for chipping in. I wasn't fully aware of test
function parameters. Now I understand it better. Also, I will try to fix my local environment (I still don't know how am I gonna do that) to avoid stupid errors in the PR.
I added the result checks to the tests. |
inst/tests/tests.Rraw
Outdated
options(datatable.auto.index = TRUE) | ||
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[9L],"a")) | ||
test(1376.12, list(DT[a==2L], indices(DT)), list(DT[2L],"a")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 3 tests changes from 9L to 2L were strange that they were needed. Anyway, the latest commit leaves these tests unchanged which seems more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep: those 3 tests should not have been changed to 9L in this PR. The setkey(DT,b) inbetween tests 1376.07 and 1376.08 had broken under this PR and wasn't changing the row order properly.
This change will be superseded (possibly taking it all away during git merge) by #4386 which do same optimization (among others) but on C level rather than R. |
Closes #2889
The idea the following:
setkeyv
if the data.table already has an indexsetindex
inside it again.