bug in re-adding NAs in df2genind with unexpected result #96

romunov · 2015-10-15T21:28:24Z

When a table of alleles is constructed in df2genind (towards the end), NAs are reinserted. I suspect that part of evaluating a subsetted object is coercing of factors of characters with "numeric" names (e.g. "720"), NAs are not added to corresponding rows but rather to sequential number the factor level is assigned. First two examples and a small proof of concept that I hope are easy to follow:

Take for example this small example from ?df2genind.

library(adegenet)
df <- data.frame(locusA=c("11","11","12","32"),
                 locusB=c(NA,"34","55","15"),locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)

          locusA locusB locusC
genotype1     11   <NA>     22
genotype2     11     34     22
genotype3     12     55     21
genotype4     32     15     22

obj <- df2genind(df, ploidy=2, ncode=1)
y <- tab(obj)

df["genotype1", ] # NAs on locusB, as expected
          locusA locusB locusC
genotype1     11   <NA>     22
df[rownames(df) == "genotype1", grepl("locusB", colnames(df))] # value on locusB -- NA, OK
[1] <NA>
Levels: 15 34 55
y[rownames(y) %in% "genotype1", grepl("locusB", colnames(y))] # values on locusB -- NAs, OK
locusB.3 locusB.4 locusB.5 locusB.1 
      NA       NA       NA       NA

All is fine as rownames cannot be coerced to numeric.

Then take this small dataset. Bear with me.

slc <- structure(list(AnimalID = structure(c(6L, 1L, 4L, 5L, 3L, 2L), .Label = c("498", 
"533", "544", "677", "678", "730"), class = "factor"), Samp = structure(c(6L, 
1L, 4L, 5L, 3L, 2L), .Label = c("AP.07P4", "AP.07UM", "AP.07XM", 
"AP.088P", "AP.088T", "AX.0CCE"), class = "factor"), INRA21 = structure(c(3L, 
1L, 2L, 4L, 4L, 3L), .Label = c("092 092", "092 096", "092 098", 
"096 098"), class = "factor"), AHT137 = structure(c(3L, 1L, 6L, 
2L, 4L, 5L), .Label = c("124 142", "124 148", "132 132", "134 146", 
"134 148", "140 146"), class = "factor"), REN169D01 = structure(c(NA, 
4L, 3L, 2L, 1L, 2L), .Label = c("198 198", "198 204", "204 204", 
"204 208"), class = "factor"), AHTk253 = structure(c(3L, 1L, 
1L, 1L, 2L, 2L), .Label = c("280 280", "280 286", "284 286"), class = "factor")), .Names = c("AnimalID", 
"Samp", "INRA21", "AHT137", "REN169D01", "AHTk253"), row.names = 12:17, class = "data.frame")

   AnimalID    Samp  INRA21  AHT137 REN169D01 AHTk253
12      730 AX.0CCE 092 098 132 132      <NA> 284 286
13      498 AP.07P4 092 092 124 142   204 208 280 280
14      677 AP.088P 092 096 140 146   204 204 280 280
15      678 AP.088T 096 098 124 148   198 204 280 280
16      544 AP.07XM 096 098 134 146   198 198 280 286
17      533 AP.07UM 092 098 134 148   198 204 280 286

g <- df2genind(X = slc[, !grepl("AnimalID|Samp", colnames(slc))], sep = " ", 
               ind.names = slc$AnimalID, ncode = 6, NA.char = NA)

slx <- tab(g)

slc[slc$AnimalID %in% c("730"), grepl("REN169D01", colnames(slc))] # individual "730" has NAs on locu REN169D01
[1] <NA>
Levels: 198 198 198 204 204 204 204 208

slx[rownames(slx) %in% c("730"), grepl("REN169D01", colnames(slx))] # these values should be NAs, but are 0
REN169D01.204 REN169D01.208 REN169D01.198 
            0             0             0 

slx[, grepl("REN169D01", colnames(slx))] # notice that NAs have been erroneously assigned to individual 533
    REN169D01.204 REN169D01.208 REN169D01.198
730             0             0             0
498             1             1             0
677             2             0             0
678             1             0             1
544             0             0             2
533            NA            NA            NA

I believe line 289 is responsible for this. NA.ind[i] is coerced to numeric and we all know what happens when we coerce a factor to numeric without an intermediate character step. Here's a proof of concept of what I think is going on.

mydf <- data.frame(a = c("aa", "bb", "cc"), 
                   b = c("dd", "ee", "ff"))
rownames(mydf) <- c("10", "11", "12")

mydf
    a  b
10 aa dd
11 bb ee
12 cc ff

rn <- factor(c("12", "11", "10"), levels = c("11", "12", "10"))
rn <- rn[1]
i <- 1
# rn[1] is 12

rn[1]
[1] 12
Levels: 11 12 10

# however rowname "12" is not selected, but rather row number two
mydf[rn[i], c("a", "b")]
    a  b
11 bb ee

# I think that during evaluation of `rn[1]` factor("12") is being coerced to numeric with respect to the rest of the factor levels
as.numeric(rn[1])
[1] 2

#which subsets the second row named "11"

One solution would be to replace line 289 with

out[rownames(out) %in% NA.ind[i], grep(loc, out.colnames)] <- NA

The issue can be also avoided if

slc$AnimalID <- paste("animol", 1:nrow(slc))

is called prior to constructing g.

What do you guys think? Can you think of any obvious edge cases where my suggestion wouldn't work?

This is probably related to my question on hw.test in pegas.

With original function, hw.test returns

> hw.test(g)
              chi^2 df Pr(chi^2 >) Pr.exact
INRA21     3.041667  3   0.3852455    1.000
AHT137    24.500000 21   0.2694679    0.488
REN169D01        NA NA          NA       NA
AHTk253    5.541667  3   0.1361676    0.247

but when I make the suggested edit to the df2genind function, it yields

              chi^2 df Pr(chi^2 >) Pr.exact
INRA21     3.041667  3   0.3852455    1.000
AHT137    24.500000 21   0.2694679    0.519
REN169D01  4.850000  3   0.1831162    1.000
AHTk253    5.541667  3   0.1361676    0.272

The text was updated successfully, but these errors were encountered:

I also added a test function and check that coerces ind.names to character

romunov · 2015-10-17T11:07:24Z

I have pushed suggested edits to fix_df2genind branch. Next to what I suggest in this issue (changing line 289) I also added another check to coerce ind.names to character and edited the documentation for this parameter. R CMD check passes ok, will see what Travis has to say.

thibautjombart · 2015-10-20T13:21:26Z

Thanks! I am just seeing this now. It seems to make sense to me at first glance, but as it is a core functionality we need to triple check everything. Can you add this toy example (or another) as a unit test using testthat?

romunov · 2015-10-20T13:31:41Z

Done.

thibautjombart · 2015-10-20T13:35:49Z

Smashing.

solves #96

thibautjombart · 2015-10-20T13:37:30Z

Merged with master now. Thanks for the fix! =D

romunov mentioned this issue Oct 15, 2015

when would hw.test produce NA? emmanuelparadis/pegas#13

Closed

romunov added a commit that referenced this issue Oct 17, 2015

solves #96

a4f1994

I also added a test function and check that coerces ind.names to character

thibautjombart added the bug label Oct 20, 2015

thibautjombart added a commit that referenced this issue Oct 20, 2015

Merge pull request #98 from thibautjombart/fix_df2genind

71af300

solves #96

thibautjombart closed this as completed Oct 20, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug in re-adding NAs in df2genind with unexpected result #96

bug in re-adding NAs in df2genind with unexpected result #96

romunov commented Oct 15, 2015

romunov commented Oct 17, 2015

thibautjombart commented Oct 20, 2015

romunov commented Oct 20, 2015

thibautjombart commented Oct 20, 2015

thibautjombart commented Oct 20, 2015

bug in re-adding NAs in df2genind with unexpected result #96

bug in re-adding NAs in df2genind with unexpected result #96

Comments

romunov commented Oct 15, 2015

romunov commented Oct 17, 2015

thibautjombart commented Oct 20, 2015

romunov commented Oct 20, 2015

thibautjombart commented Oct 20, 2015

thibautjombart commented Oct 20, 2015