Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in re-adding NAs in df2genind with unexpected result #96

Closed
romunov opened this issue Oct 15, 2015 · 5 comments
Closed

bug in re-adding NAs in df2genind with unexpected result #96

romunov opened this issue Oct 15, 2015 · 5 comments
Labels

Comments

@romunov
Copy link
Collaborator

romunov commented Oct 15, 2015

When a table of alleles is constructed in df2genind (towards the end), NAs are reinserted. I suspect that part of evaluating a subsetted object is coercing of factors of characters with "numeric" names (e.g. "720"), NAs are not added to corresponding rows but rather to sequential number the factor level is assigned. First two examples and a small proof of concept that I hope are easy to follow:

Take for example this small example from ?df2genind.

library(adegenet)
df <- data.frame(locusA=c("11","11","12","32"),
                 locusB=c(NA,"34","55","15"),locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)

          locusA locusB locusC
genotype1     11   <NA>     22
genotype2     11     34     22
genotype3     12     55     21
genotype4     32     15     22

obj <- df2genind(df, ploidy=2, ncode=1)
y <- tab(obj)

df["genotype1", ] # NAs on locusB, as expected
          locusA locusB locusC
genotype1     11   <NA>     22
df[rownames(df) == "genotype1", grepl("locusB", colnames(df))] # value on locusB -- NA, OK
[1] <NA>
Levels: 15 34 55
y[rownames(y) %in% "genotype1", grepl("locusB", colnames(y))] # values on locusB -- NAs, OK
locusB.3 locusB.4 locusB.5 locusB.1 
      NA       NA       NA       NA 

All is fine as rownames cannot be coerced to numeric.

Then take this small dataset. Bear with me.

slc <- structure(list(AnimalID = structure(c(6L, 1L, 4L, 5L, 3L, 2L), .Label = c("498", 
"533", "544", "677", "678", "730"), class = "factor"), Samp = structure(c(6L, 
1L, 4L, 5L, 3L, 2L), .Label = c("AP.07P4", "AP.07UM", "AP.07XM", 
"AP.088P", "AP.088T", "AX.0CCE"), class = "factor"), INRA21 = structure(c(3L, 
1L, 2L, 4L, 4L, 3L), .Label = c("092 092", "092 096", "092 098", 
"096 098"), class = "factor"), AHT137 = structure(c(3L, 1L, 6L, 
2L, 4L, 5L), .Label = c("124 142", "124 148", "132 132", "134 146", 
"134 148", "140 146"), class = "factor"), REN169D01 = structure(c(NA, 
4L, 3L, 2L, 1L, 2L), .Label = c("198 198", "198 204", "204 204", 
"204 208"), class = "factor"), AHTk253 = structure(c(3L, 1L, 
1L, 1L, 2L, 2L), .Label = c("280 280", "280 286", "284 286"), class = "factor")), .Names = c("AnimalID", 
"Samp", "INRA21", "AHT137", "REN169D01", "AHTk253"), row.names = 12:17, class = "data.frame")

   AnimalID    Samp  INRA21  AHT137 REN169D01 AHTk253
12      730 AX.0CCE 092 098 132 132      <NA> 284 286
13      498 AP.07P4 092 092 124 142   204 208 280 280
14      677 AP.088P 092 096 140 146   204 204 280 280
15      678 AP.088T 096 098 124 148   198 204 280 280
16      544 AP.07XM 096 098 134 146   198 198 280 286
17      533 AP.07UM 092 098 134 148   198 204 280 286

g <- df2genind(X = slc[, !grepl("AnimalID|Samp", colnames(slc))], sep = " ", 
               ind.names = slc$AnimalID, ncode = 6, NA.char = NA)

slx <- tab(g)

slc[slc$AnimalID %in% c("730"), grepl("REN169D01", colnames(slc))] # individual "730" has NAs on locu REN169D01
[1] <NA>
Levels: 198 198 198 204 204 204 204 208

slx[rownames(slx) %in% c("730"), grepl("REN169D01", colnames(slx))] # these values should be NAs, but are 0
REN169D01.204 REN169D01.208 REN169D01.198 
            0             0             0 

slx[, grepl("REN169D01", colnames(slx))] # notice that NAs have been erroneously assigned to individual 533
    REN169D01.204 REN169D01.208 REN169D01.198
730             0             0             0
498             1             1             0
677             2             0             0
678             1             0             1
544             0             0             2
533            NA            NA            NA

I believe line 289 is responsible for this. NA.ind[i] is coerced to numeric and we all know what happens when we coerce a factor to numeric without an intermediate character step. Here's a proof of concept of what I think is going on.

mydf <- data.frame(a = c("aa", "bb", "cc"), 
                   b = c("dd", "ee", "ff"))
rownames(mydf) <- c("10", "11", "12")

mydf
    a  b
10 aa dd
11 bb ee
12 cc ff

rn <- factor(c("12", "11", "10"), levels = c("11", "12", "10"))
rn <- rn[1]
i <- 1
# rn[1] is 12

rn[1]
[1] 12
Levels: 11 12 10

# however rowname "12" is not selected, but rather row number two
mydf[rn[i], c("a", "b")]
    a  b
11 bb ee

# I think that during evaluation of `rn[1]` factor("12") is being coerced to numeric with respect to the rest of the factor levels
as.numeric(rn[1])
[1] 2

#which subsets the second row named "11"

One solution would be to replace line 289 with

out[rownames(out) %in% NA.ind[i], grep(loc, out.colnames)] <- NA

The issue can be also avoided if

slc$AnimalID <- paste("animol", 1:nrow(slc))

is called prior to constructing g.

What do you guys think? Can you think of any obvious edge cases where my suggestion wouldn't work?

This is probably related to my question on hw.test in pegas.

With original function, hw.test returns

> hw.test(g)
              chi^2 df Pr(chi^2 >) Pr.exact
INRA21     3.041667  3   0.3852455    1.000
AHT137    24.500000 21   0.2694679    0.488
REN169D01        NA NA          NA       NA
AHTk253    5.541667  3   0.1361676    0.247

but when I make the suggested edit to the df2genind function, it yields

              chi^2 df Pr(chi^2 >) Pr.exact
INRA21     3.041667  3   0.3852455    1.000
AHT137    24.500000 21   0.2694679    0.519
REN169D01  4.850000  3   0.1831162    1.000
AHTk253    5.541667  3   0.1361676    0.272
romunov added a commit that referenced this issue Oct 17, 2015
I also added a test function and check that coerces ind.names to character
@romunov
Copy link
Collaborator Author

romunov commented Oct 17, 2015

I have pushed suggested edits to fix_df2genind branch. Next to what I suggest in this issue (changing line 289) I also added another check to coerce ind.names to character and edited the documentation for this parameter. R CMD check passes ok, will see what Travis has to say.

@thibautjombart
Copy link
Owner

Thanks! I am just seeing this now. It seems to make sense to me at first glance, but as it is a core functionality we need to triple check everything. Can you add this toy example (or another) as a unit test using testthat?

@romunov
Copy link
Collaborator Author

romunov commented Oct 20, 2015

Done.

@thibautjombart
Copy link
Owner

Smashing.

@thibautjombart
Copy link
Owner

Merged with master now. Thanks for the fix! =D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants