Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: using chi² whereas Fisher is expected #1513

Closed
oliviercailloux opened this issue May 22, 2023 · 10 comments · Fixed by #1516
Closed

Bug Report: using chi² whereas Fisher is expected #1513

oliviercailloux opened this issue May 22, 2023 · 10 comments · Fixed by #1516

Comments

@oliviercailloux
Copy link

I apologize in advance if I missed something again, but the following code seems to behave differently than documented.

t <- tibble(type = character(), answer = character()) %>% 
  add_row(uncount(tibble(type = "A", answer = "C1"), 5)) %>% 
  add_row(uncount(tibble(type = "B", answer = "C1"), 10)) %>% 
  add_row(uncount(tibble(type = "A", answer = "C2"), 100)) %>% 
  add_row(uncount(tibble(type = "B", answer = "C2"), 305)) %>% 
  add_row(uncount(tibble(type = "A", answer = NA), 400)) %>% 
  add_row(uncount(tibble(type = "B", answer = NA), 300))
t %>% tbl_summary(by=type) %>% add_p()

The doc states that "Tests default to (...) "chisq.test.no.correct" for categorical variables with all expected cell counts >=5, and "fisher.test" for categorical variables with any expected cell count <5." Here, cell A/C1 has expected count 15 * 105 / 420 = 15 / 4 < 4 but gtsummary uses a chi squared test (as it indicates in the footnote, and as appears in the warning sent by chi squared test which is unhappy about the approximation).

@ddsjoberg
Copy link
Owner

I think you calculated your expected counts incorrectly

library(gtsummary)

ttt <- tibble::tibble(type = character(), answer = character()) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C1"), 5)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C1"), 10)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C2"), 100)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C2"), 305)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = NA), 400)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300))

ttt |> 
  tbl_cross(statistic = "{p}%") |> 
  as_kable()
C1 C2 Unknown Total
type
A 0.4% 8.9% 36% 45%
B 0.9% 27% 27% 55%
Total 1.3% 36% 63% 100%
# expected count
0.013 * 0.45 * nrow(ttt)
#> [1] 6.552
 

Created on 2023-05-22 with reprex v2.0.2

@oliviercailloux
Copy link
Author

oliviercailloux commented May 23, 2023

Ah, you consider the expected counts by including the NA (Unknown) entries, contrary to me. This is as if NA was another category, treated on par with the C1 and C2 categories.

Considering NA as yet another category certainly makes sense in some situations, but then shouldn’t the chi² test itself also consider NA as a category? Currently, gtsummary seems to apply chi² as documented, thus, “cases with missing values are removed”. I find it therefore surprising that the expected counts consider NA as a category.

Note that a supplementary side effect of this treatment is that the chisq.test function warns about a possibly invalid approximation.

To put it otherwise, it seems reasonable for the user to expect that the test should behave as if the NA were removed before applying the test, but this expectation is currently not matched.

@ddsjoberg
Copy link
Owner

I am somewhat confused. There are no NAs in the table....

@oliviercailloux
Copy link
Author

I am confused as well, then. I meant to refer to the r value NA, which appears here, for example: tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300)). It is displayed as “Unknown” in the table you showed.

@ddsjoberg
Copy link
Owner

ddsjoberg commented May 23, 2023

you're 100% correct, there are missing values, apologies. When the NAs are removed, the expected counts are still above 5

@ddsjoberg
Copy link
Owner

library(gtsummary)
#> #Uighur

ttt <- tibble::tibble(type = character(), answer = character()) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C1"), 5)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C1"), 10)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C2"), 100)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C2"), 305)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = NA), 400)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300))

ttt |> 
  tbl_cross(statistic = "{p}%", missing = "no") |> 
  as_kable()
#> FALSE observations with missing data have been removed.
C1 C2 Total
type
A 1.2% 24% 25%
B 2.4% 73% 75%
Total 3.6% 96% 100%
0.036 * 0.25 * nrow(ttt)
#> [1] 10.08

Created on 2023-05-23 with reprex v2.0.2

@ddsjoberg
Copy link
Owner

(and the NAs are removed before making the expected count assessment)

@oliviercailloux
Copy link
Author

0.036 * 0.25 * nrow(ttt)
#> [1] 10.08

That’s with nrow(ttt) = 1120. With the removed NAs, the total count is 420.

@ddsjoberg
Copy link
Owner

oh goodness, i have been giving half attention to the details here, apologies....i'll investigate!

@ddsjoberg
Copy link
Owner

@oliviercailloux thank you so much for reporting this! the bug occurred when there was a large number of missing in one variable relative to the other as the rates we being calculated separately, rather than a complete case estimation by both variables. again, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants