Bug Report: using chi² whereas Fisher is expected #1513

oliviercailloux · 2023-05-22T18:57:17Z

I apologize in advance if I missed something again, but the following code seems to behave differently than documented.

library(gtsummary)
library(tibble)  # tibble(), add_row()
library(tidyr)   # uncount()

t <- tibble(type = character(), answer = character()) %>%
  add_row(uncount(tibble(type = "A", answer = "C1"), 5)) %>%
  add_row(uncount(tibble(type = "B", answer = "C1"), 10)) %>%
  add_row(uncount(tibble(type = "A", answer = "C2"), 100)) %>%
  add_row(uncount(tibble(type = "B", answer = "C2"), 305)) %>%
  add_row(uncount(tibble(type = "A", answer = NA), 400)) %>%
  add_row(uncount(tibble(type = "B", answer = NA), 300))
t %>% tbl_summary(by = type) %>% add_p()

The doc states that "Tests default to (...) "chisq.test.no.correct" for categorical variables with all expected cell counts >=5, and "fisher.test" for categorical variables with any expected cell count <5." Here, cell A/C1 has an expected count of 15 * 105 / 420 = 15 / 4 = 3.75 < 5, but gtsummary uses a chi-squared test (as the footnote indicates, and as confirmed by the warning from chisq.test, which complains about the approximation).
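As a quick check of that figure, the expected counts under independence can be computed from the complete cases alone. This is a minimal base-R sketch, not gtsummary's internal code; `observed` and `expected` are names introduced here for illustration.

```r
# Observed counts once rows with a missing `answer` are dropped
# (illustrative object; counts taken from the example above)
observed <- matrix(
  c(5, 100,
    10, 305),
  nrow = 2, byrow = TRUE,
  dimnames = list(type = c("A", "B"), answer = c("C1", "C2"))
)

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Smallest expected count (cell A/C1): 105 * 15 / 420 = 3.75
min(expected)
```

The same matrix is available as `suppressWarnings(chisq.test(observed, correct = FALSE))$expected`.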

ddsjoberg · 2023-05-23T03:52:13Z

I think you calculated your expected counts incorrectly

library(gtsummary)

ttt <- tibble::tibble(type = character(), answer = character()) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C1"), 5)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C1"), 10)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C2"), 100)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C2"), 305)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = NA), 400)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300))

ttt |> 
  tbl_cross(statistic = "{p}%") |> 
  as_kable()

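|       | C1   | C2   | Unknown | Total |
|:------|:-----|:-----|:--------|:------|
| type  |      |      |         |       |
| A     | 0.4% | 8.9% | 36%     | 45%   |
| B     | 0.9% | 27%  | 27%     | 55%   |
| Total | 1.3% | 36%  | 63%     | 100%  |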

# expected count
0.013 * 0.45 * nrow(ttt)
#> [1] 6.552

Created on 2023-05-22 with reprex v2.0.2

oliviercailloux · 2023-05-23T09:10:09Z

Ah, you consider the expected counts by including the NA (Unknown) entries, contrary to me. This is as if NA was another category, treated on par with the C1 and C2 categories.

Considering NA as yet another category certainly makes sense in some situations, but then shouldn’t the chi² test itself also consider NA as a category? Currently, gtsummary seems to apply chi² as documented, thus, “cases with missing values are removed”. I find it therefore surprising that the expected counts consider NA as a category.

Note that a supplementary side effect of this treatment is that the chisq.test function warns about a possibly invalid approximation.

Put differently, a user can reasonably expect the test to behave as if the NAs were removed before it is applied, but that expectation is currently not met.
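The expected behaviour can be sketched as follows. `choose_test()` is a hypothetical helper written for this comment, not a gtsummary function; it assumes the documented rule (Fisher when any expected count is below 5) is applied only after incomplete rows are dropped.

```r
# Rebuild the two columns from the example (1120 rows, 700 with a missing answer)
n <- c(5, 10, 100, 305, 400, 300)
type <- rep(c("A", "B", "A", "B", "A", "B"), times = n)
answer <- rep(c("C1", "C1", "C2", "C2", NA, NA), times = n)

# Hypothetical helper: drop incomplete rows first, then apply the documented rule
choose_test <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)
  tab <- table(x[keep], y[keep])
  # Expected counts under independence, from complete cases only
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (any(expected < 5)) "fisher.test" else "chisq.test.no.correct"
}

choose_test(type, answer)
```

For the data in this issue the helper selects Fisher's exact test, because the A/C1 cell has an expected count of 3.75 among the 420 complete rows.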

ddsjoberg · 2023-05-23T12:23:53Z

I am somewhat confused. There are no NAs in the table....

oliviercailloux · 2023-05-23T12:58:40Z

I am confused as well, then. I meant to refer to the R value NA, which appears here, for example: tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300)). It is displayed as “Unknown” in the table you showed.

ddsjoberg · 2023-05-23T13:14:35Z

you're 100% correct, there are missing values, apologies. When the NAs are removed, the expected counts are still above 5

ddsjoberg · 2023-05-23T13:16:08Z

library(gtsummary)
#> #Uighur

ttt <- tibble::tibble(type = character(), answer = character()) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C1"), 5)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C1"), 10)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = "C2"), 100)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = "C2"), 305)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "A", answer = NA), 400)) %>% 
  tibble::add_row(tidyr::uncount(tibble::tibble(type = "B", answer = NA), 300))

ttt |> 
  tbl_cross(statistic = "{p}%", missing = "no") |> 
  as_kable()
#> FALSE observations with missing data have been removed.

|       | C1   | C2  | Total |
|:------|:-----|:----|:------|
| type  |      |     |       |
| A     | 1.2% | 24% | 25%   |
| B     | 2.4% | 73% | 75%   |
| Total | 3.6% | 96% | 100%  |

0.036 * 0.25 * nrow(ttt)
#> [1] 10.08

Created on 2023-05-23 with reprex v2.0.2

ddsjoberg · 2023-05-23T13:31:11Z

(and the NAs are removed before making the expected count assessment)

oliviercailloux · 2023-05-23T13:39:12Z

0.036 * 0.25 * nrow(ttt)
#> [1] 10.08

That’s with nrow(ttt) = 1120. With the removed NAs, the total count is 420.
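The two denominators can be compared directly; `n_all` and `n_complete` are just illustrative names for the row counts before and after dropping the missing answers.

```r
n_all <- 1120      # all rows, including the 700 with a missing answer
n_complete <- 420  # rows left once the missing answers are removed

0.036 * 0.25 * n_all       # 10.08, the figure quoted above
0.036 * 0.25 * n_complete  # 3.78, below the threshold of 5
```

The small gap between 3.78 and the exact 3.75 comes from the rounded percentage 3.6%.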

ddsjoberg · 2023-05-23T14:06:01Z

oh goodness, i have been giving half attention to the details here, apologies....i'll investigate!

ddsjoberg · 2023-05-28T03:33:48Z

@oliviercailloux thank you so much for reporting this! the bug occurred when there was a large number of missing values in one variable relative to the other, as the rates were being calculated separately for each variable rather than from a complete-case estimation across both variables. again, thank you!
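One way to picture the difference between the two calculations is sketched below. This is an illustration under assumptions only: it is not the code changed in #1516, and all variable names are made up here. It contrasts marginal rates taken from each variable's own non-missing rows with rates taken from rows that are complete on both variables.

```r
n <- c(5, 10, 100, 305, 400, 300)
type <- rep(c("A", "B", "A", "B", "A", "B"), times = n)
answer <- rep(c("C1", "C1", "C2", "C2", NA, NA), times = n)
complete <- !is.na(type) & !is.na(answer)

# Rates computed separately: each variable uses its own non-missing rows
p_type_sep <- as.numeric(prop.table(table(type)))      # over all 1120 rows
p_answer_sep <- as.numeric(prop.table(table(answer)))  # over the 420 non-missing answers
expected_sep <- outer(p_type_sep, p_answer_sep) * sum(complete)

# Complete-case rates: both variables restricted to the same 420 rows
p_type_cc <- as.numeric(prop.table(table(type[complete])))
p_answer_cc <- as.numeric(prop.table(table(answer[complete])))
expected_cc <- outer(p_type_cc, p_answer_cc) * sum(complete)

min(expected_sep)  # about 6.76, so a ">= 5" rule would pick chi-squared
min(expected_cc)   # 3.75, so the rule would pick Fisher
```

Under this assumed mechanism, the 400 missing answers in group A inflate A's share of the rows (505/1120 instead of 105/420), which pushes the smallest expected count above 5.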

ddsjoberg mentioned this issue May 27, 2023

Correction in expected count calculation #1516

Merged

15 tasks

ddsjoberg closed this as completed in #1516 May 28, 2023
