New plot methods for check_outliers (?) #262

Open
rempsyc opened this issue Jan 20, 2023 · 4 comments
Labels
Enhancement 💥 New feature or request

Comments

@rempsyc
Member

rempsyc commented Jan 20, 2023

This is regarding the check_outliers() paper for the journal Mathematics (easystats/performance#544). I wonder if we should add new plot methods to include in the article submission (deadline is Feb 23). I explain in detail below.


Model-based outliers

For model-based outliers, see has an awesome plotting method:

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
model <- lm(disp ~ mpg * hp, data = data)
x <- check_outliers(model, method = "cook")
plot(x)

Created on 2023-01-20 with reprex v2.0.2

Multiple methods

For multiple methods, we have no choice but to standardize the distance scores if we want to plot them on the same scale, so I think the current solution is pretty satisfactory.

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
model <- lm(disp ~ mpg * hp, data = data)
x <- check_outliers(data, method = c("zscore_robust", "iqr", "mcd", "lof"))
plot(x)

Created on 2023-01-20 with reprex v2.0.2
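
To make the "same scale" point concrete, here is a minimal base-R sketch of min-max rescaling two differently scaled distance scores to 0-1. It is only an illustration: rescale01 is a hypothetical helper, and I am assuming (not claiming) that this resembles what the current plot does internally.

```r
# Hypothetical helper (not from performance or see): min-max rescale to [0, 1]
rescale01 <- function(d) {
  rng <- range(d, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(d)))  # constant input: no spread
  (d - rng[1]) / (rng[2] - rng[1])
}

data <- rbind(mtcars[1:4], 42, 55)

# Score 1: largest absolute robust z-score across columns, per row
robust_z <- sapply(data, function(v) abs(v - median(v)) / mad(v))
dist_z <- apply(robust_z, 1, max)

# Score 2: squared Mahalanobis distance, which lives on a different scale
dist_maha <- mahalanobis(data, colMeans(data), cov(data))

# After rescaling, both scores share the 0-1 range and can share one axis
scores <- data.frame(
  zscore_robust = rescale01(dist_z),
  mahalanobis   = rescale01(dist_maha),
  row.names     = rownames(data)
)
head(scores)
```

The cost of this approach is that the original thresholds no longer map onto the plotted axis, since each method is stretched by its own range.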

Multivariate methods

For a single multivariate method, I think the current plot is ok-ish. Writing a custom plotting method for each multivariate method could be a lot of work, so I think this is fine. However, the x-axis is hard to read because the numbers overlap (and it would be worse with big data sets).

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
model <- lm(disp ~ mpg * hp, data = data)
x <- check_outliers(data, method = "mcd")
plot(x)

Created on 2023-01-20 with reprex v2.0.2

For the Mahalanobis method specifically, one colleague believes their own custom plot is more useful (and would be happy to see it implemented within easystats):

data <- rbind(mtcars[1:4], 42, 55)
data <- cbind(car = row.names(data), data)

mahaout <- function(dataset, vars, idvar) {
  # Keep complete cases only, with the ID column alongside the variables
  maha <- as.data.frame(na.omit(dataset[, c(idvar, vars)]))
  x <- maha[, vars]
  # Squared Mahalanobis distance of each observation from the centroid
  maha$values <- mahalanobis(x, colMeans(x), cov(x))
  # Cutoff: 99.9th percentile of a chi-square with one df per variable
  crit <- qchisq(0.999, df = length(vars))
  plot(sort(maha$values),
       xlab = "Observations", ylab = "Mahalanobis values")
  abline(h = crit, col = "darkred")
  # Return the IDs of the observations beyond the cutoff
  maha[maha$values > crit, idvar]
}

mahaout(data, vars = names(data[-1]), idvar = "car")

#> [1] "34"

Created on 2023-01-20 with reprex v2.0.2

Should we have something like that for method = "mahalanobis" and similar ones? The guiding principle could be: plot the distance of each individual observation, plus a line at the chosen threshold. This might not be that much work, since the actual distances and thresholds are already accessible as attributes, and it would make for very consistent plotting.
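
A rough sketch of that guiding principle in base R, assuming we already have one distance per observation and a single threshold. plot_distance_threshold is a hypothetical name, and the distances are computed here with stats::mahalanobis rather than read from the check_outliers attributes, since I don't want to guess the attribute names.

```r
# Hypothetical helper (not part of performance or see): one point per
# observation, plus a horizontal line at the chosen threshold.
plot_distance_threshold <- function(distance, threshold, labels = names(distance)) {
  if (is.null(labels)) labels <- seq_along(distance)
  flagged <- distance > threshold
  plot(distance, pch = 19, col = ifelse(flagged, "darkred", "grey40"),
       xlab = "Observation", ylab = "Distance")
  abline(h = threshold, col = "darkred", lty = 2)
  # Return the plotted values invisibly so they can be inspected or tested
  invisible(data.frame(label = labels,
                       distance = unname(distance),
                       outlier = unname(flagged)))
}

data <- rbind(mtcars[1:4], 42, 55)
d <- mahalanobis(data, colMeans(data), cov(data))
crit <- qchisq(0.999, df = ncol(data))  # same cutoff as in mahaout() above

res <- plot_distance_threshold(d, crit)
res[res$outlier, ]
```

Because the function only needs a distance vector and a threshold, the same plot would work for any method that exposes those two things.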

Lakens's Method

Edit: I forgot to add this other example. Alternatively, we have the outlier plotting method from Daniel Lakens's outliers paper (Leys et al., 2019).

library(Routliers)

data <- rbind(mtcars[1:4], 42, 55)
res <- outliers_mcd(x = data)
plot_outliers_mcd(res, x = data)

Created on 2023-01-20 with reprex v2.0.2

Univariate methods

Here is another example of mine, for univariate outliers. Currently, we get the same boring plot for method = "zscore_robust", for instance.

library(performance)
library(see)

data <- rbind(mtcars[1:4], 42, 55)
x <- check_outliers(data, method = "zscore_robust")
plot(x)

Created on 2023-01-20 with reprex v2.0.2

But I was imagining that something like this could be useful for z-scores:

library(rempsyc)

data <- rbind(mtcars[1:4], 42, 55)
plot_outliers(data, response = "mpg", method = "sd", criteria = 3)

Created on 2023-01-20 with reprex v2.0.2

And something similar for robust z-scores. For several variables, we could also wrap it in a panel:

library(rempsyc)
library(see)
data <- rbind(mtcars[1:4], 42, 55)
plots(lapply(names(data), function(x) {
  plot_outliers(data, response = x, ytitle = x, method = "mad", criteria = 3)
}), n_columns = 2)

Created on 2023-01-20 with reprex v2.0.2

Lakens's Method

Edit: I forgot to add this other example. Alternatively, we have the outlier plotting method from Daniel Lakens's outliers paper (Leys et al., 2019).

library(Routliers)

data <- rbind(mtcars[1:4], 42, 55)
res <- outliers_mad(x = data$mpg)
plot_outliers_mad(res, x = data$mpg) 

Created on 2023-01-20 with reprex v2.0.2

Challenges

One possible challenge for univariate methods arises when they are applied to several columns. In that case the proposed solution will not work, since the rescaled score (0-1) is an aggregate of the scores of each column (for single multivariate methods, that would not be a problem by definition). So we could implement this only when a single method and a single column are selected? Unless, of course, we use lapply with see::plots like in the last example.
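
To illustrate the aggregation problem, here is a small base-R sketch that keeps one robust z-score per column instead of a single aggregated score per row. The threshold of 3 is an assumption for illustration (it matches criteria = 3 in the examples above), and nothing here uses the performance internals.

```r
data <- rbind(mtcars[1:4], 42, 55)
threshold <- 3  # assumed cutoff, for illustration only

# One robust z-score per cell: (value - median) / MAD, column by column
robust_z <- as.data.frame(lapply(data, function(v) (v - median(v)) / mad(v)))
rownames(robust_z) <- rownames(data)

# Per-column flags: every column can be plotted against the same threshold
flags <- abs(robust_z) > threshold
colSums(flags)

# Aggregated flag: says that a row is an outlier, but not in which column
any_flag <- apply(flags, 1, any)
```

With the per-column scores kept separate, each column could get its own panel with the original threshold, which is what the lapply approach does.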

@rempsyc rempsyc added the Enhancement 💥 New feature or request label Jan 20, 2023
@DominiqueMakowski
Member

I agree that the bar chart might not be the prettiest, BUT it has one major advantage over the other suggestions: it shows the row name, which makes it easy to identify which observation is problematic, which is the goal of check_outliers.

The plots of check_outliers should aim at achieving 3 objectives:

  • Show how much outlying observations "outlie" as compared to the group
  • Explicitly show the threshold used for classification
  • Show the row name of the outliers

I like the idea of a dot-violin plot like this one:
[image: dot-violin plot]

I think it does look better too, but perhaps we could further improve it by labelling the points, for example using pool points if the row names are numeric, or gglabels if they are character strings?
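
As a rough sketch of how the three objectives could fit in one univariate plot, here is a base-R version that shows the spread, draws the threshold, and labels only the flagged rows. It uses plain z-scores on mpg with an assumed cutoff of 3. It is not the dot-violin plot itself, just the labelling idea.

```r
data <- rbind(mtcars[1:4], 42, 55)

# Classic z-scores for one column, named by row so they can be labelled
z <- (data$mpg - mean(data$mpg)) / sd(data$mpg)
names(z) <- rownames(data)

threshold <- 3  # assumed cutoff, for illustration only
is_out <- abs(z) > threshold

# Objective 1: show how far each observation sits from the group
stripchart(z, method = "jitter", vertical = TRUE, pch = 19,
           col = "grey40", ylab = "z-score (mpg)")
# Objective 2: show the threshold explicitly
abline(h = c(-threshold, threshold), col = "darkred", lty = 2)
# Objective 3: label only the flagged observations with their row names
if (any(is_out)) {
  text(x = 1, y = z[is_out], labels = names(z)[is_out], pos = 4)
}
```

Labelling only the flagged rows keeps the plot readable when there are many observations.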

@rempsyc
Member Author

rempsyc commented Jan 23, 2023

BUT it has one major advantage over the other suggestions: it shows the row name, which makes it easy to identify which observation is problematic, which is the goal of check_outliers.

True, but check_outliers() already flags them, so I think tagging them on the plot is less critical than it would be if people used only the plot and not the check_outliers() output. Plus, as mentioned before, even with 30 observations it is hard to read which observations we are looking at. With 400+ data points, as in my data sets, the current plotting method would simply be unusable.

Edit: here's an example with my current real-world data:

I feel like the point of the plot is more to give an idea of the overall distribution and the edge cases, ideally using the original threshold rather than a rescaled transformation. I also agree with the three objectives you mention. While I believe tagging the observations is slightly less important, I agree it would be a nice addition (either as the default or as an option, since with many observations that too can become unwieldy). Just to confirm: would you tag only the outliers (as is more common) or all observations (as I understand from poolpoint)?

Finally, I know it currently looks odd to have "All data" as the x-axis title, but that is because outliers sometimes need to be checked by group, in which case they can be plotted by group as well, e.g.,

library(rempsyc)

data <- rbind(mtcars[1:4], 42, 55)
data$cyl[33:34] <- 6
plot_outliers(data, group = "cyl", response = "mpg")

Created on 2023-01-23 with reprex v2.0.2

Ideally our plot method would detect if check_outliers has a grouped attribute and adjust the plot accordingly.
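
For reference, group-wise detection itself can be sketched in base R as below. I am using robust z-scores (median and MAD) within each cyl group and a cutoff of 3. Both are assumptions for illustration and not necessarily what plot_outliers or check_outliers do by default.

```r
data <- rbind(mtcars[1:4], 42, 55)
data$cyl[33:34] <- 6  # put the two artificial rows into the 6-cylinder group

# Robust z-score of mpg computed separately within each cyl group
data$z_mpg <- ave(data$mpg, data$cyl,
                  FUN = function(v) (v - median(v)) / mad(v))

# Flag observations beyond the assumed cutoff of 3
data$outlier <- abs(data$z_mpg) > 3

# Flagged row names, listed by group
split(rownames(data)[data$outlier], data$cyl[data$outlier])
```

A plot method that knows about the grouping could then draw one panel or one x-axis position per group, each with its own centre and spread.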

If you give me the green light to add the poolpoints to the dot-violin plot and add it as a new plotting method (at least for zscore methods), I'll go ahead and start working on that.

@rempsyc
Member Author

rempsyc commented Mar 19, 2023

Reviewer 2 in easystats/performance#544:

It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.

So I guess we have no choice but to change the plotting methods now!! So @DominiqueMakowski @IndrajeetPatil do you approve of the above after all?

@strengejacke
Member

For the paper, we can just add ggplot-layers to get the plot we want, unless these are very quick changes. I wouldn't spend too much time on coding, if not really required for the review.
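
A minimal sketch of what adding ggplot layers could look like for the overlapping x-axis labels. The bar chart below is a toy stand-in built with ggplot2 directly, not the actual see plot, and it assumes a discrete x-axis. If the real plot uses a continuous axis, the scale call would need to change accordingly.

```r
library(ggplot2)

# Toy stand-in for the outlier bar chart: one bar per observation
data <- rbind(mtcars[1:4], 42, 55)
d <- mahalanobis(data, colMeans(data), cov(data))
df <- data.frame(id = factor(seq_along(d)), distance = d)

p <- ggplot(df, aes(x = id, y = distance)) +
  geom_col()

# Layers added after the fact: drop overlapping labels and shrink the font
p_fixed <- p +
  scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
  theme(axis.text.x = element_text(size = 6))

p_fixed
```

Both tweaks address the reviewer's two suggestions (smaller font, or fewer labels) without touching the plotting method itself.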
