New plot methods for check_outliers (?) #262
Comments
I agree that the bar chart might not be the prettiest, BUT it has one major advantage over the other suggestions: it shows the row name, which makes it easy to identify which observation is problematic, which is the goal of check_outliers. The plots of check_outliers should aim at achieving 3 objectives:
I like the idea of a dot-violin plot. It does look better, I think; but perhaps we could further improve it by labelling the points, for example using pool points if the row names are numerical or gglabels if they are characters?
True, but
Edit: here's an example with my current real-world data:
I feel like the point of the plot is more to give an idea of the overall distribution and the edge cases, while ideally using the original threshold and not a rescale transformation. And I agree with the three objectives you mention too. While I believe tagging them is slightly less important, I agree it would be a nice addition (either as a default or as an option; with many observations that too can become unwieldy). I just want to confirm with you: would you only tag outliers (as is more common) or all observations (as I understand from poolpoint)?
Finally, I know right now it looks weird to have "All data" as the x-axis title, but this is because outliers need to be checked by group if need be, so they can be plotted as such as well, e.g.,
library(rempsyc)
data <- rbind(mtcars[1:4], 42, 55)
data$cyl[33:34] <- 6
plot_outliers(data, group = "cyl", response = "mpg")
Created on 2023-01-23 with reprex v2.0.2
Ideally our plot method would detect if
If you give me the green light to add the poolpoints to the dot-violin plot and add it as a new plotting method (at least for zscore methods), I'll go ahead and start working on that.
Reviewer 2 in easystats/performance#544
So I guess we have no choice but to change the plotting methods now!! So @DominiqueMakowski @IndrajeetPatil do you approve of the above after all?
For the paper, we can just add ggplot-layers to get the plot we want, unless these are very quick changes. I wouldn't spend too much time on coding, if not really required for the review.
This is regarding the check_outliers() paper for the journal Mathematics (easystats/performance#544). I wonder if we should add new plot methods to include in the article submission (deadline is Feb 23). I explain in detail below.
Model-based outliers
For model-based outliers, see has an awesome plotting method:
Created on 2023-01-20 with reprex v2.0.2
Multiple methods
For multiple methods, we have no choice but to standardize the distance scores if we want to plot them on the same scale so I think the current solution is pretty satisfying.
Created on 2023-01-20 with reprex v2.0.2
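To make the point about a shared scale concrete, here is a minimal base-R sketch of rescaling two distance-like scores to 0-1 so they can sit on one axis. The helper `rescale01()` is hypothetical and written only for this illustration; it is not necessarily how check_outliers rescales its scores internally.

```r
# Hypothetical helper (not part of any package): min-max rescale to 0-1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # constant input: nothing to spread out
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- mtcars$mpg

# Two distance-like scores that live on different raw scales
scores <- data.frame(
  zscore = abs(x - mean(x)) / sd(x),
  zscore_robust = abs(x - median(x)) / mad(x)
)

# After rescaling, every column spans 0-1 and can share one axis
scores_rescaled <- as.data.frame(lapply(scores, rescale01))
rownames(scores_rescaled) <- rownames(mtcars)
```

The trade-off is the one raised elsewhere in this thread: once rescaled, the original thresholds no longer apply directly, which is why a single-method plot may prefer the raw distances.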
Multivariate methods
For a single multivariate method, I think it is ok-ish. Could be a lot of work to do a custom plotting method for each multivariate method so I think this is fine. But the x-axis is hard to read since the numbers overlap (so imagine with big data sets).
Created on 2023-01-20 with reprex v2.0.2
For the Mahalanobis method specifically, one colleague believes their own custom plot is more useful (and would be happy to see it implemented within easystats):
Created on 2023-01-20 with reprex v2.0.2
Should we have something like that for method = mahalanobis and similar ones? The guiding principle could be: plotting the distance of individual observations + a line at the chosen threshold. If we do this, it might not be that much work since the actual distances and thresholds are already accessible as attributes, so it would make for very consistent plotting.
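The guiding principle proposed for distance-based methods (individual distances plus a line at the chosen threshold) could look roughly like the base-R sketch below. It computes the distances directly with `stats::mahalanobis()` instead of reading them from a check_outliers object, and the chi-squared cutoff at 0.999 is an assumption made for illustration, not necessarily the package default. The choice of columns is arbitrary too.

```r
# Numeric columns to screen (an arbitrary choice for illustration)
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Squared Mahalanobis distance of each row from the multivariate centre
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Assumed cutoff: chi-squared quantile with df = number of variables
threshold <- qchisq(0.999, df = ncol(dat))

is_outlier <- d2 > threshold

# One point per observation, dashed line at the threshold
plot(
  d2,
  pch = 19,
  col = ifelse(is_outlier, "red", "grey40"),
  xlab = "Observation",
  ylab = "Squared Mahalanobis distance"
)
abline(h = threshold, lty = 2)

# Label only the flagged points so the plot stays readable
if (any(is_outlier)) {
  text(which(is_outlier), d2[is_outlier],
       labels = rownames(dat)[is_outlier], pos = 3)
}
```

Labelling only the flagged observations keeps row names visible where they matter without the overlapping axis labels mentioned above.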
Univariate methods
Let me give you another example of mine for univariate outliers. Currently, we have the same boring plot for method = zscore_robust, for instance.
Created on 2023-01-20 with reprex v2.0.2
But I was imagining that perhaps it would be useful to use something like this for zscores:
Created on 2023-01-20 with reprex v2.0.2
And something similar for robust zscores, but for several variables we could also wrap it in a panel:
Created on 2023-01-20 with reprex v2.0.2
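As a rough illustration of what a robust z-score plot would be built on, the sketch below computes median/MAD-based scores on the toy data used earlier in this thread and flags values beyond a cutoff. The helper `robust_z()` and the cutoff of 3 are assumptions for this example only; they are not taken from the package.

```r
# Hypothetical helper: robust z-score based on the median and the MAD
robust_z <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

# Toy data from earlier in the thread: mtcars plus two extreme rows
data <- rbind(mtcars[1:4], 42, 55)

z <- robust_z(data$mpg)
threshold <- 3  # assumed cutoff, for illustration only

# Row positions whose absolute robust z-score exceeds the cutoff
flagged <- which(abs(z) > threshold)

# Plot on the original z scale, with lines at +/- threshold
plot(z, pch = 19, xlab = "Observation", ylab = "Robust z-score (mpg)")
abline(h = c(-threshold, threshold), lty = 2)
```

Plotting on the z scale itself, without rescaling, lets the threshold lines keep their original meaning.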
Lakens's Method
Edit: forgot to add this other example: Alternatively, we have the plot outlier method from the Daniel Lakens's outliers paper (Leys et al. (2019)).
Created on 2023-01-20 with reprex v2.0.2
Challenges
One possible challenge for univariate methods is when they are applied to several columns. In that case, the proposed solution will not work since the rescaled score (0-1) is an aggregate of the scores of each column (for single multivariate methods that would not be a problem by definition). So we could implement this when a single method + a single column are selected? Unless of course we use lapply with see::plots like in the last example.
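One way to read the lapply idea is to keep one score per column and skip aggregation altogether, so that each column gets its own panel. The sketch below only builds the per-column data; the helper `robust_z()`, the column selection, and the cutoff of 3 are all assumptions for illustration.

```r
# Hypothetical helper: robust z-score based on the median and the MAD
robust_z <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

cols <- c("mpg", "disp", "hp")  # arbitrary columns for illustration
threshold <- 3                  # assumed cutoff

# One data frame per column, each keeping its own (non-aggregated) scores
per_column <- lapply(cols, function(col) {
  z <- robust_z(mtcars[[col]])
  data.frame(
    row = rownames(mtcars),
    variable = col,
    z = z,
    outlier = abs(z) > threshold
  )
})
names(per_column) <- cols
```

Each element could then be drawn as its own plot and the resulting list combined into a single panel, which is the role the thread has in mind for the see package.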