List of new potential statistical modules #1085

m7pr · 2024-02-27T11:00:25Z

m7pr
Feb 27, 2024
Collaborator

Motivation

Expanding the set of statistical analysis modules within the teal framework holds immense promise for both researchers and practitioners in the medical field. By augmenting our existing toolkit with a broader array of statistical modules, we directly address the evolving needs and complexities encountered in modern clinical research. One of the primary motivations behind this expansion is to enhance the versatility and comprehensiveness of teal, enabling users to tackle a wider range of research questions and study designs with greater precision and efficiency.

New Modules

Clinical Analysis
- accelerated failure time models (aka AFT models) - when the Cox model's assumptions are not met, survival time needs a different type of analysis; the AFT family offers several parametric models for this, such as Weibull, exponential, and log-logistic models,
- multivariable Cox regression model - currently tm_t_coxreg() provides analysis for a single variable; we can extend that to a multi-variable model,
- competing risks analysis - an extension of the Cox model, relevant when studying events where multiple outcomes are possible,
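
To make the AFT idea concrete, here is a minimal sketch of the simplest member of the family: an exponential AFT model with one binary covariate and right-censored data. This is not teal or tm_t_coxreg() code; the function name, the data, and the two-group setup are all assumptions made for illustration. In this special case the maximum-likelihood estimate has a closed form (total follow-up time divided by the number of observed events, per group), so no optimizer is needed.

```python
import math


def fit_exponential_aft(times, events, group):
    """Closed-form MLE for T = exp(b0 + b1 * group) * E, with E ~ Exponential(1).

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if right-censored
    group  : 0 or 1 (hypothetical binary covariate, e.g. treatment arm)
    """
    total_time = {0: 0.0, 1: 0.0}
    n_events = {0: 0, 1: 0}
    for t, d, g in zip(times, events, group):
        total_time[g] += t
        n_events[g] += d
    if n_events[0] == 0 or n_events[1] == 0:
        raise ValueError("each group needs at least one observed event")

    # For exponential data with censoring, the MLE of the mean survival
    # time in a group is (total time at risk) / (number of events).
    mean0 = total_time[0] / n_events[0]
    mean1 = total_time[1] / n_events[1]
    return {
        "intercept": math.log(mean0),               # b0
        "log_time_ratio": math.log(mean1 / mean0),  # b1
        "time_ratio": mean1 / mean0,                # acceleration factor exp(b1)
    }


# Made-up data, for illustration only.
times = [2.0, 3.0, 4.0, 1.0, 6.0, 8.0, 10.0, 12.0]
events = [1, 1, 0, 1, 1, 0, 1, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]
fit = fit_exponential_aft(times, events, group)
print(fit)
```

With this toy data the time ratio comes out at about 3.6, which reads as "survival times in group 1 are stretched by a factor of roughly 3.6 relative to group 0". Weibull and log-logistic AFT models need numerical optimization and would not reduce to a closed form like this.
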
New Classification Modules
- logistic regression (already implemented),
- probit regression,
- SVM,
- decision trees & random forests,
- k-NN,
- naive Bayes,
- neural networks,
- LDA+QDA,
- Ensemble Models (Bagging / AdaBoost)
Clustering Modules
- K-means clustering,
- Agnes clustering,
- NMF (Non-negative Matrix Factorization) clustering,
- Daisy clustering,
- Hierarchical clustering
Other Dimensionality Reduction Techniques
- PCA (already implemented),
- t-SNE,
- SVD,
- Autoencoders,
- NMF (Non-negative Matrix Factorization),
- Factor Analysis
Statistical Tests
- Student's t-test,
- Wilcoxon Rank-Sum (Mann-Whitney U) test,
- Kolmogorov–Smirnov test,
- Chi-square test,
- Kruskal-Wallis test,
- Fisher's exact test,
- tests for Correlation detection/testing
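
As a rough illustration of what one of the listed tests computes, below is a sketch of the Wilcoxon rank-sum (Mann-Whitney U) test in plain Python. It is only a sketch under stated assumptions: the function names and data are made up, ties get average ranks, and the p-value uses the large-sample normal approximation without a continuity correction or a tie correction to the variance. A real module would rely on an established implementation instead.

```python
import math


def midranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic for x, z score, two-sided normal-approximation p-value)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, z, p_two_sided


# Made-up measurements for two hypothetical arms.
treated = [1.1, 2.3, 2.9, 3.4]
control = [3.8, 4.1, 5.0, 6.2, 7.7]
u, z, p = mann_whitney_u(treated, control)
print(u, z, p)
```

For samples this small an exact p-value would normally be preferred over the normal approximation; the approximation is used here only to keep the sketch short.
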

Summary

Using multiple statistical models for investigating the same research hypothesis offers several compelling advantages that can enhance the robustness, reliability, and interpretability of study findings:

diverse perspectives
- Different statistical models often make distinct assumptions about the underlying data distribution, relationships between variables, and model complexity. By employing multiple models, researchers gain insights from various perspectives, allowing them to explore the hypothesis comprehensively and capture nuances that may be overlooked by a single model.
validation and sensitivity analysis
- Employing multiple models facilitates cross-validation and sensitivity analysis, wherein the consistency of results across different methodologies is assessed. Consistent findings across diverse models strengthen confidence in the validity of results, whereas discrepancies may highlight areas of uncertainty or model dependence that warrant further investigation.
risk mitigation
- No statistical model is immune to assumptions or limitations, and the choice of model may influence study conclusions. Using multiple models mitigates the risk associated with relying solely on one approach, as it reduces the impact of model-specific biases or deficiencies. If results are consistent across different models, it provides greater assurance that findings are robust and not driven solely by idiosyncrasies of a particular modeling framework.
comprehensive inference
- Each statistical model may provide unique insights or additional information about the research hypothesis. For instance, while one model may focus on estimating mean differences between groups, another may emphasize the strength of associations or predictive performance. Integrating findings from multiple models enriches the understanding of the phenomenon under investigation, enabling a more comprehensive interpretation of results.
communication and transparency
- Presenting results from multiple statistical models promotes transparency and facilitates communication with diverse stakeholders, including peers, reviewers, and policymakers. It allows researchers to convey the uncertainty inherent in statistical modeling and acknowledge the potential variability in findings, fostering a more nuanced and informed discussion of study implications.

In summary, utilizing multiple statistical models for investigating the same research hypothesis offers numerous benefits, including a more comprehensive understanding of the phenomenon under study, enhanced validation and sensitivity analysis, risk mitigation against model-specific biases, and improved communication of findings. Embracing this approach promotes rigor, robustness, and confidence in research outcomes, ultimately advancing scientific knowledge and informing evidence-based decision-making.
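
The sensitivity-analysis point can be sketched in a few lines: test the same hypothesis (no difference between two groups) with two different statistics and check whether the conclusions agree. The example below uses an exact permutation test, enumerating every possible split of the pooled data, once with a difference in means and once with a difference in medians. Everything here (function names, data, the choice of statistics) is an assumption for illustration and not part of teal.

```python
from itertools import combinations
from statistics import mean, median


def exact_permutation_p(x, y, statistic):
    """Two-sided exact permutation p-value for statistic(x, y).

    Enumerates every way to split the pooled data into groups of the
    original sizes, so it is only feasible for small samples.
    """
    pooled = list(x) + list(y)
    observed = abs(statistic(list(x), list(y)))
    hits = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(statistic(group_x, group_y)) >= observed - 1e-12:
            hits += 1
    return hits / total


def mean_diff(a, b):
    return mean(a) - mean(b)


def median_diff(a, b):
    return median(a) - median(b)


# Made-up measurements for two hypothetical arms.
treated = [1.1, 2.3, 2.9, 3.4]
control = [3.8, 4.1, 5.0, 6.2, 7.7]

p_values = {
    "mean_diff": exact_permutation_p(treated, control, mean_diff),
    "median_diff": exact_permutation_p(treated, control, median_diff),
}
print(p_values)
```

If the two p-values point in the same direction, that agreement is the kind of consistency described above; if they diverge, the result depends on the choice of statistic and deserves a closer look.
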

m7pr · 2024-02-27T11:00:59Z

m7pr
Feb 27, 2024
Collaborator Author

CC @lcd2yyz @shajoezhu @donyunardi @pawelru @edelarua @Melkiades

shajoezhu · 2024-02-27T11:39:42Z

shajoezhu
Feb 27, 2024
Collaborator

fyi @khatril @telepath37

Melkiades · 2024-02-27T14:54:31Z

Melkiades
Feb 27, 2024
Collaborator

Thanks @m7pr for proposing this. I have used many of these (mainly clustering and dimensionality reduction), and I think many need to be correctly contextualized. For example, t-SNE is a powerful exploratory tool that would make an amazing variable explorer, but it works only on certain distributions and is not something you can use for general purposes (e.g. as a dimensionality transform), because it can distort the data by an arbitrary amount; hence it is not statistically sound (it is not an invertible transformation). Also, neural networks and regressions overlap significantly in their mathematics and respond to the same general need: learning trends from data to be applied to unseen data. I guess I am wondering how the current analysis of clinical data reflects the application contexts of some of these methods.

In other words, we need clear case studies and examples of these methods in the context of clinical trials; many of the methods listed have diverse applications that might not suit clinical data per se. Taking t-SNE as an example again, it would be amazing to have a pop-up that shows you the distribution, and maybe a Voronoi clustering, but it would not be a reliable statistic. So on one side we would spend a lot of time getting it to work (it is also computationally expensive for certain parameters), while on the other side it might not be very useful to our stakeholders, who may not know what it is or how to use it for clinical studies. We probably need to study how each method is currently used in our context and then see whether we can make a use case.

Melkiades · 2024-02-27T14:58:50Z

Melkiades
Feb 27, 2024
Collaborator

Btw, we should have all of the statistical tests, for different clinical study settings. Of course, if you meant all of this as an additional suite of on-demand tools, it makes complete sense to leverage teal + one package per method (roughly) on any data the user wants to analyze.

m7pr · 2024-02-27T15:54:15Z

m7pr
Feb 27, 2024
Collaborator Author

Hey @Melkiades, I'm not saying this list is complete or that every item on it should stay there. It's just a list of popular tools used in statistical machine learning. I neither discourage nor promote any of them. Nor do I claim to know every clinical data analysis department around the globe, or to be able to tell which method is the most usable, which is the most popular, or which makes the most sense. Probably every researcher focuses on different aspects of data investigation, specifically tied to their hypothesis and their study.

From my experience, the most popular method is not necessarily the most appropriate one. Many studies use the Cox proportional hazards model and survival curves (Kaplan-Meier estimates of a survival function) just because other researchers do, not necessarily because this is the most suitable tool for a particular dataset and a specific hypothesis.

I personally saw all those methods being applied to clinical data, when I worked for The Cancer Genome Atlas Program.

Every method has its own purpose and its own assumptions. I'm not gonna get into the details of whether t-SNE is the most usable method, whether neural networks overlap with regression, or whether a single regression is just a simple case of a neural network. It's up to the researcher to learn the fundamentals of the methods and to use them with caution. What we can do on our end is make multiple methods available in modules, as writing the modules is the hardest part.

I agree that we can rank the list from the most desired to the least desired methods and then potentially implement them in that order. I also think we can promote teal for use on data outside the clinical trials area: insightsengineering/teal.gallery#149
