Do Distinct Subgroups of Patients Exist in the Knee Osteoarthritis Cohort?

tjl0005/Osteoarthrtis-Clustering-Anaylsis

Motivation and rationale

The Context:

Osteoarthritis is a very common disease that causes cartilage to break down within the joints, which in turn causes joint space narrowing. Although it is very common, there is still a limited understanding of how the disease progresses. One of the main consequences of this lack of understanding is the inability to limit the disease's progression and the rate at which cartilage breaks down. Osteoarthritis is the most common form of arthritis and can occur due to a multitude of causes: age, obesity, joint injury and family history. With an ageing population, the disease is only becoming more common.

Problem

The main purpose of this project is to try to find distinct subgroups of patients within the knee osteoarthritis cohort of the Osteoarthritis Initiative (OAI). This is only a hypothesis, as it is unknown whether distinct groups exist within the data. Clustering of OAI data has been performed successfully before, and the findings were theorised to be potentially useful for various reasons. Clustering is the method used to look for these subgroups because, as mentioned, it is unknown whether they exist; classification (another method for categorising) cannot be used because the labels are unknown. In this project a variety of clustering types will be explored, as well as various processing approaches to prepare the data for clustering.

Aim and objectives

Aim: Analyse data from patient progression profiles to find common subtypes of the disease and visualise the findings

Objectives:

  1. Understand the data present in the progression profiles
    The dataset consists of many different measurements, so understanding the purpose and relevance of each measurement prevents mistakes when processing the data and drawing conclusions from the findings.
  2. Effectively process the data for clustering
    The data in its initial state cannot be clustered to find distinct subgroups of patients. A new dataset containing the differences between measurements needs to be produced to show the structural progression of each patient, and this dataset then needs to be dimensionally reduced.
  3. Evaluate clustering algorithm results using metrics
    Clusters cannot be assessed using a single method such as visualisation, because visualisations can be very misleading. An effective approach combining visuals with several metrics needs to be used.
  4. Optimise algorithms by tuning the parameters
    Each algorithm explored has many parameters, all of which affect the clusters produced and their quality. Parameter sweeps will be used in combination with metrics to optimise the algorithms.
  5. Visualise clusters in an informative manner
    Findings need to be presented in a meaningful way. The visualisations will vary for each clustering type, as the types can produce different cluster shapes and contain different kinds of points, such as the centroids in K-Means.
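Objective 2 can be sketched as follows. This is only a minimal illustration on synthetic data: the column names (`joint_space_width`, `cartilage_thickness`, `osteophyte_score`), the two-visit layout and the helper `progression_differences` are assumptions made for the example and are not taken from the OAI dataset. PCA is used here as one possible dimensionality reduction method.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical long-format data: one row per patient per visit.
# Column names and values are invented for illustration only.
rng = np.random.default_rng(0)
n_patients = 50
measure_cols = ["joint_space_width", "cartilage_thickness", "osteophyte_score"]

baseline = pd.DataFrame({
    "patient_id": np.arange(n_patients),
    "visit": 0,
    "joint_space_width": rng.normal(4.0, 0.5, n_patients),
    "cartilage_thickness": rng.normal(2.5, 0.3, n_patients),
    "osteophyte_score": rng.normal(1.0, 0.4, n_patients),
})
follow_up = baseline.copy()
follow_up["visit"] = 1
# Simulate a small average decline between the two visits.
follow_up[measure_cols] += rng.normal(-0.2, 0.1, (n_patients, len(measure_cols)))
visits = pd.concat([baseline, follow_up], ignore_index=True)


def progression_differences(df, cols):
    """Return follow-up minus baseline per patient (one row per patient)."""
    wide = df.pivot(index="patient_id", columns="visit", values=cols)
    first = wide.xs(0, axis=1, level="visit")
    last = wide.xs(1, axis=1, level="visit")
    return last - first


diffs = progression_differences(visits, measure_cols)

# Standardise so no single measurement dominates, then reduce to 2 components.
scaled = StandardScaler().fit_transform(diffs)
reduced = PCA(n_components=2, random_state=0).fit_transform(scaled)
print(diffs.shape, reduced.shape)
```

The reduced array is what would then be passed to the clustering algorithms. The real data may need more than two visits or a different reduction method.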

Background Research

Source: Osteoarthritis: toward a comprehensive understanding of pathological mechanism [4]
Description: This paper provides a lot of background information on osteoarthritis, including its effects and causes. It identifies areas, such as the progression of the disease, that are poorly understood. As a result of this lack of understanding, there is no method to slow the progression of the disease.
Relevance: Reading the paper conveys the seriousness of the disease and reinforces how common it is. Learning about the disease will greatly affect my work as well as my effectiveness in understanding and analysing the data. This paper also shows the need for further research on this topic.

Source: Patients with knee osteoarthritis can be divided into subgroups [5]
Description: This article gives an example where the osteoarthritis cohort can be divided into subgroups using a clustering algorithm. The subgroups showed that patients could be divided by specific clinical characteristics, and that these findings can further the understanding of the disease's development, potentially inspiring improved treatment strategies.
Relevance: This is an example where clustering knee osteoarthritis cohort data was used to further our understanding of the disease and highlight distinct groups. Furthermore, Python was used to produce the clustering with K-means, and the article explains how the algorithm was implemented, including the number of repetitions and how the number of subgroups was chosen. I intend to use K-means, so this methodology is very useful to learn about.

Source: Distinct subtypes of knee osteoarthritis: data from the Osteoarthritis Initiative [6]
Description: This shows, through clustering analysis, that the Osteoarthritis Initiative cohort is a collection of distinct subtypes of osteoarthritis. The clusters showed that different causes can lead to different types of knee osteoarthritis.
Relevance: Reading this furthered my understanding of the disease, showing how clusters can be compared with the data to draw further conclusions. It also mentions various methods that can be used, including K-means, which I intend to use, and other methods such as LCA.

Source: Types of Clustering Algorithms [7]
Description: This discusses four types of clustering algorithm. Each type has its own variants, but the focus will be on the types themselves. The types produce very different visuals of clusters, which shows how different they are and why it is worth exploring different types rather than variants.
Relevance: Clustering will be a big part of the project, and I will produce many variations of clusters using clustering algorithms with various parameters. These are the most common algorithms and will be the focus of my clustering experiments. I intend to use a few of the clustering types explored in this article, one of which is density-based clustering. This type is a good fit here because the data does not have high dimensions and the algorithm does not assign outliers to clusters, which will prevent visualisations from being misleading. I can implement this type of algorithm with Scikit-learn in two ways, DBSCAN and OPTICS. Both produce very similar results, but the latter also produces a reachability graph. This graph could be useful in assessing the effectiveness of the algorithm and could further my understanding of the results.

Source: Clustering Algorithms: A Comparative Approach [8]
Description: This paper gives background information on clustering algorithms and assesses how common types perform. It shows how performance can vary with the type of data tested and the parameters of each algorithm type. It makes three assessments for each type tested: using the default parameters, single parameter changes and random parameter changes.
Relevance: I need to assess various types of clustering algorithm, each with its own parameters, and this paper provides ideas on how to vary the parameters to test the algorithms quickly and uniformly. It also outlines some common clustering methods and explains how they work and what their parameters represent. One of the algorithms outlined is hierarchical clustering, which I intend to use. The paper found this type performed well on smaller datasets, was the fastest tested and had limited performance on larger datasets. There are also two variations, agglomerative and divisive, which are bottom-up and top-down approaches respectively, meaning the hierarchy will either end or start with a single cluster containing every point.

Source: What Makes a Visualization Memorable [9]
Description: This discusses the importance of visualisations and how easily they can fail to convey their meaning. It presents measurements of memorability for various visualisations, showing how effective they are.
Relevance: I will have to visualise my findings and clusters in a meaningful way, by which I mean one where the data can be easily understood. Each of the clustering algorithms will produce different visuals, so I need to ensure that when presenting these findings they stand out from one another.
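The density-based approach discussed for [7] can be sketched with Scikit-learn as below. This is an illustrative example on synthetic blobs rather than OAI data, and the values of `eps` and `min_samples` are arbitrary assumptions that would need tuning on the real dataset. It shows that DBSCAN labels outliers as `-1` instead of assigning them to a cluster, and how the values behind an OPTICS reachability graph can be obtained.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated 2D blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.4,
    random_state=0,
)

# DBSCAN: points that are not density-reachable get the label -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))

# OPTICS: reachability distances in cluster order form the reachability graph.
optics = OPTICS(min_samples=5).fit(X)
reachability = optics.reachability_[optics.ordering_]

print(f"DBSCAN clusters: {n_dbscan_clusters}, noise points: {n_noise}")
print(f"OPTICS reachability values: {len(reachability)}")
```

Plotting `reachability` as a bar or line chart gives the reachability graph, where valleys correspond to dense regions and peaks to the gaps between them.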

Work plan

Done so far:

Firstly, I have done background research on a few of the topics I will encounter throughout the project. The main one is clustering algorithms, which will be a big part of the project and are represented in objectives 3 and 4. I have read an article from Google Developers [7], which summarised some of the common types and visualised example results. From this article I concluded that a density-based algorithm such as DBSCAN or OPTICS would be a good choice, and both are available in Scikit-Learn.

Furthermore, I have looked at a paper discussing the performance of different types [8], which gave in-depth comparisons and experiments on common clustering algorithms. This reinforced the understanding I gained from the prior article, and explained some of the workings of the algorithms and their parameters. As a result, I am confident in using hierarchical clustering, and I will do some brief testing of divisive and agglomerative clustering.
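As a rough sketch of the agglomerative (bottom-up) side of that testing, the example below compares linkage criteria in Scikit-Learn's `AgglomerativeClustering` on synthetic data. The dataset, the choice of three clusters and the use of the silhouette score for comparison are assumptions for illustration only. Scikit-Learn does not, to my knowledge, provide a general divisive implementation; `BisectingKMeans` in recent versions is the closest top-down option.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data; the real input would be the reduced OAI differences.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=1)

# Each linkage defines how the distance between two clusters is measured
# when deciding which pair to merge next.
linkage_scores = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    linkage_scores[linkage] = silhouette_score(X, labels)

for linkage, score in linkage_scores.items():
    print(f"{linkage:>8}: silhouette = {score:.3f}")
```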

I will be using Scikit-Learn because it provides both the clustering algorithms mentioned above and useful metrics for assessing the clusters. I will also be using Pandas and Seaborn, as I have experience with both and they are well suited to the task. Pandas will be used to process the data and prepare it for the algorithms, and Seaborn will be used for visualisations. I chose Seaborn because it is designed to work with Pandas.

Future Plans:

To start with, I will solidify my understanding of the data. This is crucial for the next steps: without it I will be prone to mistakes that delay my progress. Following this, I will begin to integrate the data, producing a new view of it from which I can draw some initial conclusions.

At this point I will begin implementing the clustering algorithms, and I will evaluate the following types: K-means, density-based and hierarchical. I will repeat each experiment, although the number of repetitions is not yet known, as I will not know how long the experiments take until I have begun this phase. I will have to evaluate each algorithm with different parameter configurations, and I intend to perform parameter sweeps for each algorithm type to find the optimal configuration. I will judge a configuration to be appropriate if its results are stable and similar to the results from other algorithm types. Each configuration will be evaluated with the Silhouette Coefficient, Calinski-Harabasz Index and Davies-Bouldin Index. I will be using these metrics because they do not require the ground truth to be known. However, I will have to be aware that each of these metrics tends to favour convex clusters over density-based clusters, such as those produced by DBSCAN.
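A parameter sweep of this kind could look like the sketch below, which varies the number of clusters for K-means and records the three metrics. The synthetic data, the range of `k` and the helper name `sweep_kmeans` are assumptions for illustration. Note that higher is better for the Silhouette Coefficient and Calinski-Harabasz Index, while lower is better for the Davies-Bouldin Index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data; the real input would be the reduced OAI differences.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)


def sweep_kmeans(X, k_values, seed=0):
    """Fit K-means for each k and collect the three internal metrics."""
    results = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        results.append({
            "k": k,
            "silhouette": silhouette_score(X, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        })
    return results


results = sweep_kmeans(X, range(2, 8))
best = max(results, key=lambda r: r["silhouette"])

for r in results:
    print(r)
print("Best k by silhouette:", best["k"])
```

The same loop structure could be reused for `eps` and `min_samples` in DBSCAN or for the linkage in hierarchical clustering, with the caveat that the three metrics may disagree on the best configuration.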

Finally, I will produce high-quality visualisations of the final clusters. These visualisations will show which distinct patient subgroups, if any, exist within the Osteoarthritis Initiative cohort. I will likely have to produce multiple versions of each visualisation to ensure they are informative.

Risks and Mitigation

Each phase of the work depends on the one before it, as shown in the Gantt chart, meaning that if I fall behind on one task, I lose time for all those that follow. The main risk comes from the clustering algorithm tasks, because I will be performing many experiments but will not know how long they take until I begin them. However, I should be able to recover from these delays, as weekends are left unplanned. Furthermore, experiments should become easier to set up as I progress through the phase, so I should gain time as I go. If this issue does occur and I do not believe I can make up the lost time, I can simply reduce the number of experiments.

References

[1] NHS (2019). Overview - Osteoarthritis. NHS. Available at: https://www.nhs.uk/conditions/osteoarthritis/.
[2] CDC (2020). Osteoarthritis (OA). Centers for Disease Control and Prevention. Available at: https://www.cdc.gov/arthritis/basics/osteoarthritis.htm.
[3] Scikit-Learn.org. (2010). 2.3. Clustering — Scikit-Learn 0.20.3 documentation. Available at: https://Scikit-Learn.org/stable/modules/clustering.html.
[4] Chen, D., Shen, J., Zhao, W., Wang, T., Han, L., Hamilton, J.L. and Im, H.-J. (2017). Osteoarthritis: toward a comprehensive understanding of pathological mechanism. Bone Research. Available at: https://doi.org/10.1038/boneres.2016.44
[5] Petersen E.T., Rytter S., Koppens D., Dalsgaard J., Hansen T.B., Larsen N.E., Andersen M.S. and Stilling M. (2022). Patients with knee osteoarthritis can be divided into subgroups based on tibiofemoral joint kinematics of gait. Available at: https://doi.org/10.1016/j.joca.2021.10.011.
[6] Waarsing, J.H., Bierma-Zeinstra, S.M.A. and Weinans, H. (2015). Distinct subtypes of knee osteoarthritis: data from the Osteoarthritis Initiative. Available at: https://doi.org/10.1093/rheumatology/kev100
[7] Google Developers. (2015). Clustering Algorithms | Clustering in Machine Learning. Available at: https://developers.google.com/machine-learning/clustering/clustering-algorithms.
[8] Rodriguez, M.Z., Comin, C.H., Casanova, D., Bruno, O.M., Amancio, D.R., Costa, L. da F. and Rodrigues, F.A. (2019). Clustering algorithms: A comparative approach. Available at: https://doi.org/10.1371/journal.pone.0210236
[9] Borkin, M.A., Vo, A.A., Bylinskii, Z., Isola, P., Sunkavalli, S., Oliva, A. and Pfister, H. (2013). What Makes a Visualization Memorable? IEEE Transactions on Visualization and Computer Graphics. Available at: https://doi.org/10.1109/TVCG.2013.234
