Accountability Accounting, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio for its customers. The company, however, is lost in the vast universe of cryptocurrencies. My job is to create a report that includes what cryptocurrencies are on the trading market and how they could be grouped to create a classification system for this new investment. The data I was given is not in an ideal format for my algorithms, so it will need to be processed to fit the machine learning models. Since there is no known output for what the company is looking for, I will use unsupervised learning.
- Using my knowledge of Pandas, I preprocessed the dataset to perform PCA.
- Using the PCA (Principal Component Analysis) algorithm, I reduced the dimensions of the X DataFrame to three principal components and placed these components into a new DataFrame.
- Next, I clustered the cryptocurrencies using the K-Means algorithm.
- Finally, using scatter plots built with Plotly Express and hvPlot, I visualized the distinct groups in the space of the three principal components.
- Software: Python 3.9.7 and Jupyter Notebook
- Data: crpto_data.csv
- K-Means: The K-means algorithm groups the data into K clusters, where belonging to a cluster is based on some similarity or distance measure to a centroid.
- PCA: PCA is a statistical technique that can speed up machine learning algorithms when the number of input features (or dimensions) is too high. It reduces the number of dimensions by transforming a large set of variables into a smaller one that retains most of the information in the original set.
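
As a rough illustration of the PCA step described above, the sketch below reduces a feature matrix to three principal components with scikit-learn. The synthetic data, the `feature_*` column names, the `pcs_df` name, and the use of `StandardScaler` are assumptions made for this example and are not taken from the project notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix X (hypothetical data).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(50, 8)),
    columns=[f"feature_{i}" for i in range(8)],
)

# Standardize the features so that no single column dominates the components
# (assumed preprocessing step).
X_scaled = StandardScaler().fit_transform(X)

# Reduce the eight input dimensions to three principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

# Place the components into a new DataFrame that shares the index of X.
pcs_df = pd.DataFrame(components, columns=["PC 1", "PC 2", "PC 3"], index=X.index)

# Share of the total variance retained by each component.
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute gives a quick check on how much information the three components keep relative to the original features.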
After removing non-tradable currencies, null values, and the "IsTrading" column, I created a new DataFrame that holds all of the crypto names. The new DataFrame has 532 rows, one for each of the 532 cryptocurrencies that were tradable on the market at that time.
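
The cleaning steps above might look roughly like the following in Pandas. This is a minimal sketch on a toy DataFrame: the rows and index labels are made up, and the one-hot encoding with `pd.get_dummies` is an assumed way of turning the text columns into numeric features, not a confirmed detail of the notebook.

```python
import pandas as pd

# Toy rows standing in for the real CSV (all values are made up).
raw_df = pd.DataFrame(
    {
        "CoinName": ["Alpha", "Beta", "Gamma", "Delta"],
        "Algorithm": ["Scrypt", "SHA-256", "Scrypt", "X11"],
        "IsTrading": [True, False, True, True],
        "ProofType": ["PoW", "PoW", "PoS", "PoW/PoS"],
        "TotalCoinsMined": [100.0, 50.0, None, 2500.0],
        "TotalCoinSupply": [1000.0, 500.0, 300.0, 5000.0],
    },
    index=["ALP", "BET", "GAM", "DEL"],
)

# Keep only coins that are currently trading, then drop the flag column.
crypto_df = raw_df[raw_df["IsTrading"]].drop(columns="IsTrading")

# Drop every row that contains at least one null value.
crypto_df = crypto_df.dropna()

# Store the coin names separately, keeping the same index for later joins.
coin_names_df = crypto_df[["CoinName"]]
crypto_df = crypto_df.drop(columns="CoinName")

# Assumed step: one-hot encode the text columns to get a numeric matrix X.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"])

print(coin_names_df)
print(X.shape)
```

Keeping the names in a separate DataFrame with the same index makes it straightforward to attach them again after clustering.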
Then I used the K-means algorithm to cluster the cryptocurrencies using the PCA data. The following steps took place:
- An elbow curve was created using hvPlot to find the best value for K.
- Predictions were made on the K clusters of the cryptocurrencies’ data.
- A new DataFrame was created with the same index as the crypto_df DataFrame and had the following columns: Algorithm, ProofType, TotalCoinsMined, TotalCoinSupply, PC 1, PC 2, PC 3, CoinName, and Class.
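
The steps listed above can be sketched as follows with scikit-learn. The three synthetic blobs stand in for the real PCA output, and `k_values`, `elbow_df`, and `clustered_df` are hypothetical names; the choice of `n_clusters=3` suits this toy data and is not read off the project's actual elbow curve.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA DataFrame: three well-separated blobs.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"])

# Elbow curve data: fit K-means for several k and record the inertia.
k_values = list(range(1, 8))
inertia = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs_df).inertia_
    for k in k_values
]
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
# With hvPlot installed, `import hvplot.pandas` followed by
# elbow_df.hvplot.line(x="k", y="inertia") would draw the curve.

# Fit the final model and store the predicted cluster in a "Class" column.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
clustered_df = pcs_df.copy()
clustered_df["Class"] = model.fit_predict(pcs_df)

print(elbow_df)
print(clustered_df["Class"].value_counts())
```

The best K is usually read off where the inertia curve stops dropping sharply; on real data that bend can be much less pronounced than on this toy example.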
Finally, I created a new table with the tradable cryptocurrencies. The total number of tradable cryptocurrencies is 532.
As you can see from the final 2D scatter plot, most of the clusters overlap and do not form the distinct groups we had hoped for. That's why I also created a 3D graph to help visualize each group better, showing three distinct groups that correspond to the three clusters we expect the model to break the data into.
- The 3D scatter plot can be rotated by clicking and dragging with the mouse, and zoomed with the scroll wheel. Try hovering over a point to see information such as the value of each principal component.