You can view the live demo here.
One of the intriguing challenges in the field of astronomy is the prediction of asteroid diameters. Asteroids, celestial bodies orbiting the Sun, come in various shapes and sizes, making the estimation of their diameters a complex task. Over the years, numerous methods and approaches have been developed to tackle this challenge, each striving to outperform its predecessors. I have decided to step outside of my domain into the realm of asteroids, aiming to leverage data-driven algorithms to predict their diameters with reasonable accuracy. The project not only contributes to the field of space science (lol) but also demonstrates the potential of machine learning in solving complex problems in alien domains.
The primary objective of this project is to address the challenging task of asteroid diameter prediction, with the key intention of benchmarking against a prior work titled Prediction of Asteroid Diameter with the Help of Multi-Layer Perceptron Regressor by Victor Basu. Basu's work applies various machine learning algorithms, including XGBoost, Random Forest, AdaBoost and Multi-Layer Perceptron Regressor. Performance evaluation metrics such as mean absolute error, mean squared error, and R-squared score are used to assess the effectiveness of these models.
While I have familiarized myself with the data description, it's worth noting that the column names may not immediately convey their meanings to anyone reading the notebook for the first time. To enhance clarity and facilitate a deeper understanding of the dataset, concise and informative summaries have been included for each column. This addition aims to provide a clearer overview of the dataset, ensuring that anyone can grasp the context and insights more readily.
Column Name | Kaggle Description | Additional Description |
---|---|---|
full_name | Object's full name/designation | Contains the complete designation or name of celestial objects in the dataset, serving as a unique identifier. |
a | Semi-major axis (au) | Represents the size of the object's orbit around the Sun in astronomical units (au). |
albedo | Geometric albedo | Reflectivity of the object's surface, indicating how much sunlight it reflects. |
e | Eccentricity | Indicates how elliptical or circular the object's orbit is, with values close to 1 indicating high eccentricity. |
i | Inclination (deg) | Angle describing the tilt of the object's orbit relative to the solar system's plane. |
q | Perihelion distance (au) | Closest distance between the object and the Sun during its orbit, measured in astronomical units. |
ad | Aphelion distance (au) | Farthest distance between the object and the Sun during its orbit, measured in astronomical units. |
per_y | Orbital period | Time taken for the object to complete one orbit around the Sun, measured in years. |
data_arc | Data arc-span (d) | Duration over which observational data has been collected for the object, measured in days. |
condition_code | Orbit condition code | Code indicating the quality and reliability of the object's orbital data. |
n_obs_used | Number of observations used | Number of observational data points used to calculate the object's orbital parameters. |
H | Absolute Magnitude parameter | Measure of the object's intrinsic brightness. A lower H means a brighter object, which for a given albedo means a larger one. |
diameter | Diameter of asteroid (Km) | Physical size of the asteroid, measured in kilometers. |
rot_per | Rotation Period (h) | Time taken for the object to complete one full rotation around its axis, measured in hours. |
neo | Near Earth Object | Indicates whether the object is classified as a Near Earth Object (NEO), with orbits in close proximity to Earth. |
pha | Physically Hazardous Asteroid | Identifies whether the object is classified as a Potentially Hazardous Asteroid (PHA), which is the standard expansion of the acronym, with the potential to pose a threat to Earth. |
moid | Earth Minimum orbit Intersection Distance (au) | Quantifies the closest approach of the object's orbit to Earth's orbit, providing information about potential close encounters with our planet. |
... | ... | ... |
The full column description can be found in the notebook. While performing data visualization, I tried to answer the question of which feature(s) correlate with asteroid diameter, along with other related questions.
Most asteroids seem to have a diameter between 2 km and 5 km, while a few reach up to 939 km. To keep these huge outliers from squashing the plot, the data was capped using Tukey's method before plotting.
Capping the data before plotting allows us to see the underlying distribution of the data, which is right-skewed (positively skewed). Right-skewed here means that the majority of the data points are clustered on the left side of the distribution, with some larger values on the right side pulling the mean to the right.
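The capping step can be sketched as follows. This is a minimal illustration, assuming the usual Tukey fences at 1.5 times the interquartile range; the function name `tukey_cap`, the multiplier `k` and the sample values are hypothetical, and the notebook may compute its quartiles differently.

```python
import statistics


def tukey_cap(values, k=1.5):
    """Clip values to Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles of the sample; the "inclusive" method treats the data
    # as containing its own minimum and maximum.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences are pulled back to the nearest fence.
    capped = [min(max(v, lower), upper) for v in values]
    return capped, (lower, upper)


# Hypothetical diameters in km; 939.0 stands in for the huge outlier.
diameters = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 939.0]
capped, (lower, upper) = tukey_cap(diameters)
print(capped, lower, upper)
```

Capping like this only changes what is drawn. The outlier stays in the sample at the fence value instead of being dropped, so the skew remains visible.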
The semi-major axis of an asteroid is one-half of the major axis of its elliptical orbit. It is measured in astronomical units and equals the average of the perihelion and aphelion distances, so it describes the typical size of the orbit rather than the object's distance from the Sun at any given moment. From the scatter plot below, the data points are concentrated in certain areas and all I see here is a weak correlation.
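As a small worked example of how the orbit columns relate, the semi-major axis and eccentricity can be recovered from the perihelion and aphelion distances (`q` and `ad` in the table above). The helper names and the sample orbit are hypothetical.

```python
def semi_major_axis(q, ad):
    """Semi-major axis (au): half the sum of perihelion and aphelion distances."""
    return (q + ad) / 2


def eccentricity(q, ad):
    """Orbital eccentricity from perihelion q and aphelion ad (both in au)."""
    return (ad - q) / (ad + q)


# Hypothetical orbit: perihelion at 2.0 au, aphelion at 3.0 au.
a = semi_major_axis(2.0, 3.0)
e = eccentricity(2.0, 3.0)
print(a, e)
```

Because `a`, `e`, `q` and `ad` are tied together this way, these columns carry overlapping information.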
Another feature that may help estimate diameter is the minimum orbit intersection distance (MOID). MOID is a measure used in astronomy to assess potential close approaches and collision risks between astronomical objects. Here it quantifies the closest approach of the object's orbit to Earth's orbit, providing information about potential close encounters with our planet.
The plot shows three clear clusters with a somewhat linear trend between them. Asteroids in the first cluster seem to have a relatively small diameter. Though we can roughly conclude that asteroids in the third cluster have a larger diameter, it is worth keeping in mind that many asteroids in the second cluster have a significantly large diameter despite their lower MOID. My takeaway is that a clustering algorithm such as K-Means could probably capture this relationship.
These insights demonstrate the advantage of doing EDA rather than looking only at correlation coefficients or a scatter matrix.
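The clustering idea can be sketched with a tiny one-dimensional K-Means (Lloyd's algorithm) written in plain Python. This is only an illustration under assumptions: the function `kmeans_1d`, its deterministic initialization and the MOID values are hypothetical, and in a notebook one would more likely reach for scikit-learn's `KMeans`.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on a single feature; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: pick initial centroids spread across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged, nothing moved
            break
        centroids = updated
    return labels, centroids


# Hypothetical MOID values (au) forming three loose groups.
moid = [0.02, 0.05, 0.08, 0.9, 1.0, 1.1, 1.2, 2.8, 3.0, 3.2]
labels, centroids = kmeans_1d(moid, k=3)
print(labels, centroids)
```

The resulting cluster label could then be added as an extra categorical feature, so that a model can learn a different diameter trend within each MOID cluster.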
The performance of all models used in the notebook is given below:
Model | R2 Score | Adjusted R2 Score | RMSE | MAE |
---|---|---|---|---|
Random Forest Regressor | 0.961945 | 0.961905 | 0.473486 | 0.305354 |
LightGBM Regressor | 0.961883 | 0.961843 | 0.473868 | 0.310384 |
XGBoost Regressor | 0.960489 | 0.960447 | 0.482460 | 0.318531 |
K-Nearest Neighbors | 0.861554 | 0.861408 | 0.903108 | 0.623587 |
Linear Regression | 0.830079 | 0.829900 | 1.000515 | 0.713975 |
Ridge Regression | 0.829130 | 0.828949 | 1.003305 | 0.716889 |
Elastic Net | -0.000093 | -0.001148 | 2.427280 | 1.824157 |
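For reference, the four metrics in the table can be computed as in the sketch below. The function name `regression_report`, the parameter `n_features` and the sample values are hypothetical, and the adjusted R2 uses the common formula 1 - (1 - R2)(n - 1)/(n - p - 1); whether the notebook uses exactly this form is an assumption.

```python
import math


def regression_report(y_true, y_pred, n_features):
    """Return R2, adjusted R2, RMSE and MAE for a set of predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Penalize R2 for the number of predictors p = n_features.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 4.0]
report = regression_report(y_true, y_pred, n_features=2)

# A model that always predicts the mean of the targets scores R2 = 0.
baseline = regression_report(y_true, [3.0] * 5, n_features=2)
print(report, baseline)
```

The baseline case is a useful reading aid: an R2 of roughly zero with a slightly negative adjusted R2, as in the Elastic Net row, is what a model that predicts close to the mean for every asteroid would produce.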
In actuality, it is not feasible to directly compare my results to those in the research paper since we do not use the same test set. However, when I compared my results with those from other people's notebooks on Kaggle, I was able to achieve a better result just by doing a little data cleaning and feature engineering. I do not 100% trust these values though. Perhaps some columns in our training data leak information about the target. Since I do not see anything related to this in the research paper, I have simply gone with these values.
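One possible source of such leakage, offered here as an assumption and not as something verified in the notebook, is the pair of columns `H` and `albedo`. A standard relation in asteroid photometry estimates diameter from exactly these two quantities, and catalogued albedo values are often themselves derived from a measured diameter. The function name below is hypothetical.

```python
import math


def diameter_from_h_albedo(h_mag, albedo):
    """Approximate diameter (km) from absolute magnitude H and geometric albedo."""
    # Standard relation: D = 1329 / sqrt(albedo) * 10^(-H / 5)
    return 1329.0 / math.sqrt(albedo) * 10 ** (-h_mag / 5)


# Hypothetical object with H = 5.0 and albedo = 0.25.
estimate = diameter_from_h_albedo(5.0, 0.25)
print(estimate)
```

If a model is given both columns, a tree ensemble can approximate this formula closely. One way to check would be to retrain without `albedo` and see how far the scores drop.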
Kindly note that the above feature importances are from the Random Forest model and not LightGBM.