This project aims to design, develop and implement the training model by using different inputs data. The machine will able to learn the features and extract the crop yield from the data by using data mining and data science techniques.
Algorithms used in this project:
-
Gradient Boosting Regression, validated by 5 Folds Cross Validation (for crop yield prediction) Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
-
Multi-variate Regression (for Fertilizer Recommendation) Multivariate Regression is a method used to measure the degree at which more than one independent variable (predictors) and more than one dependent variable (responses), are linearly related.
References to Dataset (column wise):
- State - Name of the 14 states in USA
- Year - Data from 1964-2017 (Historical occurences and records)
- Name of Fertilizer (%) - Acreage receiving the particular fertilizer
- Name of Fertilizer (Pounds/Acre) - Amount of the fertilizer being given to that acreage
- Area Planted - As mentioned
- Harvested Area - As mentioned
- Lint Yield - Yield of the cotton crop
- As part of pre-processing, the missing values were replaced with the mean values. Further to increase accuracy, the values were later replaced with the mean values of that feature, corresponding to the state; since states were showing varied values that weren’t in correlation with each other.
- Unique values of States have been noted and were mapped to corresponding integral values, to get a dataset that was capable to be trained on the regression model used.
- Pearson’s correlation used to check for redundant features. Fortunately, all the features show a value greater than |0.2| which shows a medium-high relation between the features and the target variable (Lint Yield)
Accuracy of 90.1% for Test set and 98% for Train set (no case of underfit/overfit). Still to rule out the possibility, 5 Folds Cross Validation is being used to re-check the accuracy score. After validation, 82.5% is accuracy.
- MAE= 71.42171078565305
- MSE= 9318.895794593871
- R2 value= 0.9072612517186169
- Adjusted R2 value= 0.906016436305444
- RMSE (train)= 44.51231769473379
- RMSE (test)= 96.53442802748599