Machine Learning Study on a Dataset of Epicurious Recipes
Data provided by Hugo Darwood via Kaggle: https://www.kaggle.com/hugodarwood/epirecipes
- Will take some input (base ingredients, cuisine style, nutritional content) and return a classifier
- Collaborative filtering (if person A likes items 1, 2, 3 and person B likes items 2, 3, 4, then they have similar interests, so A should like item 4 and B should like item 1)
- Higher or lower rating based on similar recipes (classifier)
- Estimated score (regressor)
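
The collaborative filtering idea above can be sketched on the A/B example. This is a minimal illustration, not the project's implementation: the `likes` dictionary, the `recommend` helper, the choice of Jaccard overlap as the user similarity, and the `min_similarity` threshold are all assumptions made for the example.

```python
# Hypothetical toy data: the items each user liked (taken from the A/B example).
likes = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def jaccard(a, b):
    """Overlap between two sets of liked items: 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, min_similarity=0.3):
    """Suggest items liked by similar users that `user` has not liked yet.

    `min_similarity` is an arbitrary cutoff chosen for this sketch.
    """
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        if sim < min_similarity:
            continue
        # Weight each unseen item by how similar the recommending user is.
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("A", likes))
print(recommend("B", likes))
```

With only two users the result is trivial (A is offered item 4, B is offered item 1), but the same loop extends to many users by accumulating similarity-weighted votes per item.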
Features are a mess
Rating distribution is NOT normalized
Feature Engineering: 680 columns
- Ingredients are already label encoded
- Nutrition content is useful, but not relevant to recommendations based on user taste preferences
Initial EDA yields some pretty strange outliers:
Linearly Separable Data?
- Binary classification and linear models will have a hard time if data is not linearly separable
- Use tree-based models if the data is not linearly separable
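
One rough way to probe linear separability is to train a perceptron and see whether it reaches zero training errors. The sketch below is an illustration under assumptions: `perceptron_separates` is a hypothetical helper, the AND/XOR points are toy data rather than recipe features, and the `epochs` limit is arbitrary. Convergence shows the data is separable; failure to converge within the limit is only a hint that it is not.

```python
def perceptron_separates(points, labels, epochs=100):
    """Return True if a perceptron reaches zero training errors within `epochs` passes.

    Labels must be +1 or -1. A True result shows the data is linearly separable;
    a False result only suggests that it may not be.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:
            return True
    return False

# Toy data (assumption): AND is linearly separable, XOR is not.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]

print(perceptron_separates(points, and_labels))
print(perceptron_separates(points, xor_labels))
```

On 680 mostly binary columns a check like this is only a first look; comparing cross-validated scores of a linear model and a tree-based model on the real data would be the more practical test.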
Danger of overfitting: if the model is trained on overly specific inputs, it will not generalize to others
- Clustering
- Build a 'user' similarity matrix
Split and Clean Data
- Worth developing LSA to separate similar features (ingredients, nutrition, etc.)
Recommendation Engine
- Vectorize salient features
- build LSA models to train recommendations
- Cosine Similarity
- Pearson Similarity
- Jaccard Similarity
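
The three similarity measures listed above can be compared on a pair of vectors. This is a minimal sketch using only the standard library; `recipe_a` and `recipe_b` are made-up binary ingredient vectors, not rows from the Epicurious data, and the function names are chosen for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pearson_similarity(u, v):
    """Pearson correlation, computed as the cosine similarity of mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

def jaccard_similarity(u, v):
    """Shared non-zero positions divided by positions that are non-zero in either vector."""
    set_u = {i for i, a in enumerate(u) if a}
    set_v = {i for i, b in enumerate(v) if b}
    union = set_u | set_v
    return len(set_u & set_v) / len(union) if union else 0.0

# Hypothetical binary ingredient vectors for two recipes (1 = ingredient present).
recipe_a = [1, 1, 0, 1, 0]
recipe_b = [1, 0, 0, 1, 1]

print(round(cosine_similarity(recipe_a, recipe_b), 3))
print(round(pearson_similarity(recipe_a, recipe_b), 3))
print(round(jaccard_similarity(recipe_a, recipe_b), 3))
```

Jaccard ignores magnitudes and suits binary ingredient flags, while cosine and Pearson also work on continuous features such as nutrition values; Pearson additionally removes each vector's mean before comparing.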
Rating Prediction
- Develop a web scraper to collect new articles for rating prediction