The NFL is an extremely high contact sport, and every year there are hundreds of injuries that jeopardize not only the team's chances at success, but also, and more importantly, the player’s long term health and careers. There are numerous factors that can be attributed to whether an injury will occur in any given game. However, we want to find out if there are some circumstances that are more associated with higher occurrences of injury such as the surface, position, and/or type of play that was called when an injury occurred. Additionally, we want to investigate whether we can accurately predict a specific player’s risk of injury.
With season, and potentially career, ending injuries on the rise, the NFL, in association with ESPN and SportsCenter, has contracted the 3 of us to investigate playing circumstances and draw any associations to lower body injuries. Additionally, we have also been asked to create a preliminary model to see whether we can accurately predict a given NFL player’s likelihood of injury. Prior to beginning, we had two hypotheses. First, we believed that injuries could strongly be associated with conditions like field type, position, and how many times a player is utilized in a game. Second, we believed that we could create a predictive model suggesting the likelihood of injury for a given player based on their past injury history and some statistics regarding their age, weight, height, and position.
- Data Utilization: The modeling for player injury predictions will leverage historical injury data agregated from weekly NFL injury reports. Both injury type and frequency for each player will be taken into account to predict future injury occurances. In the future, data such as player performance metrics and training intensity can be added to potentially improve modeling. For added context, data on game conditions like weather, field material, and stadium type was analyzed to find connections to injury occurance and severity.
- Analysis and Pattern Recognition: By analyzing patterns and correlations within the data, the model will provide insights into the key risk factors contributing to injuries.
- Injury Prevention Strategies: Utilizing the insights gained, the team can implement more effective injury prevention strategies, optimizing player health management and reducing the likelihood of injuries.
-
1_DataCollection/
src/
- Source code for data collection (e.g., APIs, web scraping)files/
- Raw files downloaded or used during collectionimages/
- Images or diagrams explaining the data collection process
-
2_DataCleaning/
src/
- Source code for data cleaning (e.g., data cleaning scripts)cleaned_data/
- Output of cleaned datalogs/
- Logs generated during the cleaning process
-
3_DataExploration/
notebooks/
- Jupyter notebooks for exploratory analysisreports/
- Reports or summaries of findingsvisualizations/
- Initial visualizations used for exploration
-
4_FeatureEngineering/
src/
- Code for feature engineering (transformations, feature selection)features/
- Generated feature datasets
-
5_ModelDevelopment/
models/
- Saved model files (e.g., .pkl for sklearn models)notebooks/
- Jupyter notebooks for model development and testingsrc/
- Additional scripts for model training and validation
-
6_ModelEvaluation/
results/
- Evaluation results, metricsvisualizations/
- Visualizations of model performance (e.g., ROC curves, confusion matrices)
-
7_Deployment/
src/
- Code for deploying models (e.g., Flask app, batch scripts)configs/
- Configuration files for deployment (e.g., .env, Dockerfile)
-
- This dataset supports the Big Data Bowl 2023. You'll find new data gathered and aggregated via various APIs and scrapes:
- Weekly team rosters
- Play-by-play data with features related to rushing and tackles
- Charting data by play and game (see specific licensing terms),
- Injury data
- Data from the NFL Combine event
The data was collected by JohnM and is available on Kaggle.
-
- This dataset is separately taken from another competition hosted by the NFL, NFL 1st and Future - Analytics, and contains:
- 105 lower body injury records from 2 subsequent seasons
- Play-by-play data with information like weather, temperature, play type, position involved, bodily part injured, etc.
- Player track data (this we did not utilize)
Upon reviewing the performance metrics of out Logistic Regression model, we have observed the following outcomes:
- The model’s accuracy is approx. 83%, which implies a reasonably high right of correct predictions for injury. We recognize that accuracy is not the most appropriate indicator of performance especially since our data set was imbalance (80/20)
- The model’s precision of our model is approx. 43%, which reflects the proportion of true positives in all positive predictions in this model. This is a great start but leaves great room for improvement, as the majority of the predictions were false positives.
- As a note, because we assume our clients will rather offer preventative measures even if the predictions are not precise this measurement is the focus of our analysis.
- The model’s recall is approx. 23%, the model demonstrated a less than average outcome of identifying actual injury instances. This metric allows us to understand where our model is lacking and what methods can be implemented to improve the results of our model overall.
- Finally, our model’s F1 Score was approx. 30%. This metric shows the lack of balance in our precision and recall scores and with the implemented suggestions for improving the robustness of our model we believe that we can increase our model’s performance to a standard that is acceptable for real-world use.