
end-to-end-data-science-project

This project walks through the process of completing an end-to-end Data Science project, starting with a problem statement and ending with a deployed product that the client can use.

Table of Contents

  1. Problem statement
  2. Business Case
  3. Pick a proper Dataset
  4. Structure DS Project
  5. Data Exploration
  6. Data Preprocessing
  7. Data Analytics
  8. Feature Engineering
  9. ML Model Training

In more detail

1. Problem statement

A (fictional) client is an IT educational institute. They have reached out to us with the following: “IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. But on the other hand, such fast development confuses our students. They do not know which job profile is most related to the skills they already have or want to learn.”

“Do I need to learn C++ to be a Data Scientist? Do DevOps engineers and system admins use the same technologies? I really like JavaScript; can I use it in Data Analytics?”

Those are some of the questions that our students ask. Could you please develop a data-driven solution for our students to answer such questions? They mostly want to understand the relationships between jobs and technologies.

Possible solutions

  1. Cluster the most-related job profiles into groups.
  2. A recommender ML model that suggests a job profile based on a given set of skills, programming languages, and frameworks (see the sketch below).
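
As a minimal sketch of solution 2, assuming we already have a multi-hot table of respondents' skills labeled with a job profile (the table and its column names here are hypothetical toy data, not the real survey schema), a profile can be recommended by cosine similarity between the student's skill vector and each profile's average skill vector:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per survey respondent,
# one 0/1 column per skill, plus the respondent's job profile.
df = pd.DataFrame({
    "job":    ["Data scientist", "Data scientist", "DevOps", "DevOps"],
    "Python": [1, 1, 1, 0],
    "SQL":    [1, 1, 0, 0],
    "Bash":   [0, 0, 1, 1],
    "Docker": [0, 1, 1, 1],
})

# Average skill vector per job profile (rows: profiles, columns: skills).
profiles = df.groupby("job").mean()

def recommend(student_skills, profiles, top_n=3):
    """Rank job profiles by cosine similarity to the student's skills."""
    q = profiles.columns.isin(student_skills).astype(float)  # query vector
    m = profiles.to_numpy()
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-12)
    return pd.Series(sims, index=profiles.index).nlargest(top_n)

print(recommend(["Python", "SQL"], profiles))  # -> Data scientist ranks first
```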

2. Business Case

What are the KPIs that you will positively impact?

  1. Higher enrollment rate, since prospective students can pick a track with more certainty.
  2. Lower drop-out rate.
  3. Time saved for the academic advisors.

3. Pick a proper Dataset

If you want data about IT job profiles, programming languages, and frameworks, the first place to look is Stack Overflow, so we will be using the Stack Overflow Annual Developer Survey 2020. [Download Dataset]
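
A minimal loading sketch, assuming the survey export `survey_results_public.csv` has been downloaded into `data/raw/` (see the project structure below). In the 2020 survey, multi-valued answers such as `DevType` and `LanguageWorkedWith` are stored as `;`-separated strings:

```python
import pandas as pd

# Load the raw survey export (path follows the project structure below).
df = pd.read_csv("data/raw/survey_results_public.csv")

# Keep only the columns relevant to the jobs-vs-technologies question.
cols = ["DevType", "LanguageWorkedWith", "DatabaseWorkedWith", "PlatformWorkedWith"]
df = df[cols].dropna(subset=["DevType"])

# Expand the ';'-separated answers into 0/1 indicator columns.
languages = df["LanguageWorkedWith"].str.get_dummies(sep=";")
print(languages.sum().sort_values(ascending=False).head(10))  # most-used languages
```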

4. Structure DS Project

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
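
The layout above is the Cookiecutter Data Science template (https://github.com/drivendata/cookiecutter-data-science); running `cookiecutter` against that repository generates this skeleton automatically instead of creating each directory by hand.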

Data Preparation & Modeling (points 5 to 9)

All details are in five notebooks in the end-to-end-data-science-project/notebooks directory.
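
As one hedged example of what the modeling stage can produce (reusing the hypothetical `df` and `languages` tables from the loading sketch in section 3), related job profiles can be grouped by clustering their average technology usage with scikit-learn:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Respondents may list several job profiles; explode to one row per profile.
jobs = df["DevType"].str.split(";").explode()

# Align the language indicators to the exploded rows, then average per profile.
X = languages.reindex(jobs.index)
profile_vectors = X.groupby(jobs.to_numpy()).mean()  # profile x language-usage rate

# Cluster related job profiles (k = 4 is arbitrary here; choose it via the
# elbow method or silhouette score in the exploration notebooks).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(profile_vectors)

for cluster, members in profile_vectors.index.to_series().groupby(labels):
    print(f"Cluster {cluster}: {', '.join(members)}")
```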

Resources