Case studies for The Art of Feature Engineering

The code behind these case studies is intended as a communication tool for the ideas expressed in the book. This code is as far from production code as it can get. Moreover, many of the techniques presented are available as part of some of the underlining libraries used. Showing a step-by-step implementation of different feature engineering techniques is intended.

See the companion website for details about the book: http://artoffeatureengineering.com/.

The style used in Python is also intentionally kept simple for people coming from other languages that plan to use the ideas described in the book outside of Python.

As described in the book, the case studies explore as many feature engineering ideas within the limits of:

At most one day of execution time per notebook.
No GPU required.
Minimal dependencies.
At most 8Gb of RAM.

As a result of these constraints, these notebooks do not undergo as much hyperparameter tuning as necessary. This is a shortcoming of these case studies, keep it in mind if you want to follow a similar path with your experiments.

Minor issues:

As these case studies are foremost an educational tool, I expect readers might want to try variants of some cells in isolation. To help with that, I have tried for the cells to be executable without having to re-run the whole notebook. That means that most cells read everything they need from disk and write all their results back into disk. This is unnecessary with normal notebooks as the values remain in memory, so the code for each cell might look long and somewhat unusual. In a sense, each cell tries to be a separate Python program. To solidify the vision of independent tweaking, I am also distributing these intermediate files besides the input data.

I dislike Pandas with a passion and discourage its use at any level. These notebooks are Pandas-free, which might seem unusual to some.

The last topic in the last chapter (recommendation as imputation) uses more than 8Gb of RAM.

Fetching the data

The data is available in both Zip and Tar BZip2 files. Chapter 9 (images) uses a tile set provided by NASA. It contains 88 thousand tiles occupying 6Gb of space. These tiles are used at the beginning of Chapter 9's notebook to generate 80 thousand boxes around each city or town. These boxes occupy less than 1Gb of disk space. As such, I am distributing the boxes and leaving the tiles for a separate download, in the event you might want to experiment with other techniques extracting more data from the original tiles. Otherwise the feature engineering techniques in Chapter 9's notebook should run fine from the extracted boxes.

(Book Data, Zip format 2.4Gb)[http://artoffeatureengineering.com/data/feateng_data.zip]
(Book Data, Tar BZip2 format 1.8Gb)[http://artoffeatureengineering.com/data/feateng_data.bz2]
(Tile Data, Zip format 5.4Gb)[http://artoffeatureengineering.com/data/feateng_tiles.zip]
(Tile Data, Tar BZip2 format 5.3Gb)[http://artoffeatureengineering.com/data/feateng_tiles.bz2]

Set-up a virtual environment

python3 -m venv feateng source ./feateng/bin/activate

Install local dependencies for graphviz

sudo apt install python-pydot python-pydot-ng graphviz

General dependencies

pip3 install jupyter pip3 install ipykernel pip3 install scikit-learn pip3 install lxml pip3 install numpy pip3 install scipy pip3 install matplotlib pip3 install graphviz

Dependencies for Chapter 7

pip3 install statsmodels

Dependencies for Chapter 8

pip3 install stemming pip3 install gensim

Dependencies for Chapter 9

pip3 install opencv-python

Dependencies for Chapter 10

pip3 install geopy

Launch Jupyter with the feateng environment

python -m ipykernel install --user --name feateng jupyter notebook --no-browser .

Extras

The folder tourism contains a case study for the feature engineering chapter in the book Applied Data Science in Tourism: Interdisciplinary Approches, Methodologies and Applications. It uses pyspark to solve an AirBnB price prediction task.

An extension and improvment for the case studies in Chapter 10 is available in the repository for the RIIAA'20 Workshop "Feature Engineering for Spatial and Temporal Data ".

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
tourism		tourism
.gitignore		.gitignore
Chapter10.ipynb		Chapter10.ipynb
Chapter6.ipynb		Chapter6.ipynb
Chapter7.ipynb		Chapter7.ipynb
Chapter8.ipynb		Chapter8.ipynb
Chapter9.ipynb		Chapter9.ipynb
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Case studies for The Art of Feature Engineering

Fetching the data

Set-up a virtual environment

Install local dependencies for graphviz

General dependencies

Dependencies for Chapter 7

Dependencies for Chapter 8

Dependencies for Chapter 9

Dependencies for Chapter 10

Launch Jupyter with the feateng environment

Extras

About

Releases

Packages

Languages

License

DrDub/artfeateng

Folders and files

Latest commit

History

Repository files navigation

Case studies for The Art of Feature Engineering

Fetching the data

Set-up a virtual environment

Install local dependencies for graphviz

General dependencies

Dependencies for Chapter 7

Dependencies for Chapter 8

Dependencies for Chapter 9

Dependencies for Chapter 10

Launch Jupyter with the feateng environment

Extras

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages