ds-bigdata - package `extractbda`

An interface for easy access to the data and the final ML predictor of our Big Data Analytics project. By installing and importing this package, you can create an object that holds the data and our best predictor (a RandomForestRegressor). You can then run your own analysis and compare your results to ours.

Usage

!pip install -U git+https://github.com/gatto/ds-bigdata.git
from extractbda import Bikes
bik = Bikes(geo_k=21)

Parameters for `Bikes()`

`geo_k = 21|11|6` (default 11)

Number of zones to divide the dataset into. Although the default is 11, we did most of our analysis using 21.

`val = True|False` (default False)

Whether to provide a holdout validation set in addition to the standard training and test sets. Use False if cross-validating.

Objects found in `bik`

predictor-related

bik.model["RF"] (scikit-learn RandomForestRegressor object)
bik.model["y_pred"] (predictions on x_test)
bik.model["r2"] (R-squared score for RF)
bik.model["mse"] (MSE metric for RF)
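To make the two stored metrics concrete, here is a minimal sketch of how MSE and R-squared are usually defined. It assumes that `bik.model["mse"]` and `bik.model["r2"]` are the standard definitions computed from `y_test` and `y_pred`, which this README does not state explicitly. The function names and the numbers are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    pairs = list(zip(y_true, y_pred))
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (not taken from the project's data).
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print("mse:", mse(y_true, y_pred))
print("r2:", r2(y_true, y_pred))
```

If the assumption holds, calling these functions on `bik.d["y_test"]` and `bik.model["y_pred"]` should give values close to the stored ones.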

train/test datasets

bik.d["x_train"]
bik.d["x_test"]
bik.d["x_val"] (only if creating Bikes(val=True))
bik.d["y_train"]
bik.d["y_test"]
bik.d["y_val"] (only if creating Bikes(val=True))
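The splits above can be used to compare your own model with the packaged one. The sketch below is one possible way to do it, not part of `extractbda`: `evaluate_candidate` and `MeanPredictor` are hypothetical names, and the toy dictionaries stand in for `bik.d` and `bik.model` so the example runs without installing the package. It assumes the candidate exposes scikit-learn style `fit` and `predict` methods.

```python
def evaluate_candidate(d, baseline, candidate):
    """Fit `candidate` on the training split and report its test MSE
    next to the MSE stored for the packaged predictor."""
    candidate.fit(d["x_train"], d["y_train"])
    y_pred = candidate.predict(d["x_test"])
    y_true = list(d["y_test"])
    candidate_mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"candidate_mse": candidate_mse, "baseline_mse": baseline["mse"]}


class MeanPredictor:
    """Trivial stand-in estimator: always predicts the training mean."""

    def fit(self, x, y):
        y = list(y)
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        # One prediction per row of x.
        return [self.mean_] * len(x)


# Toy stand-ins for bik.d and bik.model (made-up numbers).
toy_d = {
    "x_train": [[0], [1], [2]],
    "y_train": [10, 20, 30],
    "x_test": [[3], [4]],
    "y_test": [18, 26],
}
toy_model = {"mse": 5.0}

result = evaluate_candidate(toy_d, toy_model, MeanPredictor())
print(result)
```

With the real package, the call would look like `evaluate_candidate(bik.d, bik.model, your_estimator)`.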

whole datasets

bik.geo_df_SD (dataset with dummy variables for zones, seasons and weathersit; used to build the train/test datasets above)
bik.geo_df (dataset with no dummies)

Notes on the model

We chose cnt as the target: the total number of bikes taken out, at a granularity of one day and one zone. Zones are available in three aggregations: 6, 11 or 21 zones. The model was trained on 21 zones.
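As an illustration of the "one day and one zone" granularity, the sketch below sums per-record counts into one total per (day, zone) pair. It is not the project's actual preprocessing: the `zone` field name, the `daily_zone_counts` helper and the records are assumptions made for the example.

```python
from collections import defaultdict


def daily_zone_counts(records):
    """Sum `cnt` over all records sharing the same (dteday, zone) pair."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["dteday"], rec["zone"])] += rec["cnt"]
    return dict(totals)


# Made-up records; "zone" is a hypothetical column name.
records = [
    {"dteday": "2011-01-01", "zone": 3, "cnt": 5},
    {"dteday": "2011-01-01", "zone": 3, "cnt": 7},
    {"dteday": "2011-01-01", "zone": 8, "cnt": 2},
    {"dteday": "2011-01-02", "zone": 3, "cnt": 4},
]

totals = daily_zone_counts(records)
print(totals)
```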

Fig - 11 zones partitioning

Fig - 21 zones partitioning

No trend features were added and the data was not treated as a time series, because we don't think there are causal links between the cnt of one day and the cnt of the next or previous day.

Notes on some attributes

dteday: date
season: season (1:winter, 2:spring, 3:summer, 4:fall)
yr: year (0: 2011, 1:2012)
mnth: month (1 to 12)
hr: hour (0 to 23)
holiday: whether the day is a holiday
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit:
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- (DELETED) 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
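The normalized weather attributes can be mapped back to physical units by inverting the formulas listed above. The following is a small sketch using the constants from this list; the function names are made up for the example. The list marks the temperature formulas as applying "only in hourly scale", so treat the results on daily data as approximate.

```python
def denormalize_temp(temp, t_min=-8.0, t_max=39.0):
    """Invert (t - t_min) / (t_max - t_min) to get degrees Celsius."""
    return temp * (t_max - t_min) + t_min


def denormalize_atemp(atemp, t_min=-16.0, t_max=50.0):
    """Same inversion for the feeling temperature."""
    return atemp * (t_max - t_min) + t_min


def denormalize_hum(hum):
    """Humidity was divided by 100 (max)."""
    return hum * 100


def denormalize_windspeed(windspeed):
    """Wind speed was divided by 67 (max)."""
    return windspeed * 67


print(denormalize_temp(0.5))
print(denormalize_atemp(0.5))
print(denormalize_hum(0.5))
print(denormalize_windspeed(1.0))
```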
