I recommend downloading Anaconda. It is a great platform for working with Python because it comes pre-loaded with the common data-science libraries (pandas, NumPy, SciPy, scikit-learn, Matplotlib) and includes IPython, Jupyter Notebooks, and the Spyder IDE (RStudio can be added through Anaconda Navigator)! It's amazing.
This is time series data about people getting coffee from a machine. The subject may sound trivial, but it is a great way to see how pandas can be used, and it gets quite in-depth.
File: filter.py
This covers many basic operations for filtering and selecting data!
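A minimal sketch of the kind of filtering and selecting involved (the DataFrame here is made up for illustration, not the actual coffee data):

```python
import pandas as pd

df = pd.DataFrame({'drinker': ['Ann', 'Bo', 'Cy'],
                   'cups': [3, 1, 5]})

heavy = df[df['cups'] > 2]     # boolean filter: rows where cups > 2
name = df.loc[0, 'drinker']    # label-based selection of a single cell
cups = df['cups']              # select one column as a Series
```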
File: Missing.py
This covers how to treat missing values!
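A quick sketch of the usual options (made-up data, not necessarily how Missing.py does it):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cups': [3, np.nan, 5]})

mask = df['cups'].isnull()   # flag the missing entries
filled = df.fillna(0)        # replace NaN with a constant
dropped = df.dropna()        # or drop rows that contain NaN
```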
File: Remove.py
This covers how to remove duplicates!
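The core pandas calls look roughly like this (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'drinker': ['Ann', 'Ann', 'Bo'],
                   'cups': [3, 3, 1]})

dupes = df.duplicated()          # boolean mask of repeated rows
deduped = df.drop_duplicates()   # keep only the first occurrence
```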
File: concatANDtransform.py
This covers how to concatenate different dataframes and transform data!
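Roughly, and with made-up frames rather than the coffee data:

```python
import pandas as pd

a = pd.DataFrame({'cups': [3, 1]})
b = pd.DataFrame({'cups': [5, 2]})

stacked = pd.concat([a, b], ignore_index=True)    # concatenate rows
doubled = stacked['cups'].apply(lambda x: x * 2)  # transform a column
```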
File: group.py
This covers how to create subsets of data for more specific analysis!
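In pandas this means groupby; a tiny illustrative sketch (not the file's actual columns):

```python
import pandas as pd

df = pd.DataFrame({'machine': ['A', 'A', 'B'],
                   'cups': [3, 1, 5]})

per_machine = df.groupby('machine')['cups'].mean()  # one subset per machine
```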
We need to visualize the data for a better understanding of it. A small combined sketch follows the file list below.
File: standard.py
File: define.py
File: format.py
File: labels.py
File: timeseries.py
File: numpy.py
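Here is that combined matplotlib sketch (the data is randomly generated, purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hypothetical hourly coffee counts, just to have something to plot
times = pd.date_range('2020-01-01', periods=24, freq='h')
cups = np.random.poisson(3, size=24)

plt.plot(times, cups, 'b--', marker='o')  # formatting: dashed blue line, dot markers
plt.xlabel('Time')                        # labels
plt.ylabel('Cups dispensed')
plt.title('Coffee machine usage over one day')
plt.show()
```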
File: Summary.py
Mean, Median, Standard Deviation, Variance!
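All of these are one-liners on a pandas Series:

```python
import pandas as pd

cups = pd.Series([3, 1, 5, 2, 4])

cups.mean()      # mean
cups.median()    # median
cups.std()       # sample standard deviation
cups.var()       # sample variance
cups.describe()  # the common summary statistics all at once
```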
File: categorical.py
File: parametric.py
The Pearson correlation coefficient.
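With SciPy (toy numbers, just to show the call):

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, p_value = pearsonr(x, y)  # r near +1 or -1 means a strong linear relationship
```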
File: nonpam.py
Spearman's rank correlation and chi-squared tests.
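Both are available in scipy.stats; a sketch with toy data:

```python
from scipy.stats import spearmanr, chi2_contingency

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]

rho, p = spearmanr(x, y)    # monotonic association between ranked variables

table = [[10, 20],          # observed counts for two categorical variables
         [20, 10]]
chi2, p, dof, expected = chi2_contingency(table)
```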
File: distribution.py
You scale variables so that differences in magnitude don't produce erroneous and misleading results. There are two ways to scale data: normalization and standardization. Normalization maps each value into the range 0 to 1, while standardization rescales the data to zero mean and unit variance.
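For example, with scikit-learn (whether distribution.py uses this exact API is my assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization: values squeezed into [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: mean 0, unit variance
```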
File: factor.py
Used to find the root causes that explain why the data behaves the way it does. Factors are latent variables: meaningful, but not directly observable. You regress on the features to discover factors that can then serve as variables representing the original data set. These are usually synthetic representations with reduced dimensionality.
Factor analysis assumes your features are continuous or ordinal, that correlation coefficients are greater than 0.3, and that you have more than 100 observations with more than 5 observations per feature (so the minimal 100-observation case caps you at 20 features). The sample must also be homogeneous.
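A minimal sketch with scikit-learn's FactorAnalysis on random data (illustrative only; the real file may use a different library):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))        # 200 observations, 6 continuous features

fa = FactorAnalysis(n_components=2)  # ask for 2 latent factors
scores = fa.fit_transform(X)         # factor scores for each observation
loadings = fa.components_            # how each factor loads on each feature
```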
File: pca.py
File: outliers.py
File: mvoutliers.py
File: linearprojection.py
File: Kmeans.py
This allows you to partition your data into k clusters, with each data point assigned to the cluster with the nearest mean; the space then looks like Voronoi cells. This is known as unsupervised learning, which means our data isn't labeled. We will need to scale our variables and look at a scatterplot or the data table to estimate the number of centroids, or cluster centers, to set for the k parameter in the model.
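A minimal scikit-learn sketch on random data (k=3 is an arbitrary choice here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
X = scale(rng.normal(size=(100, 2)))  # scale the variables first

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)         # cluster assignment for each point
centers = model.cluster_centers_      # the 3 centroids
```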
File: hcluster.py
Hierarchical clustering is considered unsupervised learning as well. It lets you find subgroups within the data by computing the distance between each data point and its nearest neighbors, then linking the closest neighbors together. We find the number of subgroups by looking at a dendrogram.
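With SciPy, the linkage-plus-dendrogram workflow looks roughly like this (random data for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

Z = linkage(X, method='ward')  # link the nearest neighbors into a hierarchy
dendrogram(Z)                  # read the number of subgroups off this tree
plt.show()
```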
File: knn.py
Okay, this is supervised learning! k-nearest neighbors learns from labeled observations and predicts the classification of new observations. The data set must be labeled, and it should be small: every prediction compares the new point against the stored observations, so avoid using this algorithm on large datasets.
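A small scikit-learn sketch using the iris dataset (standing in for whatever knn.py actually uses):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # a small, labeled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)             # "learn" the labeled observations
accuracy = knn.score(X_test, y_test)  # classify the held-out observations
```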
File: linreg.py
A simple use case: let's predict the price of a home given its kitchen size, square footage, bedrooms, yard size, and the crime rate in the area. These features have to be in a numeric, linear relationship with our target value, price. (A sketch follows the assumptions below.)
Assumptions:
- Our variables must be numerical, not categorical; however, I will get into how you could still use categorical values.
- Data is free of missing values & outliers.
- linear relationship between feature variables and target variable.
- All predictors are independent of each other!
- Residuals (the error term) are normally distributed with constant variance.
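Here is the sketch for the home-price example; the numbers are invented and only two of the features are shown:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical numeric features: square footage and number of bedrooms
X = np.array([[1400, 2], [1800, 3], [2400, 4], [3000, 4]])
y = np.array([200_000, 260_000, 330_000, 400_000])  # price

model = LinearRegression().fit(X, y)
predicted = model.predict([[2000, 3]])  # price estimate for a new home
```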
File: nbc.py
Three Types:
- Multinomial: When features (numeric/categorical) describe discrete frequencies, like counts.
- Bernoulli: When features are binary (0 or 1).
- Gaussian: When features are normally distributed.
Assumptions:
- Features are independent of each other.
- A priori assumption: past conditions still hold true. When we make predictions from historical values, we will get incorrect results if present circumstances have changed!
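All three variants ship with scikit-learn; a Gaussian sketch on the iris data (a stand-in, chosen because its features are continuous):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB  # MultinomialNB and BernoulliNB also exist

X, y = load_iris(return_X_y=True)  # continuous features, so the Gaussian variant fits

nb = GaussianNB()
nb.fit(X, y)
probs = nb.predict_proba(X[:3])  # class probabilities for the first 3 rows
```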