hanhanwu/Hanhan_Data_Science_Resources

Hanhan_Data_Science_Resources

helpful resources for (big) data science

DATA PREPROCESSING

FEATURE ENGINEERING

  • Feature Selection: https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
  • Why Feature Selection:
    • It enables the machine learning algorithm to train faster.
    • It reduces the complexity of a model and makes it easier to interpret.
    • It improves the accuracy of a model if the right subset is chosen.
    • It reduces overfitting.
  • Filter Methods: the selection of features is independent of any machine learning algorithm. Features are selected on the basis of their scores in statistical tests of their correlation with the dependent variable. Examples: Pearson’s Correlation, LDA, ANOVA, Chi-Square.
  • Wrapper Methods: train a model on a subset of features and, based on the inferences drawn from that model, decide to add features to or remove features from the subset. These methods are usually computationally very expensive. Examples: Forward Stepwise Selection, Backward Stepwise Elimination, Hybrid Stepwise Selection (forward then backward), Recursive Feature Elimination.
    • Backward stepwise selection requires the number of records n to be larger than the number of features p, so that the full model can be fit
    • Forward stepwise selection also works when n < p
    • Hybrid Approach will do forward selection first, then use backward to remove unnecessary features
  • Embedded Methods: implemented by algorithms that have their own built-in feature selection. Examples: LASSO and RIDGE regression. Lasso regression performs L1 regularization, which adds a penalty proportional to the absolute value of the coefficients and can shrink some of them exactly to zero. Ridge regression performs L2 regularization, which adds a penalty proportional to the square of the coefficients; it shrinks coefficients but does not set them exactly to zero. Other examples of embedded methods are Regularized trees, Memetic algorithm, Random multinomial logit.
  • Differences between Filter Methods and Wrapper Methods
    • Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
    • Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods, by contrast, are computationally very expensive.
    • Filter methods use statistical methods to evaluate a subset of features, while wrapper methods use cross validation.
    • Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods are more likely to find a good subset because they evaluate subsets directly (only an exhaustive search is guaranteed to find the best one).
    • Using the subset of features from the wrapper methods makes the model more prone to overfitting than using the subset of features from the filter methods.
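A minimal sketch of a filter method, assuming Pearson correlation as the scoring test. The function names (`pearson`, `filter_select`) and the toy data are hypothetical and only illustrate the idea that features are ranked without training any model.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_select(features, target, k):
    """Rank features by |r| with the target and keep the top k names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: "size" tracks the target, the others do not.
features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.9]
selected = filter_select(features, target, k=1)
```

A wrapper method would replace the correlation score with the cross-validated performance of a model trained on each candidate subset, which is why it costs far more.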

DATA MINING BIBLE


R


CLOUD PLATFORM MACHINE LEARNING


VISUALIZATION

-- Tableau Visualization

-- Python Visualization

-- R visualization

-- d3 visualization


DEEP LEARNING


Industry Data Analysis/Machine Learning Tools


Statistical Methods


Terminology Wiki

Data Analysis Tricks and Tips

ENSEMBLE

DEAL WITH IMBALANCED DATASET

TIME SERIES

  • ARIMA model

    • Tutorial: http://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/?utm_content=buffer529c5&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer
    • Step 1 - Visualize with time
    • Step 2 - Check Stationary Series - Stationarity Requirements
      • A very short course about Stationary vs Non-stationary: https://campus.datacamp.com/courses/arima-modeling-with-r/time-series-data-and-models?ex=4
      • The mean of the series should be a constant, not a function of time (time independent/no trend)
      • Against Heteroscedasticity: the variance of the series should be constant (time independent); the time series under consideration is a finite variance process
      • The covariance of ith term and (i+m)th term should be constant (time independent); Autocovariance function depends on s and t only through their difference |s-t| (where t and s are moments in time)
      • Dickey Fuller Test of Stationarity: X(t) - X(t-1) = (Rho - 1) X(t - 1) + Er(t). The null hypothesis is that Rho - 1 = 0 (a unit root, i.e. a random walk); if it is rejected, you have a stationary time series
      • You can try log() and diff() to make the data stationary. Logging can help stabilize the variance. Differencing computes the difference between the value of a time series at a certain point in time and its preceding value, that is, Xt−Xt−1. Differencing can help remove the trend of the data and therefore make it stationary (detrend). To sum up: logging against heteroscedasticity, differencing against a trend in the mean.
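The effect of logging and differencing can be seen on small deterministic series. This is only a sketch in Python (the notes above use R's log() and diff()); `log_transform` and `difference` are hypothetical helper names.

```python
import math

def log_transform(series):
    """Natural log of every value (values must be positive)."""
    return [math.log(x) for x in series]

def difference(series, order=1):
    """Apply X(t) - X(t-1) repeatedly, `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

linear = [3, 5, 7, 9, 11]                  # linear trend
quadratic = [1, 4, 9, 16, 25]              # quadratic trend
exponential = [2.0, 4.0, 8.0, 16.0, 32.0]  # multiplicative growth

first_diff_linear = difference(linear)                  # constant: trend removed
second_diff_quadratic = difference(quadratic, order=2)  # constant after two passes
log_diff_exponential = difference(log_transform(exponential))  # constant log-growth
```

Each result is a constant series, which is the "detrended" outcome the bullet describes: one difference handles a linear trend, two handle a quadratic trend, and a log followed by a difference handles multiplicative growth.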
    • R methods to check stationary: http://www.statosphere.com.au/check-time-series-stationary-r/
      • With Acf() and Pacf(), if only a few lags cross the blue line and the later ones soon die off, the series is stationary
      • The Ljung-Box test examines whether there is significant evidence for non-zero correlations at lags 1-20. Its null hypothesis is that the series has no autocorrelation, so small p-values (i.e., less than 0.05) indicate significant autocorrelation; it does not test stationarity directly.
      • Augmented Dickey–Fuller (ADF) t-statistic test: small p-values suggest the data is stationary and doesn’t need to be differenced.
      • Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test: here failing to reject the null hypothesis means that the series is stationary, and small p-values suggest that the series is not stationary and differencing is required.
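To make the "blue line" check concrete, here is a rough sketch of a sample autocorrelation function with the usual approximate 95% bound of 1.96/sqrt(n). It is a plain-Python illustration, not the implementation behind R's Acf(); `sample_acf` and `significance_bound` are hypothetical names.

```python
import math

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

def significance_bound(n):
    """Approximate 95% bound for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)

# A pure trend is not stationary: its autocorrelations stay far above the bound.
trend = list(range(1, 21))
trend_acf = sample_acf(trend, max_lag=5)
bound = significance_bound(len(trend))
```

For the trending series the lag-1 autocorrelation is well above the bound and decays slowly, which is the pattern that suggests differencing first.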
    • Step 2 (continued) - Bring Stationarity: without stationarity, you cannot build a time series model!
      • A random walk is NOT a stationary process: each step depends on the previous one, so its variance grows with time
      • Introduce a coefficient Rho: E[X(t)] = Rho * E[X(t-1)]. 0 <= Rho < 1 can bring stationarity; Rho = 1 is a random walk
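Why Rho matters can be checked with a short calculation. The sketch below assumes the model X(t) = Rho * X(t-1) + Er(t) with X(0) = 0 and independent noise of constant variance; `variance_after` is a hypothetical helper that evaluates the resulting variance of X(t).

```python
def variance_after(rho, t, noise_var=1.0):
    """Var[X(t)] for X(t) = rho * X(t-1) + Er(t), X(0) = 0.

    Unrolling the recursion gives X(t) = sum_k rho**k * Er(t-k), so the
    variance is noise_var * sum(rho**(2k)) for k = 0..t-1.
    """
    return noise_var * sum(rho ** (2 * k) for k in range(t))

random_walk_var = variance_after(1.0, 100)  # grows linearly with t
damped_var = variance_after(0.5, 100)       # settles near 1 / (1 - 0.5**2)
```

With Rho = 1 the variance keeps growing with t, so the series cannot be stationary; with Rho = 0.5 it converges to noise_var / (1 - Rho^2).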
    • Step 3 - After Stationarity, is it an AR or MA process?
      • ARMA - not applicable on non-stationary series. AR (auto regression), MA (moving average). In an MA model, a noise / shock quickly vanishes with time. In an AR model, the shock has a much longer-lasting effect. The covariance between x(t) and x(t-n) is zero for MA models once n is larger than the order of the model, whereas the correlation of x(t) and x(t-n) gradually declines as n becomes larger in the AR model.
      • PACF is the partial autocorrelation function; ACF is a plot of total correlation. In the ACF, an AR or ARMA model tails off, while an MA model cuts off (no more lags beyond the blue line) after lag q. In the PACF, an MA or ARMA model tails off, while an AR model cuts off after lag p. In a word, ACF for the MA model, PACF for the AR model: the lag beyond which the ACF cuts off is the indicated number of MA terms, and the lag beyond which the PACF cuts off is the indicated number of AR terms.
      • Autoregressive component: AR stands for autoregressive. The autoregressive parameter is denoted by p. When p=0, there is no auto-correlation in the series. When p=1, the series is auto-correlated up to one lag.
      • Integration is the inverse of differencing, denoted by d. When d=0, the series is stationary and we do not need to take the difference of it. When d=1, the series is not stationary and, to make it stationary, we need to take the first difference. When d=2, the series has been differenced twice. Differencing more than twice is usually not reliable.
      • Moving average component: MA stands for moving average, which is denoted by q. In ARIMA, q=1 means that the model includes one lagged error term, i.e. there is auto-correlation with the error at one lag.
      • Find optimal params (p,d,q)
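The "tails off" versus "cuts off" distinction can be illustrated with the textbook autocorrelations of the two simplest models. This sketch assumes AR(1) written as x(t) = phi * x(t-1) + e(t) and MA(1) written as x(t) = e(t) + theta * e(t-1); the function names are hypothetical.

```python
def ar1_acf(phi, max_lag):
    """Theoretical ACF of AR(1): phi**k, decaying gradually (tails off)."""
    return [phi ** k for k in range(max_lag + 1)]

def ma1_acf(theta, max_lag):
    """Theoretical ACF of MA(1): non-zero at lag 1 only (cuts off after q=1)."""
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append(theta / (1 + theta ** 2) if k == 1 else 0.0)
    return acf

ar_pattern = ar1_acf(0.5, 3)
ma_pattern = ma1_acf(0.5, 3)
```

The AR(1) values shrink geometrically but never hit zero, while the MA(1) values are exactly zero beyond lag 1, which is why the ACF cut-off lag is read as the number of MA terms.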
    • Step 4 - Build ARIMA model and predict, with the optimal parameters found in step 3
    • My R code (more complete): https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/time_series_predition.R
  • Besides using an ARIMA model, a Control Chart is a statistical method that can be used for time series analysis. It is a graph used to study how a process changes over time. Data are plotted in time order. A control chart always has a central line for the average, an upper line for the upper control limit and a lower line for the lower control limit. These lines are determined from historical data.

    • Control Chart Wiki: https://en.wikipedia.org/wiki/Control_chart

    • About Control Chart, including when to use one: http://asq.org/learn-about-quality/data-collection-analysis-tools/overview/control-chart.html

      • When controlling ongoing processes by finding and correcting problems as they occur.
      • When predicting the expected range of outcomes from a process.
      • When determining whether a process is stable (in statistical control).
      • When analyzing patterns of process variation from special causes (non-routine events) or common causes (built into the process).
      • When determining whether your quality improvement project should aim to prevent specific problems or to make fundamental changes to the process.
    • Control Chart in R: https://cran.r-project.org/web/packages/qicharts/vignettes/controlcharts.html

      • The individual/moving-range chart is a type of control chart used to monitor variables data from a business or industrial process for which it is impractical to use rational subgroups.
      • It is important to note that neither common nor special cause variation is in itself good or bad. A stable process may function at an unsatisfactory level, and an unstable process may be moving in the right direction. But the end goal of improvement is always a stable process functioning at a satisfactory level.
      • Since the calculation of control limits depends on the type of data, many types of control charts have been developed for specific purposes.
      • C chart is based on the Poisson distribution.
      • U chart is different from the C chart in that it accounts for variation in the area of opportunity, e.g. the number of patients or the number of patient days, over time or between units one wishes to compare. If there are many more patients in the hospital in the winter than in the summer, the C chart may falsely detect special cause variation in the raw number of pressure ulcers. U chart plots the rate. The larger the denominator (the area of opportunity), the narrower the control limits.
      • P chart plots proportion/percentage. In theory, the P chart is less sensitive to special cause variation than the U chart because it discards information by dichotomising inspection units (patients) in defectives and non-defectives ignoring the fact that a unit may have more than one defect (pressure ulcers). On the other hand, the P chart often communicates better.
      • Prime control chart: use when control limits for U, P charts are too narrow. The problem may be an artefact caused by the fact that the “true” common cause variation in data is greater than that predicted by the Poisson or binomial distribution. This is called overdispersion. In theory, overdispersion will often be present in real life data but only detectable with large subgroups where point estimates become very precise.
      • G chart: when defects or defectives are rare and the subgroups are small, C, U, and P charts become useless as most subgroups will have no defects. The centre line of the G chart is the theoretical median of the distribution (mean × 0.693). This is because the geometric distribution is highly skewed, thus the median is a better representation of the process centre to be used with the runs analysis. Also note that the G chart rarely has a lower control limit.
      • T chart: similar to the G chart, it is for rare events, but instead of displaying the number of units (opportunities) between events, it displays the time between events.
      • I chart & MR chart: for individual measurements (one observation per subgroup). The I chart is often accompanied by an MR chart, which plots the moving range (the absolute difference between neighboring data points). Points above the upper limit in the MR chart need special attention.
      • Xbar chart & S chart: display the average and the standard deviation of each subgroup
      • Standardized control chart: points are plotted in standard deviation units along with a center line at zero and control limits at 3 and -3. Only relevant for P, U and Xbar charts. With this method, the visualization becomes more readable, but you also lose the original units of data, which may make the chart harder to interpret.
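As a rough illustration of how control limits come from historical data, the sketch below computes I chart and MR chart limits with the standard textbook constants (2.66 for the I chart, 3.267 for the MR chart upper limit). It is an assumption-laden Python sketch and not necessarily the exact calculation used by the qicharts package; `i_mr_limits` and `out_of_control` are hypothetical names.

```python
def i_mr_limits(values):
    """Centre line and control limits for an I chart and its MR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Moving range: absolute difference between neighboring data points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "centre": centre,
        "ucl": centre + 2.66 * mr_bar,
        "lcl": centre - 2.66 * mr_bar,
        "mr_centre": mr_bar,
        "mr_ucl": 3.267 * mr_bar,
    }

def out_of_control(values):
    """Indices of observations outside the I chart control limits."""
    limits = i_mr_limits(values)
    return [
        i for i, v in enumerate(values)
        if v > limits["ucl"] or v < limits["lcl"]
    ]

stable = [10, 12, 11, 13, 12]          # hypothetical stable process
with_spike = [10, 12, 11, 13, 12, 30]  # same process with one unusual value
stable_limits = i_mr_limits(stable)
flagged = out_of_control(with_spike)
```

In the second series only the last point falls outside the limits, the kind of signal that would be investigated as special cause variation.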
    • Control chart vs run chart

      • A run chart is a line graph of data plotted over time. By collecting and charting data over time, you can find trends or patterns in the process.
      • In practice, you can check the run chart first and, when investigating outliers, use a control chart. But when events are rare, starting with G or T charts could be better
    • My R practice code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/control_charts.R

  • Time Series skills test: https://www.analyticsvidhya.com/blog/2017/04/40-questions-on-time-series-solution-skillpower-time-series-datafest-2017/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29

    • Clusters of observations are frequently correlated with increasing strength as the time intervals between them become shorter.
    • Besides AR, MA models, there are:
      • Naïve approach: Estimating technique in which the last period’s actuals are used as this period’s forecast, without adjusting them or attempting to establish causal factors. It is used only for comparison with the forecasts generated by the better (sophisticated) techniques.
      • Exponential Smoothing, older data is given progressively-less relative importance whereas newer data is given progressively-greater importance.
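Both baselines fit in a few lines. This is a minimal sketch of the naive forecast and of simple exponential smoothing, assuming the first smoothed value is initialised to the first observation; the names and the smoothing factor `alpha` are illustrative choices.

```python
def naive_forecast(series):
    """Forecast the next period as the last observed value."""
    return series[-1]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s(t) = alpha * x(t) + (1 - alpha) * s(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

history = [10, 20, 30]
next_naive = naive_forecast(history)
smoothed = exponential_smoothing(history, alpha=0.5)
```

A larger `alpha` gives more weight to recent data; with `alpha = 1` the smoothed series equals the original and the forecast reduces to the naive approach.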
    • MA specifies that the output variable depends linearly on the current and various past values of a stochastic (imperfectly predictable) term.
    • autocovariance is invertible for MA models
    • White noise is a random signal having equal intensity at different frequencies, giving it a constant power spectral density. In discrete time, white noise is a discrete signal whose samples are regarded as a sequence of serially uncorrelated random variables with constant mean and finite variance. So, noise can be a component of time series model.
    • A white noise process must have a constant mean, a constant variance and zero autocovariance structure (except at lag zero, which is the variance)
    • Seasonality exhibits fixed structure; By contrast, Cyclic pattern exists when data exhibit rises and falls that are not of fixed period.
    • If the autocorrelation function (ACF) of the differenced series displays a sharp cutoff and/or the lag-1 autocorrelation is negative–i.e., if the series appears slightly “overdifferenced”–then consider adding an MA term to the model. The lag beyond which the ACF cuts off is the indicated number of MA terms.
    • We can use multiple box plots or autocorrelation to detect seasonality in time series data. Multiple box plots show how the distribution varies across periods. The autocorrelation plot should show spikes at lags equal to the period.
    • Tree Model vs Time Series Model: A time series model is similar to a regression model. So it is good at finding simple linear relationships. While a tree based model though efficient will not be as good at finding and exploiting linear relationships.
    • A weakly stationary time series, xt, is a finite variance process such that "(i) the mean value function, µt, is constant and does not depend on time t, and (ii) the autocovariance function, γ(s,t), depends on s and t only through their difference |s−t|." Random superposition of sines and cosines oscillating at various frequencies is white noise. White noise is weakly stationary. If the white noise variates are also normally distributed (Gaussian), the series is also strictly stationary.
    • Two time series are jointly stationary if they are each stationary and their cross-covariance function is a function only of lag h
    • First Differencing = Xt - X(t-1) ...... (1)
    • Second Differencing is the difference between consecutive results of (1). While First Differencing eliminates a linear trend, Second Differencing eliminates a quadratic trend.
    • Cross Validation for time series models: time series is ordered data, so the validation should also be ordered. Use Forward Chaining Cross Validation. It works this way: fold 1: training [1], test 2; fold 2: training [1 2], test 3; fold 3: training [1 2 3], test 4; and so on.
    • BIC vs AIC: When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC. BIC penalizes complex models more strongly than the AIC. At relatively low N (7 and less) BIC is more tolerant of free parameters than AIC, but less tolerant at higher N (as the natural log of N overcomes 2). https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
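Two of the points above are easy to check in code: the forward chaining fold layout and the sample size at which the BIC penalty overtakes the AIC penalty. The sketch assumes the usual definitions AIC = 2k - 2 ln(L) and BIC = k ln(N) - 2 ln(L), so only the penalty terms are compared; all function names are hypothetical.

```python
import math

def forward_chaining_folds(n_blocks):
    """(train_blocks, test_block) pairs using 1-based, time-ordered blocks."""
    return [(list(range(1, k)), k) for k in range(2, n_blocks + 1)]

def aic_penalty(k):
    """AIC penalty for k free parameters."""
    return 2 * k

def bic_penalty(k, n):
    """BIC penalty for k free parameters and n observations."""
    return k * math.log(n)

folds = forward_chaining_folds(4)

# ln(N) exceeds 2 once N > e**2 (about 7.39), so BIC is stricter from N = 8 on.
bic_is_stricter_at_7 = bic_penalty(3, 7) > aic_penalty(3)
bic_is_stricter_at_8 = bic_penalty(3, 8) > aic_penalty(3)
```

Each fold trains only on blocks that precede the test block, so no future data leaks into training.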
  • 3 Winners deal with mini time series challenge (very interesting, especially after seeing the champion's code..): http://www.analyticsvidhya.com/blog/2016/06/winners-mini-datahack-time-series-approach-codes-solutions/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29

  • Inspiration from IoT Feature Engineering

  • Inspiration from the champion's time series methods

Segmentation

Use Clustering with Supervised Learning


Machine Learning Experiences


CROWD SOURCING


GOOD TO READ

-- In this article, when they talk about concepts such as Activation Function, Gradient Descent and Cost Function, they give several methods for each, which is very helpful. Meanwhile, I have learned more deeply about BB through the concepts of Momentum, Softmax, Dropout and techniques dealing with class imbalance. Very helpful; it is my first time learning these in depth

-- From the above article, I have made a summary of what I think needs to be kept in mind:


DATA SCIENCE INTERVIEW PREPARATION


LEARNING FROM THE OTHERS' EXPERIENCES


OTHER
