A machine learning playground to test out some machinery and ideas.
ML is an empirical science: there are well-established statistical and numerical techniques, but data being what it is, it takes tremendous experimentation to determine what works and what doesn't.
Whoever has the data and the means for experimentation (manpower and compute) could potentially win big.
A flowchart to help guide you.
- If you know everything about your data, then exit
- If you can use analytical tools with standard formulas (MS Excel, etc.), then exit (Note: there are linear programming (LP) solvers that can handle 1 million variables)
- If your data has too much variability, then exit
- If you can use an existing model to solve your problem, then do so and exit.
- If you do not have a lot of data (a lot meaning more than 100K examples or so), then exit
- It is all about predictions - 2-value (binary) or N-value (multi-class). What kind of problem is yours?
- Is your data structured (properly formatted)? If not, structure it first
- Do you have labeled data or unlabeled data?
- Take repeated small samples and a simple model, and test for bias and variance. If the variance is high, you are likely in the too-much-variability case above, so either try a different model or exit
- Knowing a little more about the data helps in picking the right features and the right model. If you are not there yet, repeat the sampling step above until you are.
- Now train and validate your model against your dataset and real-world examples. This can be pricey, depending on the workload. The steps before this one were there to save you time and money.
- Hopefully, the training step was fruitful. But we are not done.
- As we interact with the real world, we acquire more data, and some of the steps above must be constantly repeated.
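
The "repeated small samples, simple model" check in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a prescribed method: the data is synthetic (a noisy sine curve), the "simple model" is a polynomial fit with `np.polyfit`, and the helper name `bias_variance` and all of its parameters are made up for this example. It also assumes the true signal is known at the evaluation points, which is only the case for synthetic data; with real data you would compare against held-out labels, and the bias term would then include label noise.

```python
import numpy as np


def bias_variance(x, y, x_eval, y_true, degree, sample_size=40,
                  n_rounds=200, seed=0):
    """Fit a polynomial on many small random samples (hypothetical helper).

    Returns (squared bias, variance) of the predictions, both averaged
    over the evaluation points.
    """
    rng = np.random.default_rng(seed)
    preds = np.empty((n_rounds, len(x_eval)))
    for i in range(n_rounds):
        # Draw one small sample without replacement and fit the model on it.
        idx = rng.choice(len(x), size=sample_size, replace=False)
        coeffs = np.polyfit(x[idx], y[idx], degree)
        preds[i] = np.polyval(coeffs, x_eval)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the true signal.
    bias_sq = float(np.mean((mean_pred - y_true) ** 2))
    # Variance: how much predictions move from one small sample to the next.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


# Synthetic stand-in data (an assumption; replace with your own dataset).
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=5000)
y = np.sin(3.0 * x) + data_rng.normal(scale=0.5, size=x.size)

# Evaluate away from the edges of the data range.
x_eval = np.linspace(-0.8, 0.8, 50)
y_true = np.sin(3.0 * x_eval)

simple = bias_variance(x, y, x_eval, y_true, degree=1)    # straight line
flexible = bias_variance(x, y, x_eval, y_true, degree=8)  # wiggly polynomial

print("degree 1: bias^2=%.4f variance=%.4f" % simple)
print("degree 8: bias^2=%.4f variance=%.4f" % flexible)
```

On this toy data the straight line should show the larger bias and the degree-8 fit the larger variance. If even the simplest model shows high variance across samples on your real data, that is the signal to try a different model or exit.
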
- Python
- Keras - growing on me for its simplicity; it also uses TensorFlow as a backend
- scikit-learn, NumPy, etc.