curated list of awesome tools and libraries for specific domains
- python
- plotting raster https://github.com/fmaussion/salem
- raster handling http://xarray.pydata.org/en/stable/
- multi dimensional arrays http://xarray.pydata.org/en/stable/
- spatial data including joins (works with dask) http://geopandas.org
- cleaning of addresses: https://github.com/openvenues/libpostal
- postgis
- multi dimensional
- hadoop
- http://www.geomesa.org
- https://github.com/DataSystemsLab/GeoSpark
- https://github.com/harsha2010/magellan
- https://github.com/locationtech/geowave
- https://github.com/locationtech/geotrellis
- https://github.com/Esri/spatial-framework-for-hadoop and https://github.com/Esri/gis-tools-for-hadoop as well as their Java API https://github.com/Esri/geometry-api-java
- http://www.nltk.org/book/
- https://github.com/keon/awesome-nlp
- https://github.com/JohnSnowLabs/spark-nlp
- https://github.com/databricks/spark-corenlp (check license extra carefully for commercial setup)
- pyspark with https://spacy.io
- https://explosion.ai
- https://github.com/clulab/processors
- https://github.com/google/sling
- https://github.com/facebookresearch/faiss
- https://github.com/bplank/bilstm-aux
- https://github.com/facebookresearch/fastText
- https://github.com/facebookresearch/InferSent
- parsing HTML
- clustering
- general operations
- logging & alerting
- certificates
- https://certbot.eff.org and https://letsencrypt.org for free and automated https/ssl certificates
- hadoop monitoring
- https://github.com/linkedin/dr-elephant
- https://github.com/qubole/sparklens
- https://sites.google.com/site/sparkbigdebug/home
- https://github.com/SparkMonitor/varOne (not maintained)
- performance test https://github.com/databricks/spark-perf
- https://github.com/conversant/spark-profiler
- testing
- data quality
- packer base images
small
- prediction
- feature extraction
hadoop
- handling & prediction
- https://github.com/sryza/spark-timeseries
- https://spark-summit.org/2016/events/huohua-a-distributed-time-series-analysis-framework-for-spark/
- https://github.com/twosigma/flint
- https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html
- correlation https://github.com/Sotera/correlation-approximation
- https://github.com/sryza/spark-timeseries
- anomaly detection
- storage
model metadata
- https://github.com/IDSIA/sacred
- http://studio.ml (also hyperparameter optimization)
- https://github.com/mitdbg/modeldb
- https://dataversioncontrol.com
- https://www.comet.ml
- https://aetros.com
model building
- feature engineering
- small
- http://scikit-learn.org/stable/
- R
- python
- hadoop
- ensembling
- specific great models
- gradient boosted trees
- xgboost
- lightgbm
- catboost https://github.com/catboost/catboost
- gradient boosted trees
- visualization of results
model serving
- own API wrapper around original model code
- http://clipper.ai
- https://www.acumos.org
- https://polyaxon.com
- http://vespa.ai
- https://github.com/RedisLabsModules/redis-ml
- https://riseml.com
- https://github.com/Hydrospheredata/mist
- https://github.com/Azure/ai-toolkit-iot-edge
- https://www.dominodatalab.com and various other cloud data science work benches
- https://datmo.com
- https://aws.amazon.com/de/sagemaker/
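
The first option in the list above, an own API wrapper around the original model code, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a production setup: `predict` is a hypothetical stand-in for the real model code, and the `/predict` route, the `features` field and the port are illustrative choices that do not come from any of the linked projects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for the original model code: sums the inputs."""
    return sum(features)


def handle_predict(raw_body):
    """Turn a JSON request body into (status code, JSON response bytes)."""
    try:
        payload = json.loads(raw_body)
        result = predict(payload["features"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing "features" key, or non-numeric inputs.
        error = {"error": "expected a JSON object with a 'features' list"}
        return 400, json.dumps(error).encode("utf-8")
    return 200, json.dumps({"prediction": result}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer; all model logic stays in handle_predict."""

    def do_POST(self):
        if self.path != "/predict":  # assumed route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve predictions on localhost (assumed port)."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the wrapper; a client would then POST a body such as `{"features": [1, 2, 3]}` to `/predict`. Keeping `handle_predict` free of HTTP details makes it possible to test the wrapper without opening a socket.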
model serialization
hyperparameter tuning
- https://sigopt.com
- https://github.com/scikit-optimize/scikit-optimize
- https://github.com/Yelp/MOE
e2e
ml solutions
bridging python / r and big data
- http://blog.madhukaraphatak.com/pipe-in-spark/
- sparklyr
- https://github.com/apple/turicreate for out-of-core models on medium-sized data
graph processing
- hadoop
- non hadoop
- https://neo4j.com (single master, multi slave cluster possible)
- tutorial
- telco hadoop geospatial
- https://www.youtube.com/watch?v=VtvP54Xo3Ek&feature=youtu.be
- streaming and declarative models: https://www.youtube.com/watch?v=Do7C4UJyWCM
- ml
- ml pipelines https://www.youtube.com/watch?v=cpR6Vkp7ImA
- shingles and pipelines https://www.youtube.com/watch?v=qkrh35IF2SU, https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science
- gradient boosting comparison: https://www.youtube.com/watch?v=5CWwwtEM2TA
- streaming
- kafka https://www.youtube.com/watch?v=MNPI925PFD0
- spark streaming in depth https://www.youtube.com/watch?v=hyZU_bw1-ow
- python https://github.com/mrocklin/streamz
- python
- https://python-graph-gallery.com for inspiration
- seaborn
- R
- ggplot2 + great themes
- javascript
bi & dashboarding
- https://metabase.com
- https://looker.com
- python
- https://github.com/stitchfix/pyxley
notebooks
- jupyter
- zeppelin
type safety
- stan
- pymc3
- https://github.com/uber/pyro
- https://www.cockroachlabs.com (inspired by google spanner)
- https://www.snowflake.net/de/
- https://snowplowanalytics.com/products/snowplow-open-source/
- hbase-spark
- postgres on GPUs http://www.brytlyt.com
- scylla, an improved cassandra: http://www.scylladb.com
- https://www.mapd.com/platform/
- https://clickhouse.yandex
time series DBs
big real time analytics and data integration
- https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
- https://www.quora.com/Should-I-use-Gobblin-or-Spark-Streaming-to-injest-data-from-Kafka-to-HDFS/answer/Prithiviraj-Damodaran
- typesafe configuration
- https://cir.is/docs/validation
- https://github.com/pureconfig/pureconfig
- founding / payments https://stripe.com/atlas
- errors
- https://github.com/actionml/universal-recommender
- https://github.com/DataSystemsLab/recdb-postgresql
- apache atlas
- cloudera navigator
- https://www.waterlinedata.com (hadoop only)
- https://alation.com (all)
- https://www.privitar.com
- data mining