I have gathered here only a bouquet of other people's flowers, having contributed nothing of my own but the thread that binds them.
My Apache Zeppelin and Jupyter notebooks, and more: a collection of useful data analysis and machine learning material in general.
This document attempts to develop a curated list of Machine Learning resources, including books, papers, software, libraries, notebooks, etc. Most of the libraries are for Python, though the rest of the materials here are generally suited for working with data.
- Foundations of Machine Learning: I strongly suggest reading this book
- Readings in Database Systems (The Red Book): I strongly suggest reading this book
- A Course in Machine Learning: Good book to start learning ML
- Mining Massive Datasets: Great book about Big Data concepts, Data Mining algorithms, and their applications
- Networks, Crowds, and Markets: Reasoning About a Highly Connected World: Good starter book for Network Science and its applications (e.g. graph analysis, social network analysis)
- An Introduction to Statistical Learning
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Arxiv.org/ML
- Python Machine Learning
- Python Data Science Handbook
- Whirlwind Tour Of Python: Good starter book for learning Python
- Python Machine Learning (second edition)
- Deep Learning Book (MIT Press)
- Probability and Statistics Cookbook
- An ML Cheat Sheet
- Hand-book on STATISTICAL DISTRIBUTIONS for experimentalists
- Deep Learning Papernotes: A repository of many of the research papers published about various DL-related topics over the years
- NRL Papers: A perfect collection of papers on Network Representation Learning and Network Embedding
- KRL Papers: A nice collection of papers on Knowledge Representation Learning and Knowledge Embedding
- Stanford CS 229 ML Cheatsheets: A nice collection of ML cheat-sheets on various important subject matters
- Machine Learning for Business: Teaches you how to make your company more automated, productive, and competitive by mastering practical, implementable machine learning techniques and tools
- A gentle introduction to Tensors and their uses: An introduction to Tensors and their sample applications
- Linear Algebra course book: Jim Hefferon's Linear Algebra book, a good companion for learning linear algebra fundamentals
- Top 10 Data Mining Algorithms: A good article describing how 10 of the more famous Data Mining algorithms work
- Representation Learning: A Review and New Perspectives: An excellent introduction to Representation Learning and its implications
- NLTK Book: A great book if you want to process and analyze texts with NLTK
- An Introduction to Variable and Feature Selection
- Machine Learning Workflow with Python: A collection of useful ML-related material for people interested in working with data
- Interpretable Machine Learning
- GNNPapers: A collection of research papers on Graph Neural Networks
- Mining Social Media: The web version of an easy-to-follow introductory book for mining social media data
- The Economist data visualization: A set of articles describing how the Economist uses data visualization
- Text Mining with R - A tidy approach
- Explanatory Model Analysis: Explore, Explain and Examine Predictive Models
- Graph Representation Learning
- Official Matplotlib cheat sheets
- Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Second Edition
- Dive into Deep Learning: "Interactive deep learning book with code, math, and discussions" -- its website
- Fundamentals of Data Visualization
- Little Book of Deep Learning
- Understanding Deep Learning
- UCI Machine Learning Repository: Lots of exciting datasets piled up just for you to use!
- Kaggle: A very active community, a great place to learn from others
- Network Repository: Many network/graph datasets; if you like graphs, it's the place for you!
- Deep Learning Datasets: DL datasets, of course!
- MLDatasets: Another nice dataset repository
- Open Data for Deep Learning: Deep means big here, I guess!
- Wikipedia List of Datasets for Machine Learning Research: It's Wikipedia!
- GHTorrent: The GHTorrent project is an attempt to make an offline, queryable mirror of GitHub projects' data available for everyone
- SOTorrent: A very rich dataset of Stack Overflow posts and related content such as post comments
- Datalist.com: A handy list of ML-related datasets from all over the web
- awesome-twitter-data: An extensive collection of datasets from Twitter's data
- Dataset for Graph classification: A collection of datasets for classification on graphs
- Google's dataset search
- Quora Data Science: A superb place to ask and seek answers!
- Stack Exchange Data Science: Another friendly Q&A community with an emphasis on the technical side
- Kaggle: Kaggle again:)
- Quora Machine Learning: Quora again:)
- Stack Overflow: General Q&A for developers who need help with their code
- Kaggle: Kaggle again:)
- Reddit Machine Learning Community
- CrowdAI: A Kaggle alternative, popular among students
- Quora: Quora Q&A platform
- github.com: GitHub contains many valuable resources such as the code for many algorithms, all in one centralized platform!
- Apache Projects: A few hundred cool software projects, many of them related to data management in some way (e.g. the Hadoop ecosystem)
- Stanford Machine Learning Course (have a look at the project section!)
- NIPS Website: A very prestigious AI conference held every year
- Scipy Lectures
- Nice website about Data Mining
- ML Resources on Github
- A list of research works on a few interesting topics
- Open Machine Learning Course: An ML course covering many topics
- Tanagra - Data Mining and Data Science Tutorials: TANAGRA's tutorials cover a vast number of topics
- Papers with code: A convenient repository of research papers published together with their code; you can access the code for many recent cutting-edge algorithms from here
- Twitter datasets: A list of datasets related to the social platform Twitter
- KDnuggets
- Deepnote
- Google Colab
- DeepLearning.com: A website for resources related to everything DL
- Data Science blogs: A curated list of Data Science blogs
- PaddleHub: PaddleHub is a large repository of useful pre-trained ML models
- A humongous list of AI/Machine learning/Deep learning/Computer vision/NLP Projects with code
- Explanations of key concepts in ML
- NLP-Notebooks: A collection of notebooks covering conventional NLP tasks such as word embeddings, text classification, etc.
- 2021 DeepMind x UCL Reinforcement Learning Lecture Series: Video lectures from DeepMind covering the area of Reinforcement Learning
- Stanford Graph Learning Workshop: Recordings of the Stanford Graph Learning Workshop sessions from September 2021
- Machine Learning for Beginners - A Curriculum
- Introductory Machine Learning Course: An accessible machine learning course taught by Professor Yaser Abu-Mostafa from Caltech
- The Hugging Face Deep Reinforcement Learning Class
- Google Machine Learning Education: A series of courses that cover machine learning fundamentals and core concepts
- Materials for workshops on the Hugging Face ecosystem
- A large collection of machine learning tutorials
- Large Language Model Course
- Generative AI for Beginners (Version 2) - A Course
- Spyder: A great Python IDE for scientists in general
- Pycharm CE: An excellent IDE for the development of anything with Python
- GNU Emacs: GNU Emacs is an environment for doing almost anything
- IDLE: Default Python IDE, lean and clean environment to develop in Python
- Rodeo: A Python IDE for data scientists
- Anaconda: A very user-friendly environment for scientific Python development
- Miniconda
- Vowpal Wabbit
- StackNet
- Sofia ML
- LIBLINEAR
- LibFM
- SVM Rank
- Jupyter
- Apache Zeppelin: A great notebook environment for data visualization and analytics; it can connect to many different databases and data management systems
- Beaker
- nteract
- JupyterLab: Next-generation Jupyter notebook environment
- Spark Notebook: Spark Notebook is an interactive notebook authoring environment for working with Scala code on top of Spark clusters
- Python(x,y): Python(x,y) is an open-source environment for scientific and numerical computations and analysis
- Polynote: A notebook authoring tool with native support for Scala on Spark from Netflix
- Pandas: Python's famous data manipulation library
- Scipy: Python's de facto scientific computation library
- Numpy: Linear algebra library for fast numerical computation
- Scikit Learn: High-level Machine Learning library with tons of features, very easy-to-use and extendable
- Bokeh: An interactive high-level data visualization library
- Matplotlib: A compelling data visualization library, more low-level than other visualization libs
- Graph Tool: A fast and powerful library for working with graphs in Python. It's developed on top of Boost C++ libraries, so it's very efficient
- NetworkX: A Python module for complex network modelling and analysis, very easy to use but occasionally slow because it's written in pure Python
- TensorFlow: Low-level library for creating deep artificial neural networks, works both on CPU and GPU. Usually, you use TF in conjunction with a library such as Keras that exposes TF's functionality through a higher-level API
- Keras: "Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano" - Keras's website
- NLTK: Swiss Army knife tool for text processing in Python
- Pattern: Another good text processing library for Python
- IPython
- Orange: Orange is a general-purpose data mining and analysis tool, and also a library, that lets you develop machine learning pipelines with just a few drag-and-drop actions
- Theano
- CatBoost: Yandex's implementation of Gradient Boosting on Decision Trees. It supports categorical features out of the box
- XGBoost: The original XGBoost library, a very efficient Gradient Boosting library with extra regularisation
- Mlxtend: A great Data Mining and Machine Learning library with
- NetworKit: A very high-performance graph processing and analysis toolkit written in C++ that uses OpenMP, so it is very fast on multicore computers
- Eli5
- Pandasql
- Dask: A fast data manipulation library with out-of-core handling of the data, suited for a distributed environment. Its DataFrame API closely mirrors Pandas' API
- MLBox
- Gensim
- Scikit-learn-Contrib/Imbalanced-learn: An extension library for Scikit-learn for handling imbalanced datasets
- Patsy: "Kamelot!!! ... It's just a model Shhhh!"
- Statsmodels: A Python package for building various statistical models
- Seaborn: A high-level visualization library for Python
- Pandas-profiling
- Blaze
- Altair
- Numba
- BigARTM
- Gym: An open-source toolkit for reinforcement learning from OpenAI
- PyBrain: A Machine Learning library for Python with emphasis on modelling via many types of neural network architectures
- Sklearn-pandas
- Auto-ML
- Scikit-Learn Contrib/Lightning: An extension library to Scikit-learn for large-scale linear classification, regression, and ranking problems
- GPLearn
- Nengo
- Scikit-learn Contrib/*: A collection of extension libraries for Scikit-learn adding new (missing) functionalities to it
- Koolmogorov: A Python library for hierarchical clustering and visualization
- Lime: A tool for exploring and explaining the output of classifiers
- TreeInterpreter
- SNAP-Python: Python wrapper library for Stanford Network Analysis Platform (SNAP)
- Pycobra: A Python library implementing ensemble methods for regression, classification, and visualization tools, including Voronoi tessellations
- TF Learn: A library on top of TensorFlow providing a higher-level API than TensorFlow
- Featuretools: A Python library for automated feature engineering
- spaCy: NLP library with tons of features (like various CNN models)
- SymPy: Symbolic computation library for Python, aiming to become a full-fledged CAS
- Uniform Manifold Approximation and Projection: A general non-linear dimensionality reduction algorithm implemented in Python
- Scikit-learn Contrib/HDBSCAN: A high-performance implementation of HDBSCAN clustering. HDBSCAN is a robust and easy-to-use clustering algorithm with minimal parameters, ideal for exploratory data analysis. It works as an extension to Scikit-learn
- Turi Create: A fast tool/library for simplifying various ML tasks
- Scikit-learn-Contrib/Categorical-Encoding: An extension library for Scikit-learn that provides additional categorical feature encoding schemes (e.g. the LeaveOneOut scheme)
- Optunity: A library for hyperparameter optimization
- Kmodes
- TF-Slim
- Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend" - Pyro's website
- GEM: A Python library that provides various graph embedding methods like 'node2vec' and 'locally linear embedding.'
- DynamicGEM: A dynamic graph embedding library like GEM
- GraphSAGE: A graph embedding framework to generate low-dimensional vector representations for nodes, instrumental if you need to use deep learning on graph data
- Horovod: A distributed training framework for TensorFlow, Keras, and PyTorch by Uber
- NetLSD: Python implementation of NetLSD, a scalable graph embedding algorithm for representing a graph via a low-dimensional vector
- SHAP: A tool for exploring and explaining the outcome of an arbitrary model
- NLPre: Another fantastic Python NLP library
- GCN: Python implementation of graph convolutional networks in TensorFlow
- AllenNLP: "An open-source NLP research library, built on PyTorch" - AllenNLP's repository documentation
- TensorLy: A Python Library for efficient Tensor operations
- CuPy: A Python matrix library accelerated by Nvidia CUDA that is also compatible with Numpy's API
- Scikit-Multiflow: A Python library for Stream Mining
- MLflow: A software toolbox to manage ML projects' workflow and life-cycle. It aims to make ML software projects easier to implement by providing various helper components for each step
- pyGAM: A Python module for building Generalized Additive Models (GAMs)
- ggplot: "ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professionally-looking plots quickly with minimal code" - ggplot's website
- Linkpred: A Python package for link prediction on graphs
- SparklingGraph: A Python library to process large-scale graphs using Spark and GraphX in a distributed manner
- OpenNE: An open-source network embedding library
- Galry: A high-performance visualization library in Python
- Dedupe: A Python library for fuzzy entity resolution and record deduplication
- PyText: A deep-learning-based NLP modelling framework built on top of PyTorch
- flair: A state-of-the-art NLP framework in Python from Zalando
- NearPy: "A Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes," according to its descriptions
- fastchunking: A (fast) text chunking algorithm implemented in C++ and Python
- Vaex: Vaex is a data manipulation library much like Pandas and Dask, with a lazy out-of-core approach to handling the data so you can work with huge tables with it
- openTSNE: An extensible, parallel implementation of t-SNE
- Faust: A stream processing library for Python
- Active Semi-Supervised Clustering: An extension library for Scikit-learn that implements a collection of useful active semi-supervised clustering algorithms
- TextDistance: A Python library for calculating and comparing the distance between two sequences (such as text documents) with many algorithms
- Ray: A scalable, high-performance distributed execution framework for executing arbitrary Python functions on multiple machines, suitable for many ML workloads
- Pyitlib: An open-source library for calculating a useful collection of information-theoretic measures (e.g., entropy) for discrete random variables
- KDEpy: A collection of useful kernel density estimators in Python 3.5+
- Tsfresh: A Python library for (automatic) feature extraction and engineering on time-dependent data
- GPy: A Python library for working with Gaussian processes
- Tslearn: A machine learning library dedicated to working with time-dependent data
- Ludwig: "Ludwig is a toolbox that allows to train and test deep learning models without the need to write code" - Ludwig's website
- Record Linkage Toolkit: A Python software toolkit for record deduplication and linkage
- PyJanitor: Python port of R's janitor package for data cleansing and manipulation
- FastText: A library for fast and efficient text embedding and classification
- Mimesis: A fast and valuable fake data generation library
- PyOD: A Python software toolbox for scalable Outlier Detection (aka Anomaly Detection)
- Creme: A Python library for Online Learning and building incremental models
- vg: A linear algebra library much like Numpy with a more human-friendly interface
- GraphKernels: A fast library for calculating various graph kernels
- GraKeL: A graph kernel calculation library that uses Scikit-learn's API, so it can be combined with the functionality and routines already present in Scikit-learn without much hassle
- Graphsim: A graph similarity extension library for NetworkX
- Textract: A general text extraction tool from many file formats
- Sacred: Sacred is a Python library to make an ML workflow easier to reproduce and manage for you!
- TextDistance: TextDistance is a Python library for calculating and comparing the distance between two or more sequences of an arbitrary alphabet (e.g., words, DNA sequences). It offers over 30 distance algorithms
- Py_stringmatching: Py_stringmatching is a Python library consisting of a comprehensive set of string tokenizers (such as alphabetical tokenizers, whitespace tokenizers) and also string similarity measures (e.g., edit distance, Jaccard distance)
- JGraph: JGraph is a WebGL graph drawing library for Python
- Kedro: A Python library and also a tool to manage your data analysis workflow in your projects
- PySAL: PySAL is a Python package for geolocation-based data analysis
- k-Shape: This is a Python implementation of the k-Shape clustering algorithm for clustering the time series data
- Pyforest: You could use Pyforest to import all Python data science-related libraries lazily as you need them in your code
- ETE Toolkit: ETE Toolkit is a Python toolbox for visualizing and analyzing tree-structured data
- Whoosh: Whoosh is a full-text indexing and search library for Python
- Geoplot: Geoplot is a Python visualization library for geospatial plotting of geolocational records
- GeoPandas: GeoPandas is a high-level library with an API similar to Pandas that makes working with geospatial datasets in Python much easier
- Edward: "A library for probabilistic modelling, inference, and criticism" - its website
- HyperTools: A Python library for high-dimensional data visualization and analysis
- TextRank: TextRank algorithm implementation for Python 3
- pymorton: A Python package for ordinal hashing of multidimensional points into a one-dimensional ordering
- PySS3: A Python package implementing the SS3 text classifier with visualization tools for explainable artificial intelligence (XAI)
- Lpproj: A Python implementation of Locality Preserving Projections (LPP) with Scikit-Learn compatible API
- Multi-Rake: Multilingual rapid automatic keyword extraction (Multi-RAKE) is a Python library for automatic text summarization and keyword extraction of text in many different languages
- PyCaret: PyCaret is a Python library and tool for automating ML workflows
- ACME: A software framework for research on reinforcement learning
- fastText: A fast text representation learning and classification library from Facebook
- Distance: A useful library in pure Python to calculate the distance between arbitrary sequences
- Texthero: "Text preprocessing, representation and visualization from zero to hero" -- Texthero's website
- xLearn: "High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface." -- xLearn's description
- TextBlob: A text processing library with a high-level API
- Plotnine
- Dtale: A Python tool to analyze data stored in pandas dataframes
- Lasagne: A lightweight library to build and train neural networks in Theano
- Magnitude: Magnitude is a vector embedding helper library (much in the spirit of Gensim)
- Missingno: A useful Python library for visualizing missing data
- Vector Hub: Vector Hub is a Python library that can help turn almost everything (including text, graph, and image data) into vector representations
- pyLDAvis: A Python library for interactive topic modeling and visualization of topics in textual datasets
- Pyextrank: A Python implementation of the TextRank algorithm
- Mitosheets: A JupyterLab extension to make it easier to work with Pandas dataframes
- Transformers
- CoClust: A Python library for co-clustering
- PySurvival: PySurvival is a Python package for survival analysis of data
- Scikit-survival: Scikit-survival is an extension to Scikit-learn that adds survival analysis capabilities to it
- rfpimp: A Python package that brings permutation-based feature importance measure to Scikit-learn Random Forests learners
- Jiant: Jiant is an NLP software toolkit with multitask and transfer learning capabilities
- PyG: "PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data." -- PyG's documentation
- Nodevectors: A Python package with fast and scalable implementations for some popular vertex embedding algorithms
- JGraphT: JGraphT now supports Python
- DGL: Deep graph library is a Python package for using deep learning algorithms on graph data. It can use PyTorch, TensorFlow, or Apache MXNet as its backend
- Spektral: A Python package for creating and running graph neural networks
- HyperNetX(HNX): A Python library to work with data modelled as hypergraphs
- Graph4NLP: Graph4nlp is a Python library that makes it easier to use graph neural networks in and for NLP tasks
- JAX: JAX is a Python library for high-performance numerical computation used in machine learning
- Metric-learn: Metric-learn is a Python library for metric learning. It's available as part of the [scikit-learn-contrib](https://github.com/scikit-learn-contrib) collection
- AmpliGraph: AmpliGraph is a Python library for representation learning on knowledge graphs
- Distfit: "Distfit is a python package for probability density fitting of univariate distributions on non-censored data" -- Distfit's website
- Pke: Pke is a Python library for keyphrase extraction from text
- Albumentations: Albumentations is a library for image augmentation
- Spark NLP: Spark NLP is an NLP library for Python
- Skorch: Skorch is a neural network library that uses PyTorch as its backend and provides APIs compatible with the Scikit-learn machine learning library
- Optuna: Optuna is an open source hyperparameter optimization framework for automating hyperparameter search
- Anomalib: "A library for benchmarking, developing and deploying deep learning anomaly detection algorithms." -- Anomalib's webpage on github.com
- AutoKeras: A wrapper library around Keras that makes using Keras easier
- FLAML: "A Fast Library for Automated Machine Learning & Tuning" -- on FLAML's GitHub repository
- TextAugment: TextAugment is a text augmentation library for Python 3
- NLPAug: NLPAug is another text augmentation library for Python 3
- AutoTS: AutoTS (or Auto_TimeSeries) is a high-level Python library for training and building models on time-series data
- Embetter: Embetter is an easy-to-use Python library for embedding image and text data with a Scikit-learn-compatible API
- Lazy Predict: A handy library for evaluating the performance of a collection of simple models on the data
- TensorLy: TensorLy is a Python library for performing common tensor operations, such as tensor decomposition, in an easy and fast way
- CogDL: "CogDL is a graph deep learning toolkit that allows researchers and developers to easily train and compare baseline or customized models for node classification, graph classification, and other important tasks in the graph domain." -- The description on CogDL's GitHub repository
- StellarGraph: StellarGraph is a Python library for machine learning on graph data
- Polars: Polars is a high-performance dataframe library, similar to Pandas, for Python (and Rust)
- Snap Machine Learning (SnapML): "A library that provides high-speed training of popular machine learning models on modern CPU/GPU computing systems" -- SnapML's website
- Sklearn-porter: Python library that helps convert trained scikit-learn models into various programming languages, such as Java, JavaScript, Ruby, C, PHP, and Swift
- Yellowbrick: Yellowbrick library provides a set of tools, known as "Visualizers", that enhance the Scikit-learn API by enabling interactive model selection. It merges Scikit-learn with Matplotlib to create visualizations that support your machine learning workflow
- Hypergraphx: Hypergraphx (HGX) is a Python library for higher-order network analysis
- Skforecast: "Skforecast is a Python library that eases using Scikit-learn regressors as single and multi-step forecasters. It also works with any regressor compatible with the Scikit-learn API [like LightGBM, XGBoost, and CatBoost]" -- Skforecast's website
- Functime: Functime is a Python library for fast forecasting and embeddings of time series data
- trl: Transformer Reinforcement Learning (trl) is a Python library for training transformer-based language models with reinforcement learning
- BorutaShap: BorutaShap is a Python library for feature selection that combines two powerful techniques: Boruta algorithm and Shapley values method
- EvalML: "EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions." -- its website
- Feature-engine: Feature-engine is a Python library built on top of scikit-learn that offers a wide range of techniques for feature transformation, selection, and extraction
- Optuna: Optuna is an open source hyperparameter optimization framework to automate hyperparameter search of machine learning models
- skrub: "Skrub (formerly dirty_cat) is a Python library that facilitates prepping your tables for machine learning." -- skrub's GitHub repo
- Modin: Modin is an alternative to Pandas that allows for parallel processing on multiple cores and works well with large datasets
- Polars: Polars is another alternative to Pandas. It's fast, and its code is powered by a multithreaded, vectorized query engine written in Rust programming language
- Shapash: Shapash is a Python library for making machine learning models interpretable. It uses Shap or Lime libraries as a backend to compute feature contributions
- Pandera: Pandera is a Python library for testing and validating the data
- Scikit-lego: Scikit-lego is an extension library for Scikit-learn that adds new datasets, transformations, and models on top of Scikit-learn
- Timm: Timm is a computer vision library on top of PyTorch providing computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, etc. to make it easier to work with computer vision tasks
- Ibis: Ibis is a dataframe library that supports different backends, including DuckDB and Polars
- Trax: Trax is a high-level deep learning library that makes it easy to create end-to-end machine learning pipelines
- Torchvision: Torchvision is a library based on PyTorch that consists of popular datasets, model architectures, and common image transformations for computer vision tasks
- PyPy Python Implementation: A stackless alternative implementation for Python's runtime
- Useful Metrics: A collection of useful ML-related scoring and learning metrics
- XGboost Benchmarks
- Franchise Notebook
- Orange
- Weka: The famous Data Mining tool from where Kiwis live
- ELKI: A Data Mining software framework in Java
- Julia Programming Language: New language for Scientific Computing and HPC
- SQL Notebook
- IPython: An augmented Python shell with lots of features
- Incanter: A statistical analysis environment for a Lisp (Clojure, to be exact)
- Torch: Scientific Computing framework running on top of Lua's Just in Time compiler, brilliant idea!
- BPython: An advanced Python shell
- RAnalyticFlow: Great environment for Data Flow Programming in R
- SPMF: A Java Data Mining library with tons of excellent algorithms
- SageMath: Open source math software system, a complete math environment for everyone
- H2O AI Platform: A software tool for Big Data analysis that can be used for data mining and machine learning tasks. It has tons of features
- Various ML Cheat Sheets
- OpenRefine: An open-source data cleansing and refinement tool
- Deep Learning Papers
- Apache Mxnet: A high-performance and scalable ANN framework for Deep Learning
- Material for the book 'Python for Data Analysis'
- Encog Machine Learning Framework: An ML library for Java and .NET with a focus on ANN algorithms
- Apache Spark MLlib: An ML library on top of your Spark cluster!
- Awesome-Python: A comprehensive list of Pythonic resources (libraries, frameworks, etc.)
- GATE: A mature text processing toolkit in Java
- MALLET: "MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text." - MALLET's website
- MLPack: A fast ML library written in C++ with bindings to Python
- t-SNE: Implementation of famous t-distributed stochastic neighbour embedding algorithm for various languages
- Caffe
- Apache Singa
- CompLearn
- SNAP
- Apache PredictionIO
- JGraphT: A Java library for working with graphs with tonnes of features
- JGraphX: A Java library for diagramming and visualizing graphs
- Microsoft Distributed Machine Learning Toolkit
- Microsoft Cognitive Toolkit
- BIDMat: A CPU- and GPU-accelerated matrix library for data mining tasks
- BIDMach
- Apache SystemML
- Apache Mahout
- Accord.NET: Accord.NET is a Machine Learning framework written in C#. Its API is available for .NET, and it also comes combined with some audio and image processing libraries entirely written in C#
- BitMAGIC Library
- Cassovary
- Dex: A friendly tool written in Java for data analysis and data mining
- Apache OpenNLP
- OpenNN: A C++ library to build complex neural network models
- MOA: A tool for mining stream data by the people who created Weka
- MLPACK: C++ Machine Learning library for scalability, speed, and ease-of-use
- MOSES: "Moses is a statistical machine translation system that allows you to train translation models for any language pair automatically." - Moses's website
- Parallel Python: A Python module for parallel execution of code on SMP and Cluster environment
- BeautifulSoup: A handy Python library to digest almost anything from the World Wide Web
- Wordbatch: A library for parallel feature extraction on textual data (and potentially other complex data types)
- Mypy: Static typing facilities for Python
- SKIL: A platform for managing the life cycle of an ML/DS-related project or product
- An unofficial Python extension package repository for Windows
- LIBOL: An online learning library
- Smile: "Smile is a fast and comprehensive machine learning system"- Smile's website
- Tablesaw: A dataframe and visualization library for Java
- TensorFlow Models: A repository of models and examples built with TensorFlow
- Curated list of graph embedding methods: A collection of paper-code pairs for state-of-the-art graph embedding (a.k.a. network representation learning) algorithms
- Curated list of resources for Recommender Systems
- Pegasus: An open-source system for analyzing huge graphs. It seems it has not been developed or maintained for a long time
- Dataset: A handy tool to simplify the task of reading and writing to relational databases
- Twython: A Twitter API library in pure Python with tonnes of features
- Apache TinkerPop: Excellent graph storage and computation framework. It can be used as a graph analytics platform and a graph database system; love the little gremlins!
- Graphexp: Graphexp is a visual graph explorer with D3.js for TinkerPop
- Scilab: An open-source numerical computation language and environment, a great Matlab alternative
- Glow: A compiler for neural network hardware accelerators that targets various hardware platforms
- GraphJet: A real-time graph processing library in Java
- GraphDrawing: A lovely graph analysis and drawing library in Java
- Sketch Library: A C++ library for data summarization
- The Lemur Project: A collection of search engine, text processing, and data mining tools and libraries in C++ and Java, such as RankLib for ranking
- VisPy: A Python library for interactive scientific visualization that is designed to be fast, scalable, and easy to use
- Awesome Machine Learning: A curated list of remarkable Machine Learning frameworks, libraries, software, etc
- MOA Framework: A fantastic Java software environment and framework for Stream Mining
- MEKA: A multi-label classification tool that works on top of Weka
- Mulan: A Java library for learning on multi-labelled data
- Dlib: A fast Machine Learning library implemented in C++ for solving real-world data problems
- MITIE: A library and tool for information extraction on text data; it is built on top of Dlib, with bindings for languages like Java and Python
- GraphStream: GraphStream is a Java library for analyzing and visualizing dynamic graphs
- Cytoscape: A complex network (graph) visualization tool in Java
- Gephi: A network visualization and analysis tool in Java
- SocNetV: A handy social network visualization tool
- Visone: Yet another handy social network analysis and visualization tool
- Flashlight: A fast Machine Learning library in C++
- Machine Learning with Python: A collection of ML algorithms and their sample use-cases implemented in Python
- TANAGRA: "TANAGRA is a free DATA MINING software for academic and research purposes" - TANAGRA's website
- KNIME: KNIME is an open-source data analytics, reporting, and data integration platform
- MG4J: An open-source, high-performance full-text search engine written in Java
- WebGraph: A Java framework for working on massive graphs
- RTree: Reactive implementations of immutable in-memory R-tree and R*-tree in Java
- Recommender Systems: A useful repository of material about Recommender Systems (e.g. best practices for building them)
- Awesome-Graph: A curated list of resources (e.g., libraries, frameworks, and databases) related to graphs
- Parallel Graph AnalytiX (PGX): A graph processing and analytics toolbox from Oracle which is written in Java
- ROOT: A scientific toolbox for data processing and analysis in C++
- Stanford Topic Modeling Toolbox (TMT): TMT is a friendly Java toolkit for topic modelling on textual data
- Java Data Mining Package: An open-source Java package for mining massive datasets that implements a vast collection of algorithms (e.g., clustering, regression, classification, and graphical models)
- ScalaNLP: A numerical computation and Data Mining library suite written in Scala, with an emphasis on NLP
- Vegas: A very flexible declarative data visualization library in Scala that works with Apache Spark right out of the box
- DeepLearning.scala: A simple Scala library for creating complex artificial neural networks by ThoughtWorks
- XAPIAN: An open-source search engine library with bindings to be used in many high-level programming languages, for example, Python, Java, and Lua!
- DataMelt: "DataMelt is a free software for numeric computation, mathematics, statistics, symbolic calculations, data analysis, and data visualization" - DataMelt's website
- Luna: A functional programming language to create data processing friendly programs in a WYSIWYG way
- NetLogo: A computational multi-agent development and simulation environment, an incredible tool for investigating complex phenomena via implementing simple computational rules for agents!
- LabPlot: LabPlot is a lovely application for data analysis and plotting; it is part of KDE Project!
- Meta Toolkit: A fast software toolkit implementing many useful ML algorithms; it is written in C++
- Record Linkage Tools: A collection of valuable resources for record deduplication and linkage
- Gunrock: A GPU-based graph analytics and processing library that works with CUDA
- Papers on Graph Analytics: A thorough list of publications related to graphs covering many interesting topics
- GraphIt: GraphIt - "A High-Performance Domain Specific Language for Graph Analytics" - GraphIt's website
- SMORe: A handy tool and library for fast weighted graph embedding in C++
- Warp-ctc: A fast parallel implementation of CTC, for both CPU and GPU
- Grew: Grew is a graph library and tool written in OCaml with applications in NLP; it is a companion tool for the book Application of Graph Rewriting to Natural Language Processing
- ZVTM: A handy graph visualization library for Java
- mrJob: A Python library to create MapReduce jobs and run them on multiple machines (i.e., in a cluster)
- Metanome: A collection of interesting materials (e.g., algorithms, code, articles) related to data profiling
- Graphillion: Graphillion is a software library for working with many graphs in a parallel fashion
- Awesome graph classification: A very comprehensive collection of graph embedding, classification, and representation learning papers with the code!
- VFML: Very Fast ML (hence the name VFML) is a fast C library for mining very massive data streams
- Talisman: Talisman is a modular JavaScript library for NLP and Machine Learning activities
- StyleGAN: StyleGAN is the TensorFlow implementation of a proposed architecture for GANs from NVIDIA. You can use it to create photo-realistic pictures of people who don't exist!
- Java String Similarity: A Java library implementing a collection of helpful text similarity/distance measures
- Label Studio: Label Studio is a handy tool with a friendly UI for labelling your data (e.g., records and documents)
- GraphML: GraphML is a graph representation and serialization file format based on XML that could store many different types of graphs with their attributes without loss of information
- Taco: A compiler for compiling and executing general tensor algebra operations on sparse tensors in machine code for CPUs and GPUs
- Libspatialindex: Libspatialindex contains many robust geolocational indexing algorithms like R*-tree and TPR-tree
- NLP Best Practices: A collection of best practices and their examples in the NLP domain from Microsoft
- Tulip: Tulip is an excellent open-source data visualization and analysis software toolbox; it is perfect for working with graphs and graph datasets
- Juno: Juno is an IDE based on Atom for the Julia programming language
- BoofCV: A real-time machine vision and image processing library in Java
- cuDF: cuDF is a library with an API similar to Pandas that is built on the Apache Arrow columnar memory format; cuDF uses GPU routines for loading, joining, aggregating, filtering, and otherwise manipulating data
- LASER toolkit: LASER (Language-Agnostic SEntence Representations) is a software toolkit for sentence embedding for about 100 different languages
- Idyll: "A toolkit for creating data-driven stories and explorable explanations" - Idyll's website
- DeepLearning4J: A Java-based software toolbox for building and training deep artificial neural networks
- NeMo: NeMo is a software toolkit for building AI applications
- TRAINS Agent: TRAINS Agent is a DevOps tool for setting up and running an AI experiment on a cluster computing environment
- TensorFlow Hub: TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of deep learning models
- AIX360: An explainable AI (XAI) toolkit to interpret Machine Learning models
- Catalyst: Catalyst is a tool for making Deep Learning experiments on PyTorch reproducible
- TensorFlowJS: TensorFlowJS is a JavaScript library to use TensorFlow models in web applications in the browser
- Kst: Kst is a handy data visualization tool from the KDE project
- AMIDST: AMIDST is a Java software toolbox for probabilistic modelling of data
- LIBFFM: "LIBFFM is an open-source tool for field-aware factorization machines (FFM)"; it has been used to win a few real-world data science challenges on Kaggle
- jLDADMM: A Java package for LDA and DMM topic modelling
- Stan: "Stan is a state-of-the-art platform for statistical modelling and high-performance statistical computation." - Stan's website
- DEAP: Distributed Evolutionary Algorithms in Python
- Stanford CoreNLP
- SimMetrics
- Neuroph
- MLeap
- HiSee
- JSAT: A Java-based library implementing a standard set of data analysis and machine learning activities
- Ark tweet Pos tagger: CMU ARK Twitter part-of-speech tagger
- JASP
- Jamovi
- DynaML: "DynaML is a Scala & JVM Machine Learning toolbox for research, education & industry." -- Its website
- ExecuteMulan: A Java utility for running the multi-label classification methods from Mulan more easily
- GTN: "GTN is an open-source framework for automatic differentiation with a powerful, expressive type of graph called weighted finite-state transducers (WFSTs). Just as PyTorch provides a framework for automatic differentiation with tensors, GTN provides such a framework for WFSTs. AI researchers and engineers can use GTN to train graph-based machine learning models more effectively." -- Facebook
- Tribuo: An open-source machine learning library in Java from Oracle
- Neo4J's Graph Data Science Library
- Libbow: "Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modelling, and information retrieval programs." -- Its website
- Doccano
- FACTORIE
- BigDL: BigDL is a deep learning library that runs in an Apache Spark cluster
- ImageJ: "ImageJ is an open-source image processing program designed for multidimensional scientific images." -- ImageJ's website
- OjAlgo: Oj! Algorithms is a pure-Java linear algebra and mathematical optimization library
- Ivy: Ivy is a unifying framework for different deep learning frameworks such as PyTorch and TensorFlow. You need to write your code in Ivy once and run it on many deep learning frameworks
- Gradio: Gradio is a user interface (UI) framework for building UIs for machine learning and data science applications in Python
- Label Studio: Label Studio is a tool for annotating and labeling datasets
- VisiData: "A terminal interface for exploring and arranging tabular data." -- on VisiData's GitHub repository
- Optimum: "Optimum is an extension of Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware." -- Optimum's GitHub page
- CASL Project: An umbrella project including a wide range of machine learning tools aiming at simplicity and scalability
- sparse_dot_topn: A Python library for fast sparse matrix multiplication and top-n (or top-k) similarity search
- Streamlit: Streamlit is a Python library enabling users to create shareable web apps directly from their data scripts in minutes. It requires no front-end experience, making it easy to use for anyone familiar with Python
- LangChain: LangChain is a framework for developing applications powered by language models
- Zarr: "Zarr is a Python package providing an implementation of compressed, chunked, N-dimensional arrays, designed for use in parallel computing" -- Zarr's GitHub page
- Netron: Netron is a visualization tool for viewing the architecture of machine learning models
- Prophet: Prophet is a tool for time series forecasting from Facebook; it can handle missing values, outliers, and large datasets
- Deep Java Library (DJL): DJL is a high-level open-source Java framework for building deep learning models
- Burn: Burn is a deep learning framework written in Rust. It can be used for model experimentation, training, and deployment for both researchers and real-world use cases
- Faiss: Faiss is a C++ library with Python wrappers for fast vector similarity search and clustering
- MLX: MLX is an array framework from Apple for machine learning on Apple silicon
- Vega & Vega-Lite: Vega and Vega-Lite are two libraries for creating interactive data visualizations. Vega gives more control to users, while Vega-Lite is more high-level and easier to use
- DSPy: "DSPy is a framework for algorithmically optimizing LM prompts and weights, especially when LMs are used one or more times within a pipeline." -- text in DSPy's GitHub repo
- Papermill: Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks
- CTGAN: "CTGAN is a collection of Deep Learning based synthetic data generators for single table data" -- CTGAN's GitHub repository
- Ali Rahimi's talk at NIPS 2017: Good talk from someone inside the field
- Procrustes: How could we live without Wikipedia?
- Probably Approximately Correct
- Foundations of Machine Learning: An excellent book to start learning ML; a must for every ML enthusiast
- Scikit-Learn website: Scikit-learn's website itself is a great resource to learn!
- What Computers Still Can't Do: Some old and still valid criticisms of Strong AI! Are AI and Alchemy the same?
- Readings in Database Systems (The Red Book): An enjoyable read. It was a little hard for me to follow at first, but a great many resources are mentioned at the end of each chapter, and it gives significant insights into the history, trends, and future of DBMSs and Data Processing Platforms
- Kolmogorov Complexity: Let's compress everything!
- Machine Learning Meets Databases: A very informative and also easy to follow article, including a short introduction to Machine Learning and also describing its relation to Data Mining and Databases
- A gentle introduction to Tensors and their uses: An introduction to Tensors and their sample applications. Don't let the math scare you off! :0)
- Mining Massive Datasets: A lovely blend of theory and application for what can be done to data
- Networks, Crowds, and Markets: Reasoning About a Highly Connected World: Very insightful if you would like to know more about the interconnected world and networks