In this workshop, we will walk through the process of applying machine learning to a set of data. We will download a corpus of text data, extract features from it, and apply supervised machine learning: a mathematical algorithm trains a classifier, which then assigns previously unseen data to one of a set of predefined categories.
Machine learning is a research field that sits at the intersection of statistics, artificial intelligence, and computer science. It is also known as predictive analytics or statistical learning.¹
machine learning: an application of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed
corpus: a large collection of data. In our case, this will be text data (although a corpus can contain any type of data)
dataset: a collection of related information (such as a corpus)
- variable: an attribute of the dataset (such as the type of text being analyzed)
- observation: an entry in the dataset (a single text)
- measurement: a single data point (e.g., one text's type)
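features: the properties of the data that a machine learning algorithm takes as input; often these are the dataset's variables
feature representation, feature vector: the set of features that describes a single observation, usually encoded as a list of numbers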
supervised machine learning: a machine learning task of learning a function that maps an input to an output based on example input-output pairs
unsupervised machine learning: a machine learning task used to draw inferences from datasets consisting of input data without labelled responses (lacks input-output pairs; only has input data)
algorithm: a process or set of rules to be followed in calculations (or other problem-solving operations), particularly by a computer
classification: a machine learning task used to predict a class label, which is a choice from a predefined list of possibilities
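
To make the terms above concrete, here is a minimal sketch in plain Python. The toy dataset, the function names (`extract_features`, `train`, `classify`), and the word-overlap scoring rule are all assumptions made for illustration. They are not the corpus or the algorithm used later in the workshop, and a real project would normally use a library classifier instead.

```python
from collections import Counter

# Hypothetical toy dataset: each observation is one text plus its label.
# The label ("sports" or "politics") is the variable we want to predict.
training_data = [
    ("the team won the match", "sports"),
    ("the player scored the goal", "sports"),
    ("the senate passed the bill", "politics"),
    ("the president signed the law", "politics"),
]


def extract_features(text):
    """Turn a text into a bag-of-words feature representation (word -> count)."""
    return Counter(text.lower().split())


def train(examples):
    """Supervised step: add up the word counts of all training texts per label."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(extract_features(text))
    return profiles


def classify(profiles, text):
    """Return the label whose word profile overlaps most with the text."""
    features = extract_features(text)

    def overlap(label):
        # A Counter returns 0 for words it has never seen.
        return sum(count * profiles[label][word] for word, count in features.items())

    return max(profiles, key=overlap)


model = train(training_data)

# Classify a previously unseen text into one of the predefined categories.
print(classify(model, "the goal won the match"))
```

Each `(text, label)` pair is an input-output example, which is what makes this supervised learning. Dropping the labels and grouping the texts by similarity alone would be an unsupervised task.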
1 Andreas C. Müller and Sarah Guido, Introduction to Machine Learning with Python.