In this project we apply clustering techniques to news articles from the BBC dataset. Our goal is to identify patterns in the news without using labels. Our approach to EM follows the work of Professor Gholamreza Haffari of Monash University.
From RapidMiner: the EM (expectation-maximization) technique is similar to the K-Means technique. The basic operation of K-Means clustering algorithms is relatively simple: given a fixed number of k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible. The EM algorithm extends this basic approach: instead of assigning each example to a single cluster to maximize the differences in means for continuous variables, it computes probabilities of cluster membership based on one or more probability distributions. The goal of the clustering algorithm is then to maximize the overall probability, or likelihood, of the data given the (final) clusters.
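To make the difference from K-Means concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is an illustration under simplifying assumptions, not the method used in this project: the function name `em_gmm_1d`, the toy data, the initial means and the variance floor `min_var` are all hypothetical choices made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM (illustrative sketch only).

    Returns mixing weights, means, variances, the soft memberships from the
    last E-step, and the log-likelihood recorded at each iteration.
    """
    k, n = len(init_means), len(data)
    means = list(init_means)
    variances = [1.0] * k      # assumption: unit starting variance
    weights = [1.0 / k] * k    # assumption: uniform starting weights
    log_likelihoods = []
    resp = []

    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        log_likelihoods.append(ll)

        # M-step: re-estimate parameters from the soft memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed cluster

    return weights, means, variances, resp, log_likelihoods


# Toy data with two visible groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, means, variances, resp, log_likelihoods = em_gmm_1d(data, [0.0, 6.0])
print("means:", means)
print("membership probabilities of the first point:", resp[0])
```

Each row of `resp` is a probability distribution over clusters rather than a single cluster index, which is the soft assignment described above. The recorded log-likelihood should not decrease from one iteration to the next, which is the quantity EM maximizes.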
The data is a public dataset from the BBC comprising 2225 articles, each labeled with one of 5 categories: business, entertainment, politics, sport or tech. The dataset is split into 1490 records for training and 735 for testing. It is available on Kaggle.
We achieve good results: the clusters we find show patterns similar to those of the labelled data.
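One common way to quantify how closely clusters match known labels is cluster purity: the fraction of items that carry the majority label of their cluster. The sketch below is an assumption about how such a comparison could be done, not the evaluation used in this project; the function name `cluster_purity` and the toy inputs are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, labels):
    """Fraction of items whose label is the majority label of their cluster."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("inputs must not be empty")

    # Count how often each label occurs inside each cluster.
    counts_per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        counts_per_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(labels)


# Toy example (made-up assignments, not project results).
clusters = [0, 0, 0, 1, 1, 2]
labels = ["sport", "sport", "tech", "business", "business", "tech"]
print(cluster_purity(clusters, labels))
```

Purity is 1.0 when every cluster contains a single category. It also rises as the number of clusters grows, so it is most meaningful when the number of clusters is fixed, for example at the 5 categories of this dataset.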