Materials and problem sets for the course Machine Learning for Molecular Engineering (Spring 2023) taught at MIT.
Instructors: Prof. Connor Coley and Prof. Rafael Gomez-Bombarelli
Teaching Assistants: Soojung Yang and Yitong Tseo
Course Numbers: 3.C01/3.C51, 10.C01/10.C51, 20.C01/20.C51
Ungraded problem set (no submission) to practice using Google Colab and numpy.
Data size: ~10^2
A basic linear classification problem to get you started in the course. You will use logistic regression to diagnose cancer, apply linear methods with L1 and L2 regularization, and examine how each penalty affects your results. You will also use cross-validation to tune your model's hyperparameters.
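To make the effect of regularization concrete, here is a minimal sketch of L2-regularized logistic regression trained by gradient descent in numpy. It is not the problem set's solution: the synthetic data, the function name `fit_logistic_l2`, and the learning rate and step count are all assumptions chosen for illustration.

```python
import numpy as np

def fit_logistic_l2(X, y, lam, lr=0.1, n_steps=2000):
    """Gradient descent on mean log-loss + (lam / 2) * ||w||^2 (bias is not penalized)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / n + lam * w      # log-loss gradient plus L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic two-class data (illustrative only, not the course dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = (X @ true_w + 0.3 * rng.normal(size=200) > 0).astype(float)

w_weak, b_weak = fit_logistic_l2(X, y, lam=1e-3)
w_strong, b_strong = fit_logistic_l2(X, y, lam=1.0)

# A stronger penalty shrinks the weight vector toward zero
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

In practice the penalty strength `lam` is the kind of hyperparameter one would select by cross-validation rather than fix by hand.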
Data size: ~10^3
You will apply an MLP regressor to predict properties of perovskites. You will compare different representations of the chemical composition of a perovskite crystal. You will also use hyperopt to search over hyperparameters of your MLP architecture.
Data size: ~10^3
You will apply an MLP regressor to predict MHC binding to peptides. You will compare different representations of the amino acid composition of a peptide. You will also use hyperopt to search over hyperparameters of your MLP architecture.
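As an illustration of why the choice of representation matters, the sketch below contrasts two common peptide encodings: a position-aware one-hot matrix and a position-agnostic composition vector. These two encodings and the helper names `one_hot` and `composition` are assumptions for illustration; the problem set may use different representations.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(peptide):
    """Position-aware encoding: (length, 20) matrix with a single 1 per row."""
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        x[pos, AA_INDEX[aa]] = 1.0
    return x

def composition(peptide):
    """Position-agnostic encoding: fraction of each residue, length-20 vector."""
    return one_hot(peptide).mean(axis=0)

# Two example peptides with the same residues in reversed order
pep_a, pep_b = "SIINFEKL", "LKEFNIIS"

same_composition = np.allclose(composition(pep_a), composition(pep_b))
same_one_hot = np.array_equal(one_hot(pep_a), one_hot(pep_b))
print(same_composition, same_one_hot)
```

The composition vector cannot distinguish the two sequences, while the one-hot matrix can, which is one reason a model's accuracy may depend on the input representation.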
Data size: ~10^4
This problem set has two parts: 1) In the first part, you will use PyTorch to train an LSTM-based classifier to identify DNA binding sites. 2) In the second part, you will reduce a high-dimensional dataset to lower dimensions with PCA and t-SNE, and assess whether the resulting low-dimensional embedding is meaningful.
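For the dimensionality reduction part, a minimal sketch of PCA computed through the singular value decomposition may help fix ideas. The synthetic data and the function name `pca` are assumptions for illustration; in the problem set you would more likely call a library implementation.

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components (computed via SVD)."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                  # variance fraction per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Synthetic data: 300 points in 10-D that vary mostly along 2 hidden directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

Z, explained = pca(X, n_components=2)
print(Z.shape, explained.sum())
```

The explained-variance fraction is one simple check on whether a low-dimensional embedding retains most of the structure in the data. t-SNE has no direct analogue of this quantity, so its embeddings need to be judged differently, for example by whether known labels separate.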
This problem set is an application of computer vision to molecular engineering. You will use a deep learning model to classify steel surface defects and perform image segmentation to identify cell nuclei.
Data size: ~10^6
This problem set is more involved than the previous ones. You will implement your own graph neural networks to predict molecular properties and train a variational autoencoder to generate new molecules from a learned continuous latent representation.
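The core operation of a graph neural network can be sketched in a few lines of numpy: each node updates its features from its own state and the sum of its neighbors' states, and a sum over nodes gives a graph-level vector. The update rule, the toy three-node graph, and the names `message_passing_step`, `W_self`, and `W_nbr` are assumptions for illustration, not the architecture used in the problem set.

```python
import numpy as np

def message_passing_step(A, H, W_self, W_nbr):
    """One update: h_i' = ReLU(h_i W_self + (sum of neighbor features) W_nbr)."""
    messages = A @ H                       # row i holds the sum over neighbors of node i
    return np.maximum(0.0, H @ W_self + messages @ W_nbr)

# Toy graph: a 3-node chain 0-1-2 with random (hypothetical) node features
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 5))                # 3 nodes, 5 input features each
W_self = rng.normal(size=(5, 4))
W_nbr = rng.normal(size=(5, 4))

H1 = message_passing_step(A, H, W_self, W_nbr)
graph_vec = H1.sum(axis=0)                 # sum readout over nodes

# Relabeling the nodes should not change the graph-level vector
perm = [2, 0, 1]
A_p = A[perm][:, perm]
H_p = H[perm]
graph_vec_p = message_passing_step(A_p, H_p, W_self, W_nbr).sum(axis=0)
print(np.allclose(graph_vec, graph_vec_p))
```

Invariance to node ordering is the property that makes this kind of model suitable for molecules, where atom numbering is arbitrary.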
Data size: ~10^3
This problem set applies ML to predicting solvation free energies. You will participate in a private Kaggle challenge.
Data size: ~10^3
You will complete a short clustering exercise and participate in an ML competition to predict drug screening results.
Data size: ~10^3
You will participate in an ML competition to predict solvation free energies of solute/solvent pairs.