This repository contains code for predicting how small molecules change gene expression in different cell types. The goal is to accelerate drug discovery and basic biology research by developing methods that accurately predict the effects of chemical perturbations in new cell types.
The dataset used for this project can be found here. It includes various data files, such as adata_obs_meta.csv, adata_train.parquet, de_train.parquet, and more. These files are used for training and evaluation.
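As a rough illustration of how these files might be read, the sketch below loads whichever of the three named files are present in a directory. It assumes pandas is installed (it is not in the requirements list above, so treat it as an assumption); `load_tables` and `DATA_FILES` are hypothetical names, not part of this repository.

```python
from pathlib import Path

import pandas as pd  # assumption: pandas is available alongside PyArrow

# File names taken from the dataset description above.
DATA_FILES = ("adata_obs_meta.csv", "adata_train.parquet", "de_train.parquet")


def load_tables(data_dir="."):
    """Load each known dataset file found in data_dir into a DataFrame.

    Hypothetical helper: returns a dict keyed by file stem, skipping
    any file that is not present in data_dir.
    """
    tables = {}
    for name in DATA_FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # file has not been downloaded; skip it
        if path.suffix == ".parquet":
            # pandas picks an installed Parquet engine (PyArrow if available).
            tables[path.stem] = pd.read_parquet(path)
        else:
            tables[path.stem] = pd.read_csv(path)
    return tables
```

Calling `load_tables("path/to/data")` would then give a dictionary such as `{"de_train": <DataFrame>, ...}`, depending on which files exist in that directory.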
- Python 3.7 or higher
- LightGBM
- Scikit-learn
- PyArrow
- Clone the repository:
  git clone https://github.com/spmfte/single-cell-gene-expression-prediction.git
- Navigate to the project directory:
  cd single-cell-gene-expression-prediction
- Install the required packages:
  pip install -r requirements.txt
- Run the Jupyter notebook pert30.ipynb for a step-by-step walkthrough of the project.
- Modify the code as needed for your specific use case and dataset.
In the Jupyter notebook, we perform data exploration, visualize the dataset, and analyze its characteristics.
We preprocess the data by handling missing values and scaling the features to prepare it for model development.
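One possible way to express this preprocessing step with Scikit-learn is sketched below. The README does not say which imputation or scaling method the notebook uses, so median imputation and standard scaling are assumptions, and `build_preprocessor` and the toy matrix are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_preprocessor():
    """Hypothetical helper: impute missing values, then scale features.

    Assumption: median imputation and zero-mean/unit-variance scaling.
    The notebook may use different choices.
    """
    return make_pipeline(SimpleImputer(strategy="median"), StandardScaler())


# Toy feature matrix with one missing value (illustrative data only).
X_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

preprocessor = build_preprocessor()
X_clean = preprocessor.fit_transform(X_demo)  # no NaNs, standardized columns
```

Fitting the pipeline on the training split only and reusing it with `preprocessor.transform` on the test split keeps test-set statistics out of the scaling parameters.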
We train a LightGBM regression model to predict chemical perturbations' impact on gene expression in different cell types.
The model's performance is evaluated using root mean squared error (RMSE) on a test dataset.
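For reference, RMSE is the square root of the mean squared difference between true and predicted values. A minimal NumPy sketch follows; the function name `rmse` is a hypothetical helper, and the notebook may compute the metric differently (for instance with Scikit-learn).

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Lower values indicate predictions closer to the measured expression values, and a perfect prediction gives an RMSE of 0.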