RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design

[2024-08-21] News: We released R3Design, a comprehensive evaluation system for RNA sequence design and prediction. APIs and Colab demos are also provided. Feel free to check out our new repo!

[2024-08-15] Update: Thank you all for your interest in and inquiries about our paper. We apologize for not providing detailed documentation and a demo for so long. This has now been resolved. Feel free to check out our updated documentation and Colab! :)

Introduction

While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity.

In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks.
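
To make the contrastive objective more concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not the loss implemented in this repository: the function names `cosine` and `info_nce`, the temperature value, and the toy 2-D embeddings are assumptions made for illustration. The sketch also assumes that a sample-level positive is a perturbed view of the same structure and that a cluster-level positive is another structure from the same cluster.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: negative log-softmax of the positive pair.

    The loss is small when the anchor is closer to the positive than to
    any of the negatives, and large otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidate pairs.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings, made up for illustration only.
anchor = [1.0, 0.0]
perturbed_view = [0.9, 0.1]   # assumed sample-level positive
same_cluster = [0.8, 0.3]     # assumed cluster-level positive
other_structures = [[0.0, 1.0], [-1.0, 0.0]]

sample_loss = info_nce(anchor, perturbed_view, other_structures)
cluster_loss = info_nce(anchor, same_cluster, other_structures)
print(f"sample-level loss:  {sample_loss:.4f}")
print(f"cluster-level loss: {cluster_loss:.4f}")
```

In a hierarchical setup, the two terms would typically be combined (for example, summed) so that the encoder benefits from both kinds of positive pairs. How they are weighted here is left open, since it depends on the actual training code.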

Dataset

We carefully collected representative RNA tertiary structure data from two sources, RNAsolo and the Protein Data Bank (PDB). The refined data has been released here. Please download the datasets and organize them as follows.

RDesign
├── API
├── assets
├── checkpoints
├── methods
├── model
└── data
    ├── RNAsolo
    │   ├── train_data.pt
    │   ├── val_data.pt
    │   ├── test_data.pt
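
Before training or testing, it can help to confirm that the downloaded files ended up where the tree above expects them. The following is a small optional sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, and the file list is copied from the layout shown above.

```python
from pathlib import Path

# Expected files, taken from the directory tree above.
EXPECTED_FILES = ("train_data.pt", "val_data.pt", "test_data.pt")


def missing_dataset_files(repo_root="."):
    """Return the expected dataset files that are absent under data/RNAsolo."""
    data_dir = Path(repo_root) / "data" / "RNAsolo"
    return [
        str(data_dir / name)
        for name in EXPECTED_FILES
        if not (data_dir / name).is_file()
    ]


if __name__ == "__main__":
    # Run from the RDesign repository root.
    missing = missing_dataset_files(".")
    if missing:
        print("Missing dataset files:")
        for path in missing:
            print("  " + path)
    else:
        print("All dataset files are in place.")
```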

Main Environment

cd RDesign
conda env create -f environment.yml
conda activate RDesign

Load Data

# To inspect the contents of our dataset, load it with Python's pickle module
# (the path follows the directory layout shown in the Dataset section)
import pickle
with open('data/RNAsolo/train_data.pt', 'rb') as f:
    train_data = pickle.load(f)
print(train_data[0].keys())

# For external datasets, load the data as follows:
from API.rpuzzles_dataset import RPuzzlesDataset
rfam_dataset = RPuzzlesDataset('./data/rfam_data.pt')
rpuz_dataset = RPuzzlesDataset('./data/rpuz_data.pt')
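
For a quick overview of a dataset file beyond its first entry, a short summary helper can be useful. The sketch below is not part of the repository: `summarize_dataset` is a hypothetical name, and it assumes the file is a pickled sequence of dict-like entries, as the `train_data[0].keys()` call above suggests. Since pickle can execute arbitrary code on load, only open files from a source you trust.

```python
import pickle
from collections import Counter
from pathlib import Path


def summarize_dataset(path):
    """Load a pickled dataset and report its size and how often each key occurs."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    key_counts = Counter()
    for entry in data:
        # Assumes every entry is dict-like; count each key once per entry.
        key_counts.update(entry.keys())
    return {"num_entries": len(data), "key_counts": dict(key_counts)}


if __name__ == "__main__":
    # Location taken from the directory tree in the Dataset section.
    path = Path("data/RNAsolo/train_data.pt")
    if path.is_file():
        print(summarize_dataset(path))
    else:
        print(f"{path} not found; download the dataset first.")
```

If every key appears in all entries, each count equals `num_entries`. Any smaller count points to entries with missing fields.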

Test the model

# For more details, please refer to the Colab notebook.
# It provides detailed functions and a pipeline showing how our model operates.

Colab Link:

Citation

If you find our repository or paper useful, please cite the following paper:

@inproceedings{tan2024rdesign,
  title={RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design},
  author={Tan, Cheng and Zhang, Yijie and Gao, Zhangyang and Hu, Bozhen and Li, Siyuan and Liu, Zicheng and Li, Stan Z},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

Feedback

If you have any questions about this work, please feel free to contact us by email:

Cheng Tan: tancheng@westlake.edu.cn
Yijie Zhang: yj.zhang@mail.mcgill.ca

License

This project is released under the Apache 2.0 license. See LICENSE for more information.
