Skip to content

Commit

Permalink
Merge pull request #12 from steineggerlab/master
Browse files Browse the repository at this point in the history
Sync with master
  • Loading branch information
khb7840 authored Oct 13, 2022
2 parents 8f7c71a + 21d44d8 commit c00c0ea
Show file tree
Hide file tree
Showing 19 changed files with 184 additions and 88 deletions.
Binary file removed .github/img/Abstract.jpg
Binary file not shown.
Binary file added .github/img/foldcomp_strong_marv.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added .github/img/format_benchmark_dark.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added .github/img/format_benchmark_light.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
7 changes: 6 additions & 1 deletion .github/workflows/build-static.yml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,12 @@ jobs:
./foldcomp compress ../test/test.pdb
./foldcomp decompress ../test/test.fcz
./foldcomp compress ../test/test.cif.gz
./foldcomp decompress ../test/test.cif.fcz
./foldcomp decompress -a ../test/test.cif.fcz
RMSD1=$(./foldcomp rmsd ../test/test.pdb ../test/test_fcz.pdb | cut -f6)
awk -v check=$RMSD1 -v target=0.102262 'BEGIN { diff = check - target; if (diff < 0) diff = -diff; if (diff > 0.001) { print check"!="target; exit 1 } }'
RMSD2=$(./foldcomp rmsd ../test/test.cif.gz ../test/test.cif_fcz.pdb | cut -f6)
awk -v check=$RMSD2 -v target=0.144428 'BEGIN { diff = check - target; if (diff < 0) diff = -diff; if (diff > 0.001) { print check"!="target; exit 1 } }'
echo "All good!"
compile:
strategy:
Expand Down
2 changes: 2 additions & 0 deletions .github/workflows/python.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ on:
branches: [ "master", "dev" ]
pull_request:
branches: [ "master", "dev" ]
release:
types: [published]

jobs:
lint:
Expand Down
23 changes: 18 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,20 @@
# Foldcomp
Foldcomp compresses protein structures with torsion angles effectively. It compresses the backbone atoms to 8 bytes and the side chain to additionally 4-5 byes per residue, thus an averaged-sized protein of 350 residues requires ~4.2kb.
<p align="center">
<img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/foldcomp_strong_marv.png" max-height="300px" height="300" display="block" margin-left="auto" margin-right="auto" display="block"/>
</p>
Foldcomp compresses protein structures with torsion angles effectively. It compresses the backbone atoms to 8 bytes and the side chain to additionally 4-5 byes per residue, thus an averaged-sized protein of 350 residues requires ~6kb.

<p align="center"><img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/Abstract.jpg" height="400" /></p>

Foldcomp is a compression method and format to compress protein structures requiring only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format, FCZ.
Foldcomp efficient compressed format stores protein structures requiring only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format (FCZ).

> Foldcomp currently only supports compression of single chain PDB files
<br clear="right"/>
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/format_benchmark_dark.png">
<img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/format_benchmark_light.png" alt="Left panel: Foldcomp data format, saving amino acid residue in 13 byte. Top right panel: Foldcomp decompression is as fast as gzip. Bottom right panel: Foldcomp compression ratio is higher than pulchra and gzip." max-width="720px" max-height="400px" width="auto" height="auto">
</picture>
</p>

## Usage

Expand Down Expand Up @@ -43,11 +52,14 @@ foldcomp extract [--plddt|--fasta] [-t number] <fcz_dir|tar> [<output_dir>]
foldcomp check <fcz_file>
foldcomp check [-t number] <fcz_dir|tar>
# RMSD
foldcomp rmsd <pdb1|cif1> <pdb2|cif2>
# Options
-h, --help print this help message
-t, --threads threads for (de)compression of folders/tar files [default=1]
-a, --alt use alternative atom order [default=false]
-b, --break interval size to save absolute atom coordinates [default=200]
-b, --break interval size to save absolute atom coordinates [default=25]
-z, --tar save as tar file [default=false]
--plddt extract pLDDT score (only for extraction mode)
--fasta extract amino acid sequence (only for extraction mode)
Expand Down Expand Up @@ -88,3 +100,4 @@ with foldcomp.open("test/example_db", ids=ids) as db:
<a href="https://github.com/steineggerlab/foldcomp/graphs/contributors">
<img src="https://contributors-img.firebaseapp.com/image?repo=steineggerlab/foldcomp" />
</a>

12 changes: 10 additions & 2 deletions build.sh
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
#!/bin/sh -e
# File: build.sh
# Project: foldcomp
# Created: 2022-05-30 17:11:11
# Author: Hyunbin Kim (khb7840@gmail.com)
# Description:
# Build script for foldcomp.
# ---
# Last Modified: 2022-09-20 11:52:22
# Last Modified: 2022-10-07 17:48:48
# Modified By: Hyunbin Kim (khb7840@gmail.com)
# ---
# Copyright © 2022 Hyunbin Kim, All rights reserved
Expand All @@ -26,4 +27,11 @@ cmake --build ./build --target foldcomp
./build/foldcomp decompress ./test/compressed.fcz ./test/decompressed.pdb
# Test - Gzipped CIF
./build/foldcomp compress ./test/test.cif.gz ./test/compressed_cif.fcz
./build/foldcomp decompress ./test/compressed_cif.fcz ./test/decompressed_cif.pdb
./build/foldcomp decompress -a ./test/compressed_cif.fcz ./test/decompressed_cif.pdb
# RMSD
RMSD1=$(./build/foldcomp rmsd ./test/test.pdb ./test/decompressed.pdb | cut -f6)
awk -v check=$RMSD1 -v target=0.102262 'BEGIN { diff = check - target; if (diff < 0) diff = -diff; if (diff > 0.001) { print check"!="target; exit 1 } }'
RMSD2=$(./build/foldcomp rmsd ./test/test.cif.gz ./test/decompressed_cif.pdb | cut -f6)
awk -v check=$RMSD2 -v target=0.144428 'BEGIN { diff = check - target; if (diff < 0) diff = -diff; if (diff > 0.001) { print check"!="target; exit 1 } }'

echo "All good!"
20 changes: 16 additions & 4 deletions foldcomp-py-examples.ipynb
Original file line number Diff line number Diff line change
@@ -1,5 +1,15 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/steineggerlab/foldcomp/blob/master/foldcomp-py-examples.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
Expand All @@ -17,7 +27,7 @@
"cell_type": "code",
"source": [
"# Installing foldcomp\n",
"!pip install -q \"foldcomp @ git+https://github.com/steineggerlab/foldcomp\""
"!pip install -q \"foldcomp==0.0.2\""
],
"metadata": {
"id": "37KdADYrKHyI"
Expand Down Expand Up @@ -58,8 +68,9 @@
"with open(\"test.pdb\", \"r\") as f:\n",
" original = f.read()\n",
"\n",
"# Compress give\n",
"fcz = foldcomp.compress(\"test.pdb\", original)\n",
"# Compress input with reset points every 25 residues\n",
"# Should give a RMSD ~0.07A. A reset point every 200 residues will give a RMSD ~0.2A\n",
"fcz = foldcomp.compress(\"test.pdb\", original, anchor_residue_threshold=25)\n",
"\n",
"# Decompress again\n",
"(name, pdb) = foldcomp.decompress(fcz)\n",
Expand Down Expand Up @@ -198,7 +209,8 @@
"metadata": {
"colab": {
"collapsed_sections": [],
"provenance": []
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
Expand Down
2 changes: 1 addition & 1 deletion foldcomp/foldcomp.cxx
Original file line number Diff line number Diff line change
Expand Up @@ -301,7 +301,7 @@ static PyObject *foldcomp_compress(PyObject* /* self */, PyObject *args, PyObjec
return NULL;
}

int threshold = 200;
int threshold = DEFAULT_ANCHOR_THRESHOLD;
if (anchor_residue_threshold != NULL) {
threshold = PyLong_AsLong(anchor_residue_threshold);
}
Expand Down
2 changes: 1 addition & 1 deletion src/amino_acid.h
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ class AminoAcid {
}
};

std::map<std::string, AminoAcid> AminoAcids() {
static std::map<std::string, AminoAcid> AminoAcids() {
std::map<std::string, AminoAcid> output;
output.emplace("ALA", AminoAcid('A', "ALA", "Alanine", // name
{"N", "CA", "C", "O", "CB",}, // atoms
Expand Down
34 changes: 23 additions & 11 deletions src/atom_coordinate.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
* Description:
* The data type to handle atom coordinate comes here.
* ---
* Last Modified: 2022-09-29 17:30:10
* Last Modified: 2022-10-06 19:59:30
* Modified By: Hyunbin Kim (khb7840@gmail.com)
* ---
* Copyright © 2021 Hyunbin Kim, All rights reserved
Expand Down Expand Up @@ -69,11 +69,11 @@ bool AtomCoordinate::operator!=(const AtomCoordinate& other) const {
return !(*this == other);
}

bool AtomCoordinate::isBackbone() {
bool AtomCoordinate::isBackbone() const {
return ((this->atom == "N") ||(this->atom == "CA") ||(this->atom == "C"));
}

void AtomCoordinate::print(int option) {
void AtomCoordinate::print(int option) const {
std::cout << "Atom: " << this->atom << std::endl;
if (option != 0) {
std::cout << "Residue: " << this->residue << std::endl;
Expand Down Expand Up @@ -127,16 +127,16 @@ std::vector<AtomCoordinate> extractChain(
}

void printAtomCoordinateVector(std::vector<AtomCoordinate>& atoms, int option) {
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
curr_atm.print(option);
}
}

std::vector<AtomCoordinate> filterBackbone(std::vector<AtomCoordinate>& atoms) {
std::vector<AtomCoordinate> filterBackbone(const std::vector<AtomCoordinate>& atoms) {
std::vector<AtomCoordinate> output;
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
if (curr_atm.isBackbone()) {
output.push_back(curr_atm);
output.emplace_back(curr_atm);
}
}
return output;
Expand Down Expand Up @@ -168,7 +168,7 @@ void reverse(char* s) {
s[i] = s[j];
s[j] = c;
}
}
}

void itoa_pos_only(int n, char* s) {
int i = 0;
Expand Down Expand Up @@ -373,11 +373,11 @@ std::vector<AtomCoordinate> getAtomsWithResidueIndex(
std::vector<std::string> atomNames
) {
std::vector<AtomCoordinate> output;
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
if (curr_atm.residue_index == residue_index) {
for (std::string atom_name : atomNames) {
for (const std::string& atom_name : atomNames) {
if (curr_atm.atom == atom_name) {
output.push_back(curr_atm);
output.emplace_back(curr_atm);
}
}
}
Expand All @@ -396,3 +396,15 @@ std::vector< std::vector<AtomCoordinate> > getAtomsWithResidueIndex(
}
return output;
}

float RMSD(std::vector<AtomCoordinate>& atoms1, std::vector<AtomCoordinate>& atoms2) {
// RMSD: Root Mean Square Deviation
float sum = 0;
// Sum of square of distance
for (size_t i = 0; i < atoms1.size(); i++) {
sum += pow(atoms1[i].coordinate.x - atoms2[i].coordinate.x, 2);
sum += pow(atoms1[i].coordinate.y - atoms2[i].coordinate.y, 2);
sum += pow(atoms1[i].coordinate.z - atoms2[i].coordinate.z, 2);
}
return sqrt(sum / atoms1.size());
}
9 changes: 5 additions & 4 deletions src/atom_coordinate.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
* Description:
* The data type to handle atom coordinate comes here.
* ---
* Last Modified: 2022-09-27 11:50:15
* Last Modified: 2022-10-06 19:54:30
* Modified By: Hyunbin Kim (khb7840@gmail.com)
* ---
* Copyright © 2021 Hyunbin Kim, All rights reserved
Expand Down Expand Up @@ -48,8 +48,8 @@ class AtomCoordinate {
//BackboneChain toCompressedResidue();

//methods
bool isBackbone();
void print(int option = 0);
bool isBackbone() const;
void print(int option = 0) const ;
void setTempFactor(float tf) { this->tempFactor = tf; };
};

Expand All @@ -70,7 +70,7 @@ std::vector<AtomCoordinate> extractChain(
std::vector<AtomCoordinate>& atoms, std::string chain
);

std::vector<AtomCoordinate> filterBackbone(std::vector<AtomCoordinate>& atoms);
std::vector<AtomCoordinate> filterBackbone(const std::vector<AtomCoordinate>& atoms);

void printAtomCoordinateVector(std::vector<AtomCoordinate>& atoms, int option = 0);

Expand Down Expand Up @@ -106,6 +106,7 @@ std::vector< std::vector<AtomCoordinate> > getAtomsWithResidueIndex(
std::vector<AtomCoordinate>& atoms, std::vector<int> residue_index,
std::vector<std::string> atomNames = {"N", "CA", "C"}
);
float RMSD(std::vector<AtomCoordinate>& atoms1, std::vector<AtomCoordinate>& atoms2);

template <int32_t T, int32_t P>
void ftoa(float n, char* s);
2 changes: 1 addition & 1 deletion src/dbreader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ void* make_reader(const char *data_name, const char *index_name, int32_t data_mo
fclose(file);
}

char cache_name[FILENAME_MAX];
// char cache_name[FILENAME_MAX];
// if ((data_mode & DB_READER_NO_CACHE) == 0) {
// sprintf(cache_name, "%s.cache.%d", index_name, data_mode);

Expand Down
35 changes: 16 additions & 19 deletions src/foldcomp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
*/
#include "foldcomp.h"

#include "amino_acid.h"
#include "sidechain.h"
#include "torsion_angle.h"
#include "utility.h"
Expand Down Expand Up @@ -275,7 +276,7 @@ int reconstructBackboneReverse(
int discretizeSideChainTorsionAngles(
std::vector< std::vector<float> >& torsionPerResidue,
std::vector<std::string>& residueNames,
std::map<std::string, AminoAcid>& AAS,
const std::map<std::string, AminoAcid>& AAS,
SideChainDiscretizers& scDiscretizers,
std::map<std::string, std::vector<Discretizer> >& scDiscretizersMap,
std::vector<unsigned int>& output
Expand Down Expand Up @@ -341,28 +342,23 @@ int continuizeSideChainTorsionAngles(
std::map<std::string, std::vector<Discretizer> >& scDiscretizersMap,
std::vector< std::vector<float> >& output
) {
// Declare
int success = 0;
std::string currResidue;
int currResidueTorsionNum = 0;
int currIndex = 0;
float currTorsion = 0;

scDiscretizersMap = initializeSideChainDiscMap();
success = fillSideChainDiscretizerMap(scDiscretizers, scDiscretizersMap);
int success = fillSideChainDiscretizerMap(scDiscretizers, scDiscretizersMap);
std::vector< std::vector<float> > torsionPerResidue;
std::vector<float> currTorsionVector;
FixedAngleDiscretizer currDiscretizer = FixedAngleDiscretizer(pow(2, NUM_BITS_TEMP) - 1);
FixedAngleDiscretizer currDiscretizer(pow(2, NUM_BITS_TEMP) - 1);
// Iterate
int currIndex = 0;
for (size_t i = 0; i < residueNames.size(); i++) {
currResidue = residueNames[i];
currResidueTorsionNum = getSideChainTorsionNum(currResidue);
std::string currResidue = residueNames[i];
int currResidueTorsionNum = getSideChainTorsionNum(currResidue);
currTorsionVector.clear();
currTorsionVector.resize(currResidueTorsionNum);
for (int j = 0; j < currResidueTorsionNum; j++) {
// Get current torsion angle vector
//currTorsion = scDiscretizersMap[currResidue][j].continuize(torsionDiscretized[currIndex]);
currTorsion = currDiscretizer.continuize(torsionDiscretized[currIndex]);
float currTorsion = currDiscretizer.continuize(torsionDiscretized[currIndex]);
currTorsionVector[j] = currTorsion;
currIndex++;
}
Expand Down Expand Up @@ -1425,18 +1421,18 @@ void Foldcomp::printSideChainTorsion(std::string filename) {
std::map<std::string, float> currTorsionAngle;

for (int i = 0; i < this->nResidue; i++) {
currBondAngle = calculateBondAngles(atomByResidue[i], this->AAS[this->residueThreeLetter[i]]);
currBondLength = calculateBondLengths(atomByResidue[i], this->AAS[this->residueThreeLetter[i]]);
currBondAngle = calculateBondAngles(atomByResidue[i], this->AAS.at(this->residueThreeLetter[i]));
currBondLength = calculateBondLengths(atomByResidue[i], this->AAS.at(this->residueThreeLetter[i]));
for (const auto& bl: currBondLength) {
outfile << i << "," << this->residueThreeLetter[i] << ",BondLength,";
outfile << bl.first << "," << bl.second << ",NA,NA,NA,";
outfile << this->AAS[this->residueThreeLetter[i]].bondLengths[bl.first] << ",";
outfile << bl.second - this->AAS[this->residueThreeLetter[i]].bondLengths[bl.first] << "\n";
outfile << this->AAS.at(this->residueThreeLetter[i]).bondLengths.at(bl.first) << ",";
outfile << bl.second - this->AAS.at(this->residueThreeLetter[i]).bondLengths.at(bl.first) << "\n";
}
for (const auto& ba : currBondAngle) {
outfile << i << "," << this->residueThreeLetter[i] << ",BondAngle,";
outfile << ba.first << "," << ba.second << ",NA,NA,NA," << this->AAS[this->residueThreeLetter[i]].bondAngles[ba.first] << ",";
outfile << ba.second - this->AAS[this->residueThreeLetter[i]].bondAngles[ba.first] << "\n";
outfile << ba.first << "," << ba.second << ",NA,NA,NA," << this->AAS.at(this->residueThreeLetter[i]).bondAngles.at(ba.first) << ",";
outfile << ba.second - this->AAS.at(this->residueThreeLetter[i]).bondAngles.at(ba.first) << "\n";
}
for (size_t j = 0; j < this->sideChainAnglesPerResidue[i].size(); j++) {
outfile << i << "," << this->residueThreeLetter[i] << ",TorsionAngle,";
Expand Down Expand Up @@ -1528,7 +1524,7 @@ void printValidityError(ValidityError err, std::string& filename) {
}
}

void _reorderAtoms(std::vector<AtomCoordinate>& atoms, AminoAcid& aa) {
void _reorderAtoms(std::vector<AtomCoordinate>& atoms, const AminoAcid& aa) {
std::vector<AtomCoordinate> newAtoms = atoms;
for (size_t i = 0; i < atoms.size(); i++) {
if (atoms[i].atom == aa.altAtoms[i]) {
Expand Down Expand Up @@ -1774,3 +1770,4 @@ int getSideChainTorsionNum(std::string residue) {
return out;
}

const std::map<std::string, AminoAcid> Foldcomp::AAS = AminoAcid::AminoAcids();
Loading

0 comments on commit c00c0ea

Please sign in to comment.