Skip to content

Commit

Permalink
Merge pull request steineggerlab#12 from steineggerlab/master
Browse files Browse the repository at this point in the history
Sync with master
  • Loading branch information
khb7840 authored Oct 13, 2022
2 parents 5c71c86 + fa50aa0 commit 619f5b2
Show file tree
Hide file tree
Showing 10 changed files with 149 additions and 80 deletions.
23 changes: 18 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,20 @@
# Foldcomp
Foldcomp compresses protein structures with torsion angles effectively. It compresses the backbone atoms to 8 bytes and the side chain to additionally 4-5 byes per residue, thus an averaged-sized protein of 350 residues requires ~4.2kb.
<p align="center">
<img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/foldcomp_strong_marv.png" max-height="300px" height="300" display="block" margin-left="auto" margin-right="auto" display="block"/>
</p>
Foldcomp compresses protein structures with torsion angles effectively. It compresses the backbone atoms to 8 bytes and the side chain to additionally 4-5 byes per residue, thus an averaged-sized protein of 350 residues requires ~6kb.

<p align="center"><img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/Abstract.jpg" height="400" /></p>

Foldcomp is a compression method and format to compress protein structures requiring only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format, FCZ.
Foldcomp efficient compressed format stores protein structures requiring only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format (FCZ).

> Foldcomp currently only supports compression of single chain PDB files
<br clear="right"/>
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/format_benchmark_dark.png">
<img src="https://raw.githubusercontent.com/steineggerlab/foldcomp/master/.github/img/format_benchmark_light.png" alt="Left panel: Foldcomp data format, saving amino acid residue in 13 byte. Top right panel: Foldcomp decompression is as fast as gzip. Bottom right panel: Foldcomp compression ratio is higher than pulchra and gzip." max-width="720px" max-height="400px" width="auto" height="auto">
</picture>
</p>

## Usage

Expand Down Expand Up @@ -43,11 +52,14 @@ foldcomp extract [--plddt|--fasta] [-t number] <fcz_dir|tar> [<output_dir>]
foldcomp check <fcz_file>
foldcomp check [-t number] <fcz_dir|tar>
# RMSD
foldcomp rmsd <pdb1|cif1> <pdb2|cif2>
# Options
-h, --help print this help message
-t, --threads threads for (de)compression of folders/tar files [default=1]
-a, --alt use alternative atom order [default=false]
-b, --break interval size to save absolute atom coordinates [default=200]
-b, --break interval size to save absolute atom coordinates [default=25]
-z, --tar save as tar file [default=false]
--plddt extract pLDDT score (only for extraction mode)
--fasta extract amino acid sequence (only for extraction mode)
Expand Down Expand Up @@ -88,3 +100,4 @@ with foldcomp.open("test/example_db", ids=ids) as db:
<a href="https://github.com/steineggerlab/foldcomp/graphs/contributors">
<img src="https://contributors-img.firebaseapp.com/image?repo=steineggerlab/foldcomp" />
</a>

2 changes: 1 addition & 1 deletion src/amino_acid.h
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ class AminoAcid {
}
};

std::map<std::string, AminoAcid> AminoAcids() {
static std::map<std::string, AminoAcid> AminoAcids() {
std::map<std::string, AminoAcid> output;
output.emplace("ALA", AminoAcid('A', "ALA", "Alanine", // name
{"N", "CA", "C", "O", "CB",}, // atoms
Expand Down
34 changes: 23 additions & 11 deletions src/atom_coordinate.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
* Description:
* The data type to handle atom coordinate comes here.
* ---
* Last Modified: 2022-09-29 17:30:10
* Last Modified: 2022-10-06 19:59:30
* Modified By: Hyunbin Kim (khb7840@gmail.com)
* ---
* Copyright © 2021 Hyunbin Kim, All rights reserved
Expand Down Expand Up @@ -69,11 +69,11 @@ bool AtomCoordinate::operator!=(const AtomCoordinate& other) const {
return !(*this == other);
}

bool AtomCoordinate::isBackbone() {
bool AtomCoordinate::isBackbone() const {
return ((this->atom == "N") ||(this->atom == "CA") ||(this->atom == "C"));
}

void AtomCoordinate::print(int option) {
void AtomCoordinate::print(int option) const {
std::cout << "Atom: " << this->atom << std::endl;
if (option != 0) {
std::cout << "Residue: " << this->residue << std::endl;
Expand Down Expand Up @@ -127,16 +127,16 @@ std::vector<AtomCoordinate> extractChain(
}

void printAtomCoordinateVector(std::vector<AtomCoordinate>& atoms, int option) {
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
curr_atm.print(option);
}
}

std::vector<AtomCoordinate> filterBackbone(std::vector<AtomCoordinate>& atoms) {
std::vector<AtomCoordinate> filterBackbone(const std::vector<AtomCoordinate>& atoms) {
std::vector<AtomCoordinate> output;
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
if (curr_atm.isBackbone()) {
output.push_back(curr_atm);
output.emplace_back(curr_atm);
}
}
return output;
Expand Down Expand Up @@ -168,7 +168,7 @@ void reverse(char* s) {
s[i] = s[j];
s[j] = c;
}
}
}

void itoa_pos_only(int n, char* s) {
int i = 0;
Expand Down Expand Up @@ -373,11 +373,11 @@ std::vector<AtomCoordinate> getAtomsWithResidueIndex(
std::vector<std::string> atomNames
) {
std::vector<AtomCoordinate> output;
for (AtomCoordinate curr_atm : atoms) {
for (const AtomCoordinate& curr_atm : atoms) {
if (curr_atm.residue_index == residue_index) {
for (std::string atom_name : atomNames) {
for (const std::string& atom_name : atomNames) {
if (curr_atm.atom == atom_name) {
output.push_back(curr_atm);
output.emplace_back(curr_atm);
}
}
}
Expand All @@ -396,3 +396,15 @@ std::vector< std::vector<AtomCoordinate> > getAtomsWithResidueIndex(
}
return output;
}

float RMSD(std::vector<AtomCoordinate>& atoms1, std::vector<AtomCoordinate>& atoms2) {
// RMSD: Root Mean Square Deviation
float sum = 0;
// Sum of square of distance
for (size_t i = 0; i < atoms1.size(); i++) {
sum += pow(atoms1[i].coordinate.x - atoms2[i].coordinate.x, 2);
sum += pow(atoms1[i].coordinate.y - atoms2[i].coordinate.y, 2);
sum += pow(atoms1[i].coordinate.z - atoms2[i].coordinate.z, 2);
}
return sqrt(sum / atoms1.size());
}
9 changes: 5 additions & 4 deletions src/atom_coordinate.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
* Description:
* The data type to handle atom coordinate comes here.
* ---
* Last Modified: 2022-09-27 11:50:15
* Last Modified: 2022-10-06 19:54:30
* Modified By: Hyunbin Kim (khb7840@gmail.com)
* ---
* Copyright © 2021 Hyunbin Kim, All rights reserved
Expand Down Expand Up @@ -48,8 +48,8 @@ class AtomCoordinate {
//BackboneChain toCompressedResidue();

//methods
bool isBackbone();
void print(int option = 0);
bool isBackbone() const;
void print(int option = 0) const ;
void setTempFactor(float tf) { this->tempFactor = tf; };
};

Expand All @@ -70,7 +70,7 @@ std::vector<AtomCoordinate> extractChain(
std::vector<AtomCoordinate>& atoms, std::string chain
);

std::vector<AtomCoordinate> filterBackbone(std::vector<AtomCoordinate>& atoms);
std::vector<AtomCoordinate> filterBackbone(const std::vector<AtomCoordinate>& atoms);

void printAtomCoordinateVector(std::vector<AtomCoordinate>& atoms, int option = 0);

Expand Down Expand Up @@ -106,6 +106,7 @@ std::vector< std::vector<AtomCoordinate> > getAtomsWithResidueIndex(
std::vector<AtomCoordinate>& atoms, std::vector<int> residue_index,
std::vector<std::string> atomNames = {"N", "CA", "C"}
);
float RMSD(std::vector<AtomCoordinate>& atoms1, std::vector<AtomCoordinate>& atoms2);

template <int32_t T, int32_t P>
void ftoa(float n, char* s);
2 changes: 1 addition & 1 deletion src/dbreader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ void* make_reader(const char *data_name, const char *index_name, int32_t data_mo
fclose(file);
}

char cache_name[FILENAME_MAX];
// char cache_name[FILENAME_MAX];
// if ((data_mode & DB_READER_NO_CACHE) == 0) {
// sprintf(cache_name, "%s.cache.%d", index_name, data_mode);

Expand Down
35 changes: 16 additions & 19 deletions src/foldcomp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
*/
#include "foldcomp.h"

#include "amino_acid.h"
#include "sidechain.h"
#include "torsion_angle.h"
#include "utility.h"
Expand Down Expand Up @@ -275,7 +276,7 @@ int reconstructBackboneReverse(
int discretizeSideChainTorsionAngles(
std::vector< std::vector<float> >& torsionPerResidue,
std::vector<std::string>& residueNames,
std::map<std::string, AminoAcid>& AAS,
const std::map<std::string, AminoAcid>& AAS,
SideChainDiscretizers& scDiscretizers,
std::map<std::string, std::vector<Discretizer> >& scDiscretizersMap,
std::vector<unsigned int>& output
Expand Down Expand Up @@ -341,28 +342,23 @@ int continuizeSideChainTorsionAngles(
std::map<std::string, std::vector<Discretizer> >& scDiscretizersMap,
std::vector< std::vector<float> >& output
) {
// Declare
int success = 0;
std::string currResidue;
int currResidueTorsionNum = 0;
int currIndex = 0;
float currTorsion = 0;

scDiscretizersMap = initializeSideChainDiscMap();
success = fillSideChainDiscretizerMap(scDiscretizers, scDiscretizersMap);
int success = fillSideChainDiscretizerMap(scDiscretizers, scDiscretizersMap);
std::vector< std::vector<float> > torsionPerResidue;
std::vector<float> currTorsionVector;
FixedAngleDiscretizer currDiscretizer = FixedAngleDiscretizer(pow(2, NUM_BITS_TEMP) - 1);
FixedAngleDiscretizer currDiscretizer(pow(2, NUM_BITS_TEMP) - 1);
// Iterate
int currIndex = 0;
for (size_t i = 0; i < residueNames.size(); i++) {
currResidue = residueNames[i];
currResidueTorsionNum = getSideChainTorsionNum(currResidue);
std::string currResidue = residueNames[i];
int currResidueTorsionNum = getSideChainTorsionNum(currResidue);
currTorsionVector.clear();
currTorsionVector.resize(currResidueTorsionNum);
for (int j = 0; j < currResidueTorsionNum; j++) {
// Get current torsion angle vector
//currTorsion = scDiscretizersMap[currResidue][j].continuize(torsionDiscretized[currIndex]);
currTorsion = currDiscretizer.continuize(torsionDiscretized[currIndex]);
float currTorsion = currDiscretizer.continuize(torsionDiscretized[currIndex]);
currTorsionVector[j] = currTorsion;
currIndex++;
}
Expand Down Expand Up @@ -1425,18 +1421,18 @@ void Foldcomp::printSideChainTorsion(std::string filename) {
std::map<std::string, float> currTorsionAngle;

for (int i = 0; i < this->nResidue; i++) {
currBondAngle = calculateBondAngles(atomByResidue[i], this->AAS[this->residueThreeLetter[i]]);
currBondLength = calculateBondLengths(atomByResidue[i], this->AAS[this->residueThreeLetter[i]]);
currBondAngle = calculateBondAngles(atomByResidue[i], this->AAS.at(this->residueThreeLetter[i]));
currBondLength = calculateBondLengths(atomByResidue[i], this->AAS.at(this->residueThreeLetter[i]));
for (const auto& bl: currBondLength) {
outfile << i << "," << this->residueThreeLetter[i] << ",BondLength,";
outfile << bl.first << "," << bl.second << ",NA,NA,NA,";
outfile << this->AAS[this->residueThreeLetter[i]].bondLengths[bl.first] << ",";
outfile << bl.second - this->AAS[this->residueThreeLetter[i]].bondLengths[bl.first] << "\n";
outfile << this->AAS.at(this->residueThreeLetter[i]).bondLengths.at(bl.first) << ",";
outfile << bl.second - this->AAS.at(this->residueThreeLetter[i]).bondLengths.at(bl.first) << "\n";
}
for (const auto& ba : currBondAngle) {
outfile << i << "," << this->residueThreeLetter[i] << ",BondAngle,";
outfile << ba.first << "," << ba.second << ",NA,NA,NA," << this->AAS[this->residueThreeLetter[i]].bondAngles[ba.first] << ",";
outfile << ba.second - this->AAS[this->residueThreeLetter[i]].bondAngles[ba.first] << "\n";
outfile << ba.first << "," << ba.second << ",NA,NA,NA," << this->AAS.at(this->residueThreeLetter[i]).bondAngles.at(ba.first) << ",";
outfile << ba.second - this->AAS.at(this->residueThreeLetter[i]).bondAngles.at(ba.first) << "\n";
}
for (size_t j = 0; j < this->sideChainAnglesPerResidue[i].size(); j++) {
outfile << i << "," << this->residueThreeLetter[i] << ",TorsionAngle,";
Expand Down Expand Up @@ -1528,7 +1524,7 @@ void printValidityError(ValidityError err, std::string& filename) {
}
}

void _reorderAtoms(std::vector<AtomCoordinate>& atoms, AminoAcid& aa) {
void _reorderAtoms(std::vector<AtomCoordinate>& atoms, const AminoAcid& aa) {
std::vector<AtomCoordinate> newAtoms = atoms;
for (size_t i = 0; i < atoms.size(); i++) {
if (atoms[i].atom == aa.altAtoms[i]) {
Expand Down Expand Up @@ -1774,3 +1770,4 @@ int getSideChainTorsionNum(std::string residue) {
return out;
}

const std::map<std::string, AminoAcid> Foldcomp::AAS = AminoAcid::AminoAcids();
16 changes: 8 additions & 8 deletions src/foldcomp.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,6 @@
#pragma once
#include "float3d.h"

#include "amino_acid.h"
#include "atom_coordinate.h"
#include "discretizer.h"
#include "nerf.h"
Expand All @@ -33,6 +32,9 @@
#include <string>
#include <vector>

// forward declaration
class AminoAcid;

// CONSTANTS
#define NUM_TYPE_OF_ANGLES 6
#define MAGICNUMBER_LENGTH 4
Expand All @@ -50,6 +52,8 @@
#define C_TO_N_DIST 1.3311
#define PRO_N_TO_CA_DIST 1.353

#define DEFAULT_ANCHOR_THRESHOLD 25

// ERROR CODES FOR CHECKING VALIDITY
enum ValidityError {
SUCCESS = 0,
Expand Down Expand Up @@ -246,7 +250,7 @@ int fillSideChainDiscretizerMap(
std::map<std::string, std::vector<Discretizer> >& scDiscretizersMap
);

void _reorderAtoms(std::vector<AtomCoordinate>& atoms, AminoAcid& aa);
void _reorderAtoms(std::vector<AtomCoordinate>& atoms, const AminoAcid& aa);

// Print
void printCompressedResidue(BackboneChain& res);
Expand Down Expand Up @@ -284,10 +288,6 @@ class Foldcomp {
);

public:
Foldcomp(){
AminoAcid aa;
this->AAS = aa.AminoAcids();
};
bool isPreprocessed = false;
bool isCompressed = false;
bool backwardReconstruction = true;
Expand All @@ -299,7 +299,7 @@ class Foldcomp {
int nSideChainTorsion = 0;
int nInnerAnchor = 0;
int nAllAnchor = 0;
int anchorThreshold = 200;
int anchorThreshold = DEFAULT_ANCHOR_THRESHOLD;
// Indices for residue & atom
int idxResidue = 0;
int idxAtom = 0;
Expand Down Expand Up @@ -356,7 +356,7 @@ class Foldcomp {
std::vector<unsigned int> ca_c_n_angleDiscretized;
Discretizer c_n_ca_angleDisc;
std::vector<unsigned int> c_n_ca_angleDiscretized;
std::map<std::string, AminoAcid> AAS;
const static std::map<std::string, AminoAcid> AAS;
Nerf nerf;
// Sidechain angles
std::vector<float> sideChainAngles;
Expand Down
Loading

0 comments on commit 619f5b2

Please sign in to comment.