Asis Hallab edited this page Dec 8, 2021 · 8 revisions

Development Wiki

prot-scriber terminology

Candidate description

A candidate description is a String taken from a Blast Hit's stitle that has been parsed so that uninformative parts are cut away, and that has passed the blacklist filtering. An example is alcohol dehydrogenase. A candidate description can also be interpreted as a mathematical set of words: the filtered stitle is split using a regular expression, and the resulting words represent the candidate description.
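As a rough illustration of the splitting step, the sketch below turns a filtered stitle into lower-case words. It is not the existing prot-scriber code: the function name `split_into_words` is hypothetical, and instead of a regular expression it assumes that any run of non-alphanumeric characters separates words, so that the example needs only the standard library.

```rust
/// Splits a filtered `stitle` into lower-case words.
/// Assumption: any run of non-alphanumeric characters separates words;
/// the real implementation uses a regular expression instead.
fn split_into_words(description: &str) -> Vec<String> {
    description
        .split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces, drop them.
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .collect()
}
```

With this sketch, `split_into_words("Alcohol dehydrogenase")` yields the two words `alcohol` and `dehydrogenase`.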

Phrase

A phrase is a Vec<String> of words. It is a subset of a candidate description, created by building the power set of that candidate description. A power set is the set of all subsets of a source set, including the source set itself. A phrase can also be referred to as a String, i.e. its words concatenated into a single string with a separator between them.
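The following sketch shows one possible way to build the phrases of a candidate description. The function name `build_phrases` is hypothetical, and two assumptions are made that the text above does not state: the empty subset is skipped, and a bit mask over word positions is used, which is only practical for short descriptions. Each phrase keeps the word order of its source, and collecting into a set ensures every phrase appears only once.

```rust
use std::collections::BTreeSet;

/// Builds all non-empty, order-preserving sub-vectors ("phrases") of `words`.
/// Assumption: the empty subset is not a useful phrase and is left out.
fn build_phrases(words: &[String]) -> BTreeSet<Vec<String>> {
    // One bit per word position, so this sketch only handles short inputs.
    assert!(words.len() < 32, "too many words for this sketch");
    let mut phrases = BTreeSet::new();
    for mask in 1u32..(1u32 << words.len()) {
        // Keep the words whose position bit is set, in their original order.
        let phrase: Vec<String> = words
            .iter()
            .enumerate()
            .filter(|(i, _)| mask & (1u32 << *i) != 0)
            .map(|(_, word)| word.clone())
            .collect();
        // The set silently drops phrases that were already built.
        phrases.insert(phrase);
    }
    phrases
}
```

A candidate description of three distinct words produces seven phrases under these assumptions. A production version would likely avoid cloning every word, for example by storing indices or references.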

User stories

Implement annotation module

The module should expose a single annotation function. This function receives a vector of Strings and from these produces a short human readable description (HRD). For details on the algorithm see the existing implementation in R in this branch and this presentation.

Main annotation function generateHumanReadableDescription( candidateHRDs: Vec<String> ) specification

The input, i.e. the candidateHRDs vector, is already pre-processed: it holds descriptions taken from sequence similarity search results (Blast Hits) after (i) blacklisted candidate descriptions have been removed and (ii) uninformative parts (sub-strings), like species information, have been stripped. Thus, the input vector consists of "worthy" candidate descriptions only.

The annotation function should implement the following steps. Ideally each of these steps is implemented in a separate function, so that we adhere to common software design principles like "separation of concerns" and can make each function a single issue to be implemented by one of our team members.

Note that prot-scriber processes all words in lower-case.

  1. Chop candidate descriptions into words.
  2. Build a universe word-set from all words in the candidate descriptions.
  3. Filter the universe words using an input vector of regular expressions and flag words that should be scored ("informative") and those that should not be scored ("un-informative"). See the optional input "informative regex list" in issue #9. The result is a dictionary of all words whose values are boolean; true means the word should be scored.
  4. Build word-frequencies as a HashMap whose keys are the informative words and whose values are the number of occurrences in all of the input candidate descriptions. Note that the frequency comes from the source (input) candidate descriptions, and not the phrases!
  5. Normalize frequencies to the range (0,1] by dividing by the highest frequency.
  6. Build power-sets of these candidate descriptions, i.e. all subsets of the candidate description word-set, including the candidate set itself. The resulting sub-sets are called "Phrases". Make sure that each phrase word set only appears once! (Build a set of phrases, not a Vector of phrases.) Each phrase should be a vector of words that (i) retains the order of words as they appear in the input candidate description, and (ii) enables access to single words by vector-index.
  7. Score the phrases from step 6 based on their informative (to-be-scored) words. Scoring should be done using the sum of centered inverse information content (see R implementation and above presentation for details on the formula).
  8. Select the highest scoring phrase. If several phrases have the same score, select the longest one, where length also counts the un-scored (un-informative) words.

The result of the last step is the output.
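Steps 4 and 5 could be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the regular expression filtering of step 3 is replaced by a plain predicate `is_informative` so that the example needs only the standard library. The frequencies are counted over the input candidate descriptions, not over phrases, and the scaled values are capped at one.

```rust
use std::collections::HashMap;

/// Step 4: counts how often each informative word occurs in all input
/// candidate descriptions. `is_informative` is a stand-in for the
/// regular expression based flagging of step 3.
fn word_frequencies<F>(descriptions: &[Vec<String>], is_informative: F) -> HashMap<&str, u32>
where
    F: Fn(&str) -> bool,
{
    let mut frequencies = HashMap::new();
    for description in descriptions {
        for word in description {
            if is_informative(word.as_str()) {
                *frequencies.entry(word.as_str()).or_insert(0) += 1;
            }
        }
    }
    frequencies
}

/// Step 5: divides every count by the highest count, giving values in (0, 1].
fn normalize_frequencies<'a>(frequencies: &HashMap<&'a str, u32>) -> HashMap<&'a str, f64> {
    // For an empty map the maximum is irrelevant; 1 avoids a division by zero.
    let max = frequencies.values().copied().max().unwrap_or(1) as f64;
    frequencies
        .iter()
        // `min(1.0)` caps values that rounding might push above one.
        .map(|(word, &count)| (*word, (count as f64 / max).min(1.0)))
        .collect()
}
```

Both functions borrow their input and return maps keyed by `&str`, so no word is copied in memory.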

Some remarks:

  • Make sure that, after scaling, values above one are set to one, to guard against rounding issues.
  • Make sure the inverse information content is never negative. Only centering creates negative values.
  • The above word-sets should actually be vectors that retain the order of the words as they appear in the original candidate description the subset was built from.
  • Accordingly, the above power-sets are really power-vectors; we call them phrases.
  • Splitting into word-sets and a function for regular expression filtering already exist in the Rust implementation. Make sure the wheel is not reinvented.
  • Each of the above steps should be a separate function. Preferably pass by reference, not by value, to avoid copying values in memory.
  • In contrast to a scripting language, where your code can be tested interactively in a shell or the browser, in Rust you do that with tests. See existing tests in the Rust implementation for guidance.
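To illustrate the remarks on separate functions, passing by reference, and testing, here is a sketch of step 8 together with a unit test. The function name `select_best_phrase` and the input shape (a slice of phrase and score pairs) are assumptions made for this example; the scores themselves would come from step 7, which is not shown.

```rust
use std::cmp::Ordering;

/// Step 8: picks the highest scoring phrase. On equal scores the longer
/// phrase wins, where length counts all words of the phrase.
/// Assumption: scores are never NaN; incomparable scores are treated as equal.
fn select_best_phrase(scored: &[(Vec<String>, f64)]) -> Option<&Vec<String>> {
    scored
        .iter()
        .max_by(|(phrase_a, score_a), (phrase_b, score_b)| {
            score_a
                .partial_cmp(score_b)
                .unwrap_or(Ordering::Equal)
                // Tie-break on the number of words in the phrase.
                .then_with(|| phrase_a.len().cmp(&phrase_b.len()))
        })
        .map(|(phrase, _)| phrase)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn prefers_longer_phrase_on_equal_score() {
        let scored = vec![
            (vec!["alcohol".to_string()], 0.5),
            (vec!["alcohol".to_string(), "dehydrogenase".to_string()], 0.5),
        ];
        assert_eq!(select_best_phrase(&scored), Some(&scored[1].0));
    }
}
```

Running `cargo test` executes the function in the `tests` module, which takes the place of trying the code in an interactive shell.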

Some tips:

  • Look at ./src/default.rs to see how to define global constants.
  • Use Visual Studio Code with the Rust language server Rust Analyzer and the respective VSCode plugin.

First test run

On our institute's cluster go to directory:

/mnt/data/asis/prot-scriber/Faba_Gene_Families

You will find a gene family file there: pSONIC.txt

You will also find two Blast output tables there:

  • input_queries_vs_Swissprot_blastout.txt
  • input_queries_vs_trembl_blastout.txt

As soon as prot-scriber is ready, please run it on these gene families.