Information Theory
Informally, information is a message stored or transmitted using some medium. Messages are formed by arranging symbols in specific patterns, so a message is a selection from a collection of possible symbols. Information can be measured and compared using a quantity called entropy.
The message space is the set of all possible messages. Examples:
- The Polybius square: a 5-by-5 grid that could represent 25 distinct messages.
- The Sushruta Samhita: given six different spices, how many different tastes can you make? Counting every non-empty combination, 2^6 - 1 = 63.
- Given n yes-or-no questions, there are 2^n possible answer sequences.
- Lord George Murray's shutter telegraph: six shutters, each open or closed, giving 2^6 = 64 possible signals.
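These counts can be checked with a quick sketch. A minimal Python example (the spice count assumes a "taste" means any non-empty combination of the six spices, an interpretation consistent with the classic puzzle):

```python
from itertools import combinations

# Polybius square: a 5 x 5 grid, one distinct message per cell.
polybius_messages = 5 * 5  # 25

# Sushruta Samhita: count every non-empty combination of six spices.
spices = ["A", "B", "C", "D", "E", "F"]
tastes = sum(1 for r in range(1, len(spices) + 1)
             for _ in combinations(spices, r))  # 2^6 - 1 = 63

# n yes/no questions: each question doubles the number of answer sequences.
n = 10
answer_sequences = 2 ** n  # 1024

print(polybius_messages, tastes, answer_sequences)
```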
A symbol can be broadly defined as the current state of some observable signal, which persists for a fixed period of time.
A signaling event is a change from one state to another.
What is a symbol space?
The differences between signaling events must be large enough that noise cannot randomly bump a signaling event from one type to another.
Example: The Quadruplex Telegraph
It was limited by electrical noise: minute, undesired currents in an electrical signal. Noise is a result of natural processes such as heat, storms, and even the Big Bang.
There are discrete, continuous, and mixed communication systems.
- Discrete systems ─ both the message and the signal are a sequence of discrete symbols.
- Continuous systems ─ both the message and the signal are treated as continuous functions.
- Mixed systems ─ both discrete and continuous variables appear, e.g., PCM transmission of speech.
In general, a discrete channel is "a system whereby a sequence of choices from a finite set of elementary symbols S₁, ..., Sₙ can be transmitted from one point to another."
"It is not required that all possible sequences of the Si be capable of transmission on the system; certain sequences only may be allowed."
A discrete source can be considered a Markov process.
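To make the Markov-process view concrete, here is a minimal sketch of a discrete source over three symbols; the transition probabilities are made up for illustration:

```python
import random

# Each row gives the probabilities of the next symbol given the current one.
transitions = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"A": 0.5, "B": 0.2, "C": 0.3},
    "C": {"A": 0.4, "B": 0.4, "C": 0.2},
}

def emit(state, n, rng=random.Random(0)):
    """Emit n symbols; each choice depends only on the current state."""
    out = []
    for _ in range(n):
        symbols, probs = zip(*transitions[state].items())
        state = rng.choices(symbols, weights=probs)[0]
        out.append(state)
    return "".join(out)

print(emit("A", 20))
```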
Can we think of a protein sequence as a message or signal in a discrete communication system? If so, is it noiseless? (A sketch of this framing follows the questions below.)
Signal vs Noise: Do protein sequences contain noise? If so, eliminating it during preprocessing could make the algorithm more efficient.
- Can we treat the symbols in protein sequences as signals?
- What is noise?
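One way to make these questions concrete is to treat the twenty amino acids as the symbol set and a sequence as the message. A minimal sketch with a made-up toy sequence (not real data):

```python
from collections import Counter

# Toy protein sequence over the 20-letter amino-acid alphabet.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# Relative frequency of each symbol in the message.
counts = Counter(sequence)
frequencies = {aa: n / len(sequence) for aa, n in counts.items()}
print(frequencies)
```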
A series of simple artificial languages can be used to approximate a natural language; a sketch of the first two orders follows the list below.
- zero-order approximation
  - all letters are chosen with the same probability
  - all letters are chosen independently
- first-order approximation
  - each letter has the same probability that it has in the natural language
  - successive letters are chosen independently
- second-order approximation
  - digram structure
  - after a letter is chosen, the next one is chosen according to the frequencies with which the various letters follow the first one
- third-order approximation
  - trigram structure
  - each letter is chosen with probabilities that depend on the preceding two letters
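A minimal sketch of the first- and second-order approximations, trained on a toy corpus (the repeated pangram is an arbitrary stand-in for a natural-language sample):

```python
import random
from collections import Counter, defaultdict

corpus = "the quick brown fox jumps over the lazy dog " * 50
rng = random.Random(0)

# First order: letters drawn independently with their corpus frequencies.
letters, weights = zip(*Counter(corpus).items())
first_order = "".join(rng.choices(letters, weights=weights, k=40))

# Second order (digram): each letter is drawn from the distribution of
# letters that follow the previous letter in the corpus.
followers = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    followers[a][b] += 1

text = ["t"]
for _ in range(39):
    nxt, wts = zip(*followers[text[-1]].items())
    text.append(rng.choices(nxt, weights=wts)[0])
second_order = "".join(text)

print(first_order)
print(second_order)
```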
Entropy is a measure of how much uncertainty there is in an outcome.
Theorem (Shannon 1948). Suppose H(p₁, p₂, ..., pₙ) is a measure of the uncertainty of the outcome given a set of possible events whose probabilities of occurrence are p₁, p₂, ..., pₙ. Then the only H satisfying Shannon's conditions (continuity, monotonic growth with n when all probabilities are equal, and decomposability of successive choices) has the form
H = -K Σᵢ pᵢ log(pᵢ)
where K is a positive constant.
Some interesting properties of H (checked numerically in the sketch after this list):
- If H = 0 then the outcomes are certain, otherwise H > 0.
- H attains its maximum, log(n), when all pᵢ are equal (pᵢ = 1/n).
- H(x, y) ≤ H(x) + H(y) with equality only if the events are independent.
- Any change towards the equalization of probabilities increases H.
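A minimal sketch checking these properties numerically, assuming K = 1 and base-2 logarithms (so H is in bits; both choices are conventional, not fixed by the theorem):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log(p)), with K = 1 and 0*log(0) = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A certain outcome gives H = 0.
print(entropy([1.0, 0.0, 0.0]))           # 0.0

# A uniform distribution over n = 4 outcomes gives the maximum, log2(4) = 2.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Any change toward equalizing the probabilities increases H.
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36
print(entropy([0.4, 0.2, 0.2, 0.2]))      # ~1.92, closer to the maximum
```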
References
- Shannon, C. E. "A Mathematical Theory of Communication." Bell System Technical Journal 27 (1948): 379–423, 623–656.
- "What Is FASTA Format?" Zhang Lab, University of Michigan, n.d. Web. 24 Sept. 2016.
- Rost, B., J. Liu, R. Nair, K. O. Wrzeszczynski, and Y. Ofran. "Automatic Prediction of Protein Function." Cellular and Molecular Life Sciences 60.12 (2003): 2637–2650. doi:10.1007/s00018-003-3114-8. PMID 14685688.
- Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, 2009. Web. 24 Sept. 2016.
- Nielsen, Michael. Neural Networks and Deep Learning. June 2016. Web. 24 Sept. 2016.
- "Implementing a Neural Network from Scratch in Python – An Introduction." WildML, 10 Jan. 2016. Web. 24 Sept. 2016.
- "Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs." WildML, 8 July 2016. Web. 24 Sept. 2016.
- Singh, Jagjit. Great Ideas in Information Theory, Language and Cybernetics. New York: Dover Publications, 1966. Print.
- "Structural Bioinformatics Guide: Protein Structure, Protein Sequence Analysis and Alignment, Homology Modeling." N.p., n.d. Web. 25 Sept. 2016.
- Kurzynski, M., and P. Chelminiak. "Stochastic Dynamics of Proteins and the Action of Biological Molecular Machines." Entropy 16 (2014): 1969–1982.
- Strait, B. J., and T. G. Dewey. "The Shannon Information Entropy of Protein Sequences." Biophysical Journal 71.1 (1996): 148–155. Print.
- Heunen, Christiaan, Mehrnoosh Sadrzadeh, and Edward Grefenstette. Quantum Physics and Linguistics: A Compositional, Diagrammatic Discourse. Oxford: Oxford UP, 2013. Print.