Information Theory

Informally, information is a message stored or transmitted using some medium. Messages are formed by arranging symbols in specific patterns; a message is, in effect, a selection from a collection of possible symbols. Information can be measured and compared using a quantity called entropy.

The message space is the set of all possible messages. Given n yes-or-no questions, there are 2^n possible answer sequences.

  • Example: the Polybius square is a 5-by-5 grid that can represent 25 distinct messages.
  • Example: the Sushruta Samhita asks how many different tastes can be made from six different spices.
  • Example: Lord George Murray's shutter telegraph.
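
As a quick illustration of counting a message space (the helper name below is invented for illustration), the 2^n rule can be checked directly:

```python
# Size of a message space built from n independent selections.
def message_space_size(n_choices: int, symbols_per_choice: int = 2) -> int:
    """Number of distinct messages formed by n choices, each from k possible symbols."""
    return symbols_per_choice ** n_choices

# n yes/no questions -> 2^n possible answer sequences.
print(message_space_size(6))        # 64
# One symbol drawn from a 5-by-5 Polybius square -> 25 distinct messages.
print(message_space_size(1, 25))    # 25
```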

Symbols

A symbol can be broadly defined as the current state of some observable signal that persists for a fixed period of time.

A signaling event is a change from one state to another.

What is a symbol space?

The differences between signaling events must be great enough that noise cannot randomly turn one type of event into another.

Example: The Quadruplex Telegraph

The quadruplex telegraph was limited by electrical noise: minute, undesired currents in an electrical signal. Noise results from natural processes such as heat, storms, and even the Big Bang.

Categories of communication systems

There are discrete, continuous, and mixed communication systems.

  • Discrete systems: both the message and the signal are a sequence of discrete symbols.
  • Continuous systems: both the message and the signal are treated as continuous functions.
  • Mixed systems: both discrete and continuous variables appear, e.g., PCM transmission of speech.

Discrete noiseless systems

In general, a discrete channel is "a system whereby a sequence of choices from a finite set of elementary symbols S₁, ..., Sₙ can be transmitted from one point to another."

"It is not required that all possible sequences of the Si be capable of transmission on the system; certain sequences only may be allowed."

A discrete source can be considered to be a Markov process.
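
As a minimal sketch of a discrete source modeled as a Markov process (the two-symbol alphabet and transition probabilities below are invented for illustration), each symbol is drawn from a distribution that depends only on the previous symbol:

```python
import random

# Hypothetical two-symbol source: the next symbol's distribution depends
# only on the current symbol (a first-order Markov process).
TRANSITIONS = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.4, "B": 0.6},
}

def generate(length: int, start: str = "A") -> str:
    """Emit a sequence of symbols from the Markov source."""
    state, out = start, []
    for _ in range(length):
        out.append(state)
        symbols = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in symbols]
        state = random.choices(symbols, weights=weights)[0]
    return "".join(out)

print(generate(20))  # e.g. "AAAAABAAAAAAABBBAAAA"
```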

Can we think of a protein sequence as a message or signal of a discrete communication system? If so, is it noiseless?

Signal vs Noise: Do protein sequences contain noise? If so, eliminating it during preprocessing could make the algorithm more efficient.

  • Can we treat the symbols in protein sequences as signals?
  • What is noise?

Approximating to a natural language

A series of simple artificial languages can be used to approximate a natural language; see the code sketch after the list below.

  1. zero-order approximation
    • all letters are chosen with the same probability
    • all letters are chosen independently
  2. first-order approximation
    • each letter has the same probability that it has in the natural language
    • successive letters are chosen independently
  3. second-order approximation
    • digram structure
    • after a letter is chosen, the next one is chosen according to the frequencies with which the various letters follow the first one
  4. third-order approximation
    • trigram structure
    • each letter is chosen with probabilities which depend on the preceding two letters
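
Below is a minimal sketch of the first- and second-order approximations (the sample text and function names are made up for illustration): letter and digram frequencies estimated from a sample text drive the choice of each successive letter.

```python
import random
from collections import Counter, defaultdict

def first_order(text: str, length: int) -> str:
    """First-order approximation: letters drawn independently with corpus frequencies."""
    counts = Counter(text)
    letters, weights = zip(*counts.items())
    return "".join(random.choices(letters, weights=weights, k=length))

def second_order(text: str, length: int) -> str:
    """Second-order approximation: each letter drawn from the digram
    frequencies of the letter that precedes it."""
    digrams = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        digrams[a][b] += 1
    out = [random.choice(text)]
    for _ in range(length - 1):
        follow = digrams[out[-1]]
        if not follow:                       # dead end: restart with a random letter
            out.append(random.choice(text))
            continue
        letters, weights = zip(*follow.items())
        out.append(random.choices(letters, weights=weights)[0])
    return "".join(out)

sample = "no man is an island entire of itself"
print(first_order(sample, 30))
print(second_order(sample, 30))
```

The second-order function is exactly the digram-driven process described in item 3 of the list above; the same pattern extends to trigrams for the third-order approximation.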

Entropy

Entropy is a measure of how much uncertainty there is in an outcome.

Theorem (Shannon 1948). Suppose H(p₁, p₂, ..., pₙ) is a measure of the uncertainty of the outcome given a set of possible events whose probabilities of occurrence are p₁, p₂, ..., pₙ. Under a few natural assumptions, the only such measure has the form

H = -K Σ pᵢ log(pᵢ), where K is a positive constant.

Some interesting properties of H:

  1. H = 0 only when the outcome is certain (a single pᵢ equals 1); otherwise H > 0.
  2. H reaches its maximum, log(n), when all pᵢ are equal (pᵢ = 1/n).
  3. H(x, y) ≤ H(x) + H(y), with equality only if the events x and y are independent.
  4. Any change towards the equalization of the probabilities pᵢ increases H.
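
As a minimal sketch (the function name is an invention for illustration), the formula above with K = 1 can be computed directly, and a few of the listed properties checked numerically:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H = Σ -pᵢ·log(pᵢ), taking K = 1 and 0·log(0) = 0."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

# A certain outcome carries no uncertainty.
print(entropy([1.0, 0.0, 0.0]))             # 0.0
# H is maximal when all outcomes are equally likely: log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0
# Equalizing the probabilities increases H.
print(entropy([0.7, 0.3]) < entropy([0.5, 0.5]))   # True
```

With K = 1 and base-2 logarithms the unit of H is the bit; choosing another base or constant only rescales the measure.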