Jumper uses both frecency (frequency+recency at which items have been visited) and match accuracy (how well the query matches the path stored in the database). The main idea is that, when only a few (say <=2) characters have been entered by the user (when fuzzy-finding a path), Jumper will mostly rank matches based on their frecency since these few characters contain very little information for picking the right match. However, as more characters are added, the user's prompt contains more information that can be used the improve the ranking: Jumper will put progressively more emphasis on match accuracy.
Frecency is an heuristic introduced by Mozilla that combines the frequency and recency.
Assume that a match has been visited at times
Here
Let us now motivate a bit the definition of frecency above.
Let us first consider an item that has not been visited for a few days, so that we can neglect the term
We plot this function below:
In the case where the item has just been visited, the frecency above gets an increase of
As we can see from the plot above, the frecency will typically be a number in the range
- It does not diverge at time goes. z uses something like
number-of-visits / time-since-last-visit
, which potentially diverges over time (and therefore require some "aging" mechanism). - It only requires to keep track of the "adjusted" number of visits
$\sum_i e^{-\alpha_2 (t-T_i)}$ and the time of last visit to be computed. - One can give it some statistical interpretation (see below).
Note
We presented above the frecency in the case where all visits had the same weight 1. However, Jumper can attribute different weights for different types of visit (using the -w
option). Everytime a command is executed by the user, Jumper adds a visit of weight
In its original implementation, Mozilla defines frecency as
where
Jumper adds a term
The following is a bit off topic, but it may still be interesting. In 1894, Fusakichi Omori showed empirically that the frequency of aftershocks decreases roughtly as the inverse of the time after an earthquake's main shock. More precisely, he stated that the daily number of aftershocks evolves as
The term
The match accuracy evaluates how well the query entered by the user matches the path stored in the database. Similarly to the fuzzy-finders fzf or fzy, this is done using a variant of the Needleman-Wunsch algorith.
This finds the match that maximizes
U(match) = - 4 * number-of-splits - 0.25 * total-length-of-gaps + bonuses(match)
The bonuses
above give additional points if matches happen at special places, such at the end of the path, or beginning of words. Then the accuracy is
where the maximum is computed over all matches of query
in path
. We call "match" every (possibly non-contiguous) substring of path
whose characters are the ones of query
, in the same order.
Based on these two numbers, Jumper ranks paths using
where -b <value>
.
This additive definition is motivated by the following.
Suppose that one is searching for a path, adding one character to the query
at a time.
At first, when query
has very few character (typically <=2), all the paths containing these two characters consecutively will share the maximum accuracy
.
Hence the ranking will be mostly decided by the frecency.
However, as more characters are added, the ranking will favors matches that are more accurate. The ranking will then be dominated by the accuracy of the matches.
The definitions of scores above can be motivated by the following statistical model. One can see the task of "finding the path the user thought about" given a query as a Bayesian inference problem:
- We assume that the paths picked by the user follow some prior distribution, which is based on the "frecency" of the paths.
- After picking the
path
he would like to jump to, the user is more likely to make a query that matchespath
with high accuracy. Hence the conditional distribution$P(\text{query}|\text{path})$ depends on$\text{accuracy}(\text{query}, \text{path})$ .
Under this model, Jumper simply picks the path that maximizes the posterior probability
Prior distribution: Assume that the visits of a given path is a self-exciting point process, with conditional intensity
independently from the visits to the other folders.
When the user queries the database at a time
Conditional distribution: We model
(
The posterior probability is therefore proportional to
The ranking algorithm simply ranks the paths according to their