Skip to content
/ alk Public

Anytime Lazy kNN: A fast anytime kNN search algorithm that assesses only true kNN candidates in a lazy fashion.

License

Notifications You must be signed in to change notification settings

IIIA-ML/alk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

IIIA-CSIC DOI

Anytime Lazy kNN

Anytime Lazy kNN (ALK) is an anytime algorithm for fast k-nearest neighbors (kNN) search. It finds exact kNN when allowed to run to completion with remarkable gain in execution time compared to a brute-force search. For applications where the gain in exact kNN search may not suffice, ALK can be interrupted earlier, and it returns best-so-far kNN together with a confidence value attached to each neighbor. Furthermore, it can automatically interrupt the search upon reaching a given confidence threshold and resume if so asked.

ALK owes its speed to detecting and assessing only true kNN candidates of the given query. Candidacy assessment is based on the triangle inequality in metric spaces. An instance in the problem space is evaluated only when it is deemed a candidate by ALK, hence the lazy.

The algorithm is developed particularly for fast retrieval in large-scale case bases where temporally related cases form sequences. A typical example is a case base of health records of patients. Treatment sessions of a particular patient can be regarded as a sequence of temporally related cases. Beside being an algorithm, ALK also introduces a methodology which may be applied to exact and approximate kNN search in domains with similar problem space characteristics.

ALK Classifier is an extension to ALK for its use as a kNN classifier. ALK Classifier also offers the option to interrupt the algorithm upon guaranteeing exact solution without the need to find all exact kNN, when possible. Thus, this option further speeds up kNN classification.

ALK and ALK Classifier have been developed as part of the authors' PhD research at the Artificial Intelligence Research Institute, IIIA-CSIC. For further reading on Anytime Lazy kNN, please refer to the articles [1] and [2]. For a deeper dive into details, please refer to the thesis [3].

If you find our software useful for your research, please consider citing it as detailed below.

Table of Contents

How to Use

ALK is implemented in python and uses some of python scientific libraries. Below you can find information on how to get the software up and running on your local machine (tested on OS X 10.14 and Ubuntu 18.04), and conduct experiments on publicly available time series datasets [4] which are used to generate demo case bases of our interest.

Prerequisites

We recommend using a virtual environment created for Python 3.7+.

For the remaining part of the document, we will assume ALK is git cloned into the local folder ~/Dev/alk and miniconda is used as the package and virtual environment manager.

Quick Start

Create and activate a virtual env:

$ conda create -n alk python=3.7.3  # Or a higher version of your choice
$ conda activate alk
(alk) $

Install required scientific libraries:

(alk) $ conda install --file ~/Dev/alk/requirements.txt

And you are ready to go...

Demos

A fully fledged experimentation with Anytime Lazy KNN consists of three steps:

  1. Gather execution insights of ALK on a case base,
  2. Using these insights, generate the Performance Distrubution Profile (PDP) of ALK for that case base,
  3. Use PDP to estimate the confidence for best-so-far kNN and, thus, to automatize interruption upon reaching given confidence thresholds. Finally, calculate gain & efficiency of confidence of ALK upon these interruptions.

In following subsections, we provide the scripts to conduct these three steps. For demo purposes, ALK uses local copies of the arff-formatted time series datasets that are publicly available in [4]. Euclidean distance is used as the metric which is normalized taking into account the min and max values of the related dataset.

If not stated otherwise in script arguments, ALK assumes that:

  • Time series datasets are downloaded to ~/Dev/alk/datasets,
  • Output files of the experiments are saved in ~/Dev/alk/results,
  • Generated PDP files are saved in ~/Dev/alk/pdps,
  • Plotted figures and exported LaTeX tables are saved in ~/Dev/alk/figures,
  • Log files are saved in ~/Dev/alk/logs.

For help on a script's command-line arguments, run the script with the --help option. Plots are saved only when a valid output file format is passed as an argument to the alk.run.fig script, e.g. -f png. If a file format is not given, the plot is displayed as a figure. Plot functions are highly parametric, see related function signature for further plotting options.

Insights Experiments

Below script, first generates a case base of above-mentioned characteristics out of a time series dataset. Then, it arbitrarily splits a part of this case base to use as test sequences. Each problem part of a case in a test sequence is posed as a query to ALK. ALK records experiment insights data to be used in below experimentation steps. Most importantly, insights data is used to generate the Quality Map of ALK. Quality Map is the statistical data composed of (number of candidate assessments, output quality) pairs gathered during an experiment.

Example:

  • Use SwedishLeaf_TEST.arff dataset
  • Generate a case base with time window width=40 and time window step=10 settings
  • Set k=9 for kNN search
  • Use 10% of the dataset as test sequences to generate queries, and the rest as the case base
  • Set log levels -> console: INFO, file: DEBUG
(alk) alk $ pwd
~/Dev/alk
(alk) alk $ python -m alk.run.run_insights "~/Dev/alk/datasets/SwedishLeaf_TEST.arff" -k 9 -t 0.1 --width 40 --step 10 --logc INFO --logf DEBUG

Plot Gain

Plot ALK's gain throughout updates of test sequences.

Example:

  • Use the insights data of the above experiment
  • Save to PNG file
(alk) alk $ python -m alk.run.fig ~/Dev/alk/results/INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td.pk -p g -f png 

Plot Insights

Plot the total and actual similarity assessments made by ALK to find kNN. actual value for a kNN member is the number of similarity assessments after which it is actually found, and total value is the total number of assessments made to ascertain its exactness.

Example:

  • Use the insights data of the above experiment
  • Save to PNG file
(alk) alk $ python -m alk.run.fig ~/Dev/alk/results/INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td.pk -p i -f png --kwargs total=True actual=True all_k=True all_ticks=False with_title=True signature=True marker_size=0

Plot Quality Map

Example:

  • Plot for kNN[2] (i.e. third top-most kNN member) and problem update=10
  • Filling in between calcs where kNN[2] has changed, culling 60% of data points
  • Save to PNG file
(alk) alk $ python -m alk.run.fig ~/Dev/alk/results/INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td.pk -p qm -f png --kwargs with_title=False signature=False colored=False urange="(10, 10)" ki=2 start_calc=3 ustep=1 fill=True q_full_scale=False cull=.6

Export Gain Table

This script generates the Average Gain for Insights Experiments LaTeX table.

Example:

  • Before running this script, copy the output files of the insights experiments that you want to report into a folder; in this example, ~/Desktop/results
(alk) alk $ python -m alk.run.gain_insights -p ~/Desktop/results --rtrim _TRAIN _TEST

Generate Performance Distribution Profile

This script generates the PDP of ALK out of the Quality Map obtained in the above experiment. PDP is used to estimate the expected quality (i.e. confidence) of the best-so-far kNN upon interruption. Furthermore, it provides us a means to estimate the number of similarity assessments needed to reach a confidence threshold. A PDP is generated for each experiment case base.

Example:

  • Build the PDP of ALK for the SwedishLeaf_TEST case base generated for the above experiment
  • Use the insights data in the experiment's output file
  • For the discretization of the quality and calculation ranges:
    • Use q_step=0.05 and calc_step=0.025 (i.e. 1/40 of the max. number of assessments encountered for a test sequence during the experiment)
(alk) alk $ python -m alk.run.gen_pdp ~/Dev/alk/results/INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td.pk 0.025 0.05 --logc DEBUG --logf DEBUG

Plot PDP

PDP is essentially a 4-dimensional array. We plot 2D excerpts.

Example:

  • Plot PDP excerpt for kNN[2] and problem update=10
  • Plot quality range starting from 0.75
  • Save to PDF
(alk) alk $ python -m alk.run.fig ~/Dev/alk/pdps/PDP_INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td__cs_0.025_qs_0.05.pk -p p -f pdf --kwargs update=10 ki=2 to_latex=False decimals=3 start_q=.75

Export PDP as LaTeX

Example:

  • Export PDP excerpt for kNN[2] and problem update=10
(alk) alk $ python -m alk.run.fig ~/Dev/alk/pdps/PDP_INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td__cs_0.025_qs_0.05.pk -p p --kwargs update=10 ki=2 to_latex=True decimals=3 start_q=.75

Interruption Experiments

This script is used to conduct experiments to collect gain and confidence efficiency of ALK upon interruptions. First, test sequences are extracted from the case base that is generated for the given time series dataset. Then, for each problem update of each test sequence, the script interrupts ALK upon reaching a given confidence threshold. Afterwards, the algorithm is resumed and allowed to run until reaching the next threshold. After each interruption, the gain achieved by avoided similarity calculations, the quality of best-so-far kNN compared to the exact kNN, and, the efficiency of the confidence measure are recorded.

Every time series dataset in the repository [4] is available as a two-pack of train and test sub-datasets. Some test sub-datasets are larger than their train reciprocals. We opt to use the larger one for the insights experiments, and the smaller one for the interruption experiments.

Example:

  • Use SwedishLeaf_TRAIN.arff dataset, (note that we had used SwedishLeaf_TEST.arff at the insights step)
  • Generate a case base with the same time window & step settings used in the related insights experiment
  • Use 1% of the dataset for test sequences, and the rest as the case base
  • Use the PDP generated in the above step
  • Interrupt at confidence thresholds .98 .95 .92 .9 .85 .8 .75 .7 (in reverse order)
  • Set z factor in the efficiency measure to -1 for the standard deviation
(alk) alk $ python -m alk.run.run_intrpt ~/Dev/alk/datasets/SwedishLeaf_TRAIN.arff ~/Dev/alk/pdps/PDP_INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td__cs_0.025_qs_0.05.pk -t 0.01 -c .98 .95 .92 .9 .85 .8 .75 .7 -z -1 --logc INFO --logf DEBUG

Plot Efficiency of Confidence

Example:

  • Use the output of the above interruption experiment
(alk) alk $ python -m alk.run.fig "INT_SwedishLeaf_TRAIN_w_40_s_10_k_9_r_td_PDP_SwedishLeaf_TEST_c_0.025_q_0.05__ct_[0.98_0.95_0.92_0.9_0.85_0.8_0.75_0.7]_z_-1_t_0.01.pk" -p e -f png --dir ~/Dev/alk/results/ --kwargs maximized=False with_title=True signature=True outliers=False aspect=.5

To plot more than one experiment with the same dataset but different time window settings, simply provide their result files in the first positional argument.

Export Gain & Efficiency Table

This script generates the Average Gain % upon Interruption at Confidence Thresholds LaTeX table. The table summarizes the average gain upon exact and interrupted kNN search for multiple experiments. The average efficiency of confidence is also given for each experiment.

Before running this script, copy the output files of the interruption experiments that you want to report into a folder; in this example, ~/Desktop/results.

Example:

  • Read all interruption experiment output files at ~/Desktop/results
  • Take into account only the experiments that used z=-1 setting, silently ignore others
  • For each experiment
    • Export the average gain for the last kNN member (zero-based -> 8 for a k=9 setting)
    • Export the average efficiency and its average standard deviation
(alk) alk $ python -m alk.run.gain_intrpt_classify -p ~/Desktop/results -c 1. .98 .95 .92 .9 .85 .8 .75 .7 --knni 8 --z -1 --rtrim _TRAIN _TEST

Classification Experiments

This script is used to conduct experiments to collect gain, confidence efficiency and solution hit data of ALK Classifier upon interruptions at confidence thresholds and/or upon guaranteeing exact solution with best-so-far kNN. A solution hit upon interruption occurs when the solution suggested by best-so-far kNN is equal to the solution with exact kNN.

Example:

  • Use SwedishLeaf_TRAIN.arff dataset
  • Use the PDP generated above for the ~_TEST CB of the same dataset
  • Interrupt upon guaranteeing exact solution
  • Use plurality vote for classification
  • Also interrupt at confidence thresholds .98 .95 .92 .9 .85 .8 .75 .7 (in reverse order)
  • Set z factor in the efficiency measure to -1 for the standard deviation
(alk) alk $ python -m alk.run.run_classify ~/Dev/alk/datasets/SwedishLeaf_TRAIN.arff ~/Dev/alk/pdps/PDP_INS_SwedishLeaf_TEST_w_40_s_10_k_9_t_0.1_r_td__cs_0.025_qs_0.05.pk -t 0.01 -c .98 .95 .92 .9 .85 .8 .75 .7 -z -1 --reuse p --wsoln 1 --logc INFO --logf DEBUG

To use distance-weighted vote, pass 'w' value to the reuse argument.

Export Extended Gain & Efficiency Table

This script generates the Average Gain % upon Interruption at Confidence Thresholds and with Exact Solutions LaTeX table. It extends the above table with a column for the gain at interruption upon guaranteeing exact solution.

Example:

  • Export the LaTeX table for the gains in classification experiment results at ~/Desktop/results
  • For argument descriptions see
(alk) alk $ python -m alk.run.gain_intrpt_classify -p ~/Desktop/results -c 1. .98 .95 .92 .9 .85 .8 .75 .7 --z -1 --clsfy 1 --rtrim _TRAIN _TEST

Export Solution Hit Table

This script generates the Average Solution Hit %s LaTeX table.

Example:

  • Use the same classification experiment results above
  • Also generate a column for the hit % at interruptions with exact solution (where all column values have to be 100%)
(alk) alk $ python -m alk.run.hit_classify -p ~/Desktop/results -c .98 .95 .92 .9 .85 .8 .75 .7 --z -1 --wsoln 1 --rtrim _TRAIN _TEST

Alternative Rank Iterations

These are alternative searches for kNN candidates in the internal data structure RANK. The default iteration style is Top Down iteration of RANK's Stages.

Jumping

After evaluating every nth candidate, this iteration makes a momentary jump to the next Stage in RANK for candidacy assessment.

Example:

  • Jump after: [1, 2, 5, 10, 50]
(alk) alk $ python -m alk.run.run_jump "~/Dev/alk/datasets/SwedishLeaf_TEST.arff" -k 9 -t 0.01 --width 40 --step 10 --jumps 1 2 5 10 50 --logc INFO --logf DEBUG
Export Jumping Gain

Exports the Average Gain for TopDown vs Jumping Rank Iterations LaTeX table.

Example:

  • Use experiment results at ~/Desktop/results
(alk) alk $ python -m alk.run.gain_jump -p ~/Desktop/results --rtrim _TRAIN _TEST

Exploit Approaching Candidates

During the kNN search for a problem update Pu, if a candidate proves nearer to Pu than it was to a predecessor problem Pj (j < u), this iteration exploits the predecessor and successor cases of that candidate to check if they get nearer to Pu as well. A hash table is used to access the temporally related cases in RANK. The code is not yet optimized for the maintenance of the hash table in this prototype. Use it at the peril of long execution times.

Example:

(alk) alk $ python -m alk.run.run_exploit "~/Dev/alk/datasets/SwedishLeaf_TEST.arff" -k 9 -t 0.01 --width 40 --step 10 --logc INFO --logf DEBUG
Export Exploiting Gain

Exports the Average Gain for TopDown vs Exploit Candidates Rank Iterations LaTeX table.

Example:

  • Use experiment results at ~/Desktop/results
(alk) alk $ python -m alk.run.gain_exploit -p ~/Desktop/results --rtrim _TRAIN _TEST

Tools

Similarity Distribution

This script exports the similarity distribution in given case base(s) as a LaTeX table. The distribution is calculated by extracting a proportion of cases from the case base and computing their similarities to all remaining cases. This tool can be used to a) check if the similarity metric discriminates the cases well enough; b) check if kNN search is prone to the so-called curse of dimensionality; c) have a preliminary idea of the gain of ALK for a case base.

Example:

  • Use SwedishLeaf_TEST.arff dataset
  • Generate a case base with time window width=40 and time window step=10 settings
  • Use 1% of the dataset as test sequences to generate queries, and the rest as the case base
  • Distribute the similarity value densities as percentage in 10 bins
(alk) alk $ python -m alk.run.sim_distr ~/Dev/alk/datasets/SwedishLeaf_TEST.arff --width 40 --step 10 --bins 10 --testsize 0.01

To check the similarity distribution only for the uth update of test sequences, pass --update u as argument to the script (e.g. --update 3). To check for the last update only, pass --update -1.

Plot Cases

Plot the problem parts of the cases corresponding to the given update of the given/random time series sequences.

Example:

  • Use the case base generated above
  • Plot the 5th case for the 4th and 14th sequences (For two random sequences, pass seqs=2)
  • Plot full sequences on the background as well
  • Display the plot (don't save it)
(alk) alk $ python -m alk.run.fig ~/Dev/alk/datasets/SwedishLeaf_TRAIN.arff -p cb --kwargs seqs="[4, 14]" width=40 step=10 full=True upd_ind=5

Authors

License

This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.

Citation

To cite the Anytime Lazy kNN software, we recommend using the following biblatex entry with the DOI attached to a specific version (e.g. v2.3):

@software{mulayim_2021_4472642,
  author    = {M\"{u}l\^{a}yim, Mehmet O\u{g}uz and Arcos, Josep Llu\'{i}s},
  title     = {Anytime Lazy kNN (ALK)},
  month     = jan,
  year      = 2021,
  publisher = {Zenodo},
  version   = {v2.3},
  doi       = {10.5281/zenodo.4472642},
  url       = {https://doi.org/10.5281/zenodo.4472642}
}

If you want to cite the repository regardless of versions, you can use the concept DOI:

@software{mulayim_2020_4472641,
  author    = {M\"{u}l\^{a}yim, Mehmet O\u{g}uz and Arcos, Josep Llu\'{i}s},
  title     = {Anytime Lazy kNN (ALK)},
  month     = nov,
  year      = 2020,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.4472641},
  url       = {https://doi.org/10.5281/zenodo.4472641}
}

Kindly consider citing our software together with the thesis:

@phdthesis{mulayim_2020_thesis,
  author  = {M\"{u}l\^{a}yim, Mehmet O\u{g}uz},
  title   = {{Anytime Case-Based Reasoning in Large-Scale Temporal Case Bases}},
  school  = {Universitat Aut\`{o}noma de Barcelona},
  address = {Barcelona, Spain},
  month   = nov,
  year    = 2020,
  url     = {http://hdl.handle.net/10803/671283}
}

References

[1] M.O. Mülâyim, J.L. Arcos (2018), Perks of Being Lazy: Boosting Retrieval Performance, in: M.T. Cox, P. Funk, S. Begum (Eds.), International Conference on Case-Based Reasoning (ICCBR'18), Springer Verlag: pp. 309–322

[2] M.O. Mülâyim, J.L. Arcos (2020), Fast Anytime Retrieval with Confidence in Large-Scale Temporal Case Bases, Knowledge-Based Systems, 206, 106374

[3] M.O. Mülâyim (2020), Anytime Case-Based Reasoning in Large-Scale Temporal Case Bases, PhD Thesis, Universitat Autònoma de Barcelona

[4] A. Bagnall, J. Lines, W. Vickers, E. Keogh, The UEA & UCR Time Series Classification Repository (Last accessed 20 January 2020)

About

Anytime Lazy kNN: A fast anytime kNN search algorithm that assesses only true kNN candidates in a lazy fashion.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages