- Source code author: Jianbin Qin
- Version: 0.1
- Contact: jqin@inf.ed.ac.uk
- More information: http://qinjianbin.com/
This work pionerred the trend of using Q-Chunk, fixed length non-overlapping gram, as the signatures (features) generator to perform efficient string similarity query processing. The fixed length non-overlapping scheme has been proven effective by many researchers. There are two versions of QChunk papers and six versions of algorithms:
- In SIGMOD version, we only studied the basis IndexChunk and IndexGram.
- In the Journal version, we further improve this IndexChunk and IndexGram to Turbo and Super charged version by constraint the QChunk selection space.
- Super charged IndexGram is the default version in this source code. Please compare this version, i.e. IndexGramSuper
- Title: Asymmetric signature schemes for efficient exact edit similarity query processing
- Authors: Asymmetric signature schemes for efficient exact edit similarity query processing
- Published in TODS, 2013
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures.
In this article, we show that for any such signature scheme, the lower bound of the minimum number of signatures is t + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, that is, extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms.
- Title: Efficient exact edit similarity query processing with the asymmetric signature scheme
- Authors: Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin
- Published in SIGMOD, 2011
“Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing method answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities.
In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study in- volving nine state-of-the-art algorithms. The experiment results clearly demonstrate the efficiency of our methods.”
@inproceedings{DBLP:conf/sigmod/QinWLXL11,
author = {Jianbin Qin and
Wei Wang and
Yifei Lu and
Chuan Xiao and
Xuemin Lin},
title = {Efficient exact edit similarity query processing with the asymmetric
signature scheme},
booktitle = {Proceedings of the {ACM} {SIGMOD} International Conference on Management
of Data, {SIGMOD} 2011, Athens, Greece, June 12-16, 2011},
pages = {1033--1044},
year = {2011},
url = {http://doi.acm.org/10.1145/1989323.1989431},
doi = {10.1145/1989323.1989431},
timestamp = {Mon, 14 May 2012 08:45:16 +0200},
biburl = {https://dblp.org/rec/bib/conf/sigmod/QinWLXL11},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
- Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.
- Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.
- Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.
- Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.
- code, say
$ git clone https://github.com/qinbill/Qchunk.git
$ cd Qchunk/src/
$ make
Program Name Description
- preproc Preprocess the text file and output a index.
- search Perform the search queries.
The preprocessing dose three things:
- Sort the input data.
- Generate qchunk/qgram set.
- Calculate statistics.
Usage:
$ cat <textfile> | ./preproc -o <output_prefix> -q <qgram size>
The process will genearte a set of files that named with <output_prefix>
The query processing part takes input text from standard input. Usage:
$ cat <query_file> | ./search -i <input_prefix> -t <threshold> -c/G/C/g/b
$ ./search -h
Need input file name
usage: -t <Max Edit Distance> :Edit distance threshold for index building.>
-i <input file name> :input binary file prefix
-G Use Algoritm indexGramSuper
-C Use Algoritm indexChunkSuper
-g Use Algoritm indexGramTurbo (Default)
-c Use Algoritm indexChunkTurbo
-b Use Algoritm indexChunkTurbine
# Q: 8 // Length of Q
# Tau: 5 // The threshold.
# DataDucNum: 2000000 // Number of data.
# DataUnderflow: 0 // Number of data dumped because length less that Q*tau+1
# IndexedTokenNum: 284771740 // Unique indexed token number
# SkipListNum: 0 // Use skip list size.
# QueryNum: 1000 // Number of queries.
# UderflowQuery: 0 // Number of queries is too short.
# CandOneNum: 2710686 // Number of Candidate one. Just past prefix filtering.
# FinalResultsNum: 3618 // Number of results.
# IndexTotalTime: 17.862824 // Time used in indexing.
# SearchTotalTime: 2.907339
Last Modified: <2018-03-26 Mon> by Jianbin Qin