Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support fulltext index #3915

Closed
wubx opened this issue Jan 20, 2022 Discussed in #3899 · 24 comments
Closed

Support fulltext index #3915

wubx opened this issue Jan 20, 2022 Discussed in #3899 · 24 comments
Assignees
Labels
A-query Area: databend query C-feature Category: feature community-take

Comments

@wubx
Copy link
Member

wubx commented Jan 20, 2022

Discussed in #3899

Originally posted by wubx January 19, 2022
Why we need fulltest index :

  1. In APM a lot of query like : where url like "%/user/%"
  2. The Game system of the service a lot query like : where action like "%win%"
  3. Sample fulltext search replace Es

Clickhouse support fulltext index use :

ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Stores a Bloom filter that contains all ngrams from a block of data. Works only with datatypes: String, FixedString and Map. Can be used for optimization of EQUALS, LIKE and IN expressions.

n — ngram size,
size_of_bloom_filter_in_bytes — Bloom filter size in bytes (you can use large values here, for example, 256 or 512, because it can be compressed well).
number_of_hash_functions — The number of hash functions used in the Bloom filter.
random_seed — The seed for Bloom filter hash functions.

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

The same as ngrambf_v1, but stores tokens instead of ngrams. Tokens are sequences separated by non-alphanumeric characters.
@BohuTANG BohuTANG added C-feature Category: feature A-query Area: databend query labels Jan 20, 2022
@BohuTANG BohuTANG changed the title support fulltext index Support fulltext index Jan 20, 2022
@BohuTANG BohuTANG mentioned this issue Jan 20, 2022
4 tasks
@zbtzbtzbt zbtzbtzbt mentioned this issue Feb 15, 2022
4 tasks
@b41sh
Copy link
Member

b41sh commented Feb 17, 2022

/assignme

@BohuTANG
Copy link
Member

BohuTANG commented Jun 1, 2022

Seems we can use tanivy, and store the full text in our fuse engine?
https://github.com/quickwit-oss/tantivy

Example:
https://github.com/quickwit-oss/tantivy/tree/main/examples

cc @b41sh

@b41sh
Copy link
Member

b41sh commented Jun 1, 2022

Seems we can use tanivy, and store the full text in our fuse engine? https://github.com/quickwit-oss/tantivy

Example: https://github.com/quickwit-oss/tantivy/tree/main/examples

cc @b41sh

It's a good idea, we can try it.

@sunisdown
Copy link
Contributor

@b41sh @BohuTANG Any plans for development?

@wubx
Copy link
Member Author

wubx commented Jun 8, 2022

@b41sh
Copy link
Member

b41sh commented Jun 8, 2022

@b41sh @BohuTANG Any plans for development?

I will start working on this issue this week.

@BohuTANG
Copy link
Member

Also cc @zhang2014

@BohuTANG
Copy link
Member

BohuTANG commented Jun 11, 2022

Benchmark

Databend

Table:

create table liket(c STRING);
insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824  rows
insert into liket select * from liket;
select count(*) from liket where c like '%kd%';
+------------+
| count()    |
+------------+
| 1073741824 |
+------------+
1 row in set (22.50 sec)
Read 1073741824 rows, 38.00 GiB in 22.392 sec., 47.95 million rows/sec., 1.70 GiB/sec.

ClickHouse with ngram index

Table:

CREATE TABLE liket
(
    `c` String,
    INDEX fulltext_index c TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY c;

insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824  rows
insert into liket select * from liket;
thinkpad :) select count(*) from liket where c like '%kd%';

SELECT count(*)
FROM liket
WHERE c LIKE '%kd%'

Query id: dce045a0-175b-4b6d-9751-0702fb3cb115

┌────count()─┐
│ 1073741824 │
└────────────┘

1 rows in set. Elapsed: 4.904 sec. Processed 1.07 billion rows, 41.88 GB (218.97 million rows/s., 8.54 GB/s.)

cc @sundy-li @b41sh

@sundy-li
Copy link
Member

sundy-li commented Jun 11, 2022

How about this sql :
select ignore( c) from liket;

I think the bottleneck is mostly on IO read.

@BohuTANG
Copy link
Member

@sundy-li

select ignore(c) from liket where c like '%kd%';

or

select ignore( c) from liket;

have a hang there :/

This test is easy to run on your local, I think there are some issues with the query.

@sundy-li
Copy link
Member

sundy-li commented Jun 11, 2022

Test on my laptap, the bottleneck is mostly on IO read & regexp match I think.
ClickHouse:

CREATE TABLE liket_noindex (c String) ENGINE = MergeTree ORDER BY c;


insert into liket  select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648);  -- 30.81 MB/s
insert into liket_noindex  select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648); -- 245.13 MB/s


-- Actually the full index did not help this query, because it can't prune any block.
select count(*) from liket where c like '%kd%';  -- 0.846 sec
select count(*) from liket_noindex where c like '%kd%'; -- 1.053 sec 


select count(ignore(c)) from liket; -- 0.434 sec
select count(ignore(c)) from liket_noindex; -- 0.552 sec

@sundy-li
Copy link
Member

sundy-li commented Jun 11, 2022

In memory bench.

databend :) select sum( to_string(number) like '98%') from numbers(5000000000);
+-------------------------------------+
| sum((to_string(number) like '98%')) |
+-------------------------------------+
|                            11111111 |
+-------------------------------------+
1 row in set (6.52 sec)
Read 5000000000 rows, 37.25 GiB in 6.517 sec., 767.27 million rows/sec., 5.72 GiB/sec.

databend :) select sum( to_string(number) like '%98%') from numbers(5000000000);

+--------------------------------------+
| sum((to_string(number) like '%98%')) |
+--------------------------------------+
|                            389599750 |
+--------------------------------------+
1 row in set (17.32 sec)
Read 5000000000 rows, 37.25 GiB in 17.314 sec., 288.78 million rows/sec., 2.15 GiB/sec.
clickhouse :) select sum( toString(number) like '98%') from numbers_mt(5000000000);

SELECT sum(toString(number) LIKE '98%')
FROM numbers_mt(5000000000)

Query id: de75caf9-57b2-4e37-b258-489b76250241

┌─sum(like(toString(number), '98%'))─┐
│                           11111111 │
└────────────────────────────────────┘

1 rows in set. Elapsed: 7.602 sec. Processed 5.00 billion rows, 40.00 GB (657.74 million rows/s., 5.26 GB/s.)

clickhouse :)  select sum( toString(number) like '%98%') from numbers_mt(5000000000);

┌─sum(like(toString(number), '%98%'))─┐
│                           389599750 │
└─────────────────────────────────────┘

1 rows in set. Elapsed: 6.529 sec. Processed 5.00 billion rows, 40.00 GB (765.77 million rows/s., 6.13 GB/s.)

@BohuTANG
Copy link
Member

Good.
Does the like function in databend have some improvement room?

@sundy-li
Copy link
Member

Yes, need profile investigation.

@sunisdown
Copy link
Contributor

Do we have RFC for the fulltext index? tanivy is a good choice and we can discuss about the design.

@b41sh
Copy link
Member

b41sh commented Jun 14, 2022

Do we have RFC for the fulltext index? tanivy is a good choice and we can discuss about the design.

tantivy only supports fs storage of data, we cannot use it directly.
Maybe we need to implement an inverted index ourselves to support full-text search.

@sundy-li
Copy link
Member

Do we have RFC for the fulltext index?

A simple RFC in Chinese.

Databend 全文索引.pdf

@sunisdown
Copy link
Contributor

Do we have RFC for the fulltext index?

A simple RFC in Chinese.

Databend 全文索引.pdf

Prefer the second solution, the first solution looks like another quickwit.

@ariesdevil
Copy link
Contributor

ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

@sunisdown
Copy link
Contributor

The implementation is interesting in that it uses an ID to join the data from the Clickhouse query to the data from the Quickwit query. I think it can be used with a small amount of data.

ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

@BohuTANG BohuTANG mentioned this issue Jan 21, 2023
9 tasks
@BohuTANG
Copy link
Member

Tracking at #9811

@ariesdevil
Copy link
Contributor

ClickHouse uses quickwit implements full-text search. clickhouse.com/docs/en/guides/developer/full-text-search

ClickHouse delete this content and redirect this link to the new full-text-indexing feature, lol.

@BohuTANG
Copy link
Member

qdrant supports both fulltext index and vector index:
https://qdrant.tech/documentation/indexing/

@BohuTANG
Copy link
Member

Move to #14505

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-query Area: databend query C-feature Category: feature community-take
Projects
None yet
Development

No branches or pull requests

7 participants