[BREAKING] Language sorting on Indexed data. #4316
Conversation
tok/tok.go (Outdated)
@@ -303,6 +307,34 @@ func (t ExactTokenizer) Identifier() byte { return IdentExact }
func (t ExactTokenizer) IsSortable() bool { return true }
func (t ExactTokenizer) IsLossy() bool { return false }

// LangTokenizer returns the exact string along with language prefix as a token.
type LangTokenizer struct {
	lang string
File is not gofmt-ed with -s (from gofmt)
Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @arijitAD, @balajijinnah, @manishrjain, and @pawanrawal)
tok/tok.go, line 308 at r1 (raw file):
func (t ExactTokenizer) Identifier() byte { return IdentExact }
func (t ExactTokenizer) IsSortable() bool { return true }
func (t ExactTokenizer) IsLossy() bool { return false }
is ExactTokenizer used anywhere?
tok/tokens.go, line 40 at r1 (raw file):
	langTag, _ = language.Parse("en")
}
return LangTokenizer{lang: lang, cl: collate.New(langTag), buffer: &collate.Buffer{}}
This looks a little off to me: ExactTokenizer returning LangTokenizer.
Maybe just change ExactTokenizer if it's not used anywhere.
worker/sort.go, line 231 at r1 (raw file):
lang := order.Langs[0]
tokenizer = tok.GetLangTokenizer(tokenizer, lang)
prefix = []byte{tokenizer.Identifier()}
Could you make changes so that the identifier itself gives the proper prefix?
Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)
tok/tok.go, line 308 at r1 (raw file):
Previously, balajijinnah (balaji) wrote…
is ExactTokenizer used anywhere?
Yes. It is used when we don't use the @lang directive. It could happen that a predicate doesn't have the @lang directive initially and it is added later. LangTokenizer is used when we want exact indexing along with a language.
tok/tok.go, line 312 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
File is not gofmt-ed with -s (from gofmt)
Done.
tok/tokens.go, line 40 at r1 (raw file):
Previously, balajijinnah (balaji) wrote…
This looks a little off to me: ExactTokenizer returning LangTokenizer.
Maybe just change ExactTokenizer if it's not used anywhere.
GetLangTokenizer is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.
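For readers following along, the key building block here is golang.org/x/text/collate: it turns a string into a binary collation key whose plain bytewise order matches the language's sort order, so the key itself can be stored as an index token. A small, self-contained sketch of just that idea (not Dgraph code; the language tag and terms are illustrative):

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	// Parse the language tag, falling back to English on failure, mirroring
	// the fallback in the quoted tok/tokens.go snippet.
	langTag, err := language.Parse("de")
	if err != nil {
		langTag, _ = language.Parse("en")
	}

	cl := collate.New(langTag)
	buf := &collate.Buffer{}

	// KeyFromString returns a byte slice whose bytewise ordering matches the
	// collation (dictionary) order for the chosen language. Storing these keys
	// as index tokens is what lets sorted iteration over the index come back
	// in language order.
	for _, term := range []string{"Zebra", "äpfel", "apple"} {
		fmt.Printf("%q -> %x\n", term, cl.KeyFromString(buf, term))
	}
}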
worker/sort.go, line 231 at r1 (raw file):
Previously, balajijinnah (balaji) wrote…
Could you make changes so that the identifier itself gives the proper prefix?
We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.
Looks ok, though we need to see how other indexes like term/fulltext/hash interact with the @lang tag and whether they work as expected for sorting by lang.
Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, and @manishrjain)
query/common_test.go, line 239 at r2 (raw file):
name_lang : string @lang .
lang_type : string @index(exact) .
name_lang_index : string @index(exact) @lang .
Fix the indentation, please. Also, can't you use name above? It has both the @lang directive and an exact index.
query/common_test.go, line 242 at r2 (raw file):
alt_name : [string] @index(term, exact, trigram) @count .
alias : string @index(exact, term, fulltext) .
alias_lang : string @index(exact) @lang .
Fix indentation please.
query/query1_test.go, line 207 at r2 (raw file):
{
	me(func: uid(0x01)) {
		friend(first: 2, orderdesc: alias_lang@en) {
What does sorting by alias@en return now since there are no values tagged with the en language? I guess it should return an empty result. Can you please add a case for that?
query/query2_test.go, line 936 at r2 (raw file):
}

func TestLanguageOrderIndexed2(t *testing.T) {
Add a third case please which tries to sort only by name_lang_index; that should return an empty result? Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set; that should also return an empty list. Also add some test cases with pagination and offset.
query/query4_test.go, line 460 at r2 (raw file):
"q": [ { "name": "name"
Do you know why is lower case before upper?
tok/tokens.go, line 35 at r2 (raw file):
	// with a different lang.
	return FullTextTokenizer{lang: lang}
case ExactTokenizer:
What happens when users use a hash or term index (instead of exact) with the @lang directive? Does sorting work as expected then?
tok/tokens.go, line 36 at r2 (raw file):
	return FullTextTokenizer{lang: lang}
case ExactTokenizer:
	langTag, err := language.Parse(lang)
When would this return an error, and why do we use en as the default? Please add a comment explaining why we do this.
worker/sort.go, line 227 at r2 (raw file):
var prefix []byte
if len(order.Langs) > 0 {
Could you please make changes to the gql/parser to return an error for queries that use more than one lang? Then we can change order.Langs to order.Lang, which can just be a string.
worker/sort.go, line 228 at r2 (raw file):
var prefix []byte
if len(order.Langs) > 0 {
	// Only one languge is allowed.
language
Got a few comments. Get @pawanrawal and others to approve it.
Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)
tok/tok.go, line 320 at r2 (raw file):
func (t LangTokenizer) Type() string { return "string" }
func (t LangTokenizer) Tokens(v interface{}) ([]string, error) {
	if term, ok := v.(string); ok {
if , ok := ; !ok {
	return error
}
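That is, invert the check and return early for non-string values. A rough sketch of the shape being asked for, assuming the surrounding tok package types and github.com/pkg/errors (the error message and token layout are illustrative, pieced together from the snippets quoted in this review, not the actual PR code):

func (t LangTokenizer) Tokens(v interface{}) ([]string, error) {
	term, ok := v.(string)
	if !ok {
		return nil, errors.Errorf("lang tokenizer only supports string types, got %T", v)
	}
	// Illustrative layout: language prefix, a delimiter byte, then the
	// collation key, so all tokens for one language share a common prefix
	// and iterate in that language's sort order.
	encodedTerm := t.cl.KeyFromString(t.buffer, term)
	token := make([]byte, 0, len(t.lang)+1+len(encodedTerm))
	token = append(token, []byte(t.lang)...)
	token = append(token, IdentDelimiter)
	token = append(token, encodedTerm...)
	return []string{string(token)}, nil
}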
tok/tokens.go, line 38 at r2 (raw file):
langTag, err := language.Parse(lang)
if err != nil {
	langTag, _ = language.Parse("en")
enLangTag should be a variable that you compute upfront and just use?
if err != nil {
	langTag = enLangTag
}
worker/sort.go, line 231 at r1 (raw file):
Previously, arijitAD (Arijit Das) wrote…
We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.
But, if you know this is the lang tokenizer, you could type-cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.
lt, ok := tokenizer.(*LangTokenizer)
if !ok {
// do something here
}
lt.Prefix()
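A hypothetical Prefix() to pair with that, assuming the surrounding tok package types (the exact byte layout is an assumption; only Identifier() and IdentDelimiter appear in the reviewed code):

// Prefix returns the index key prefix covering every token this tokenizer
// emits for its language, so worker/sort.go can seek to the right range
// without assembling identifier and language bytes itself.
func (t LangTokenizer) Prefix() []byte {
	prefix := []byte{t.Identifier()}
	prefix = append(prefix, []byte(t.lang)...)
	prefix = append(prefix, IdentDelimiter)
	return prefix
}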
Don't merge for now. We'd need a script to do the migration of data, if a user has a language directive.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)
Reviewable status: 5 of 11 files reviewed, 16 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)
query/common_test.go, line 239 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Fix the indentation, please. Also, can't you use name above? It has both the @lang directive and an exact index.
It can be used, but name already has a lot of data indexed, and I want to compare two specific strings whose order differs between languages.
query/common_test.go, line 242 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Fix indentation please.
Done.
query/query1_test.go, line 207 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
What does sorting by alias@en return now since there are no values tagged with the en language? I guess it should return an empty result. Can you please add a case for that?
TestToFastJSONOrderName() already does that. It doesn't return empty, as data is added to alias as well.
query/query2_test.go, line 936 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Add a third case please which tries to sort only by name_lang_index; that should return an empty result? Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set; that should also return an empty list. Also add some test cases with pagination and offset.
Done.
query/query4_test.go, line 460 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Do you know why lowercase comes before uppercase?
Because it is using the bytewise comparator, and ASCII upper case has a lower value than lower case.
tok/tok.go, line 320 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if , ok := ; !ok {
	return error
}
Done.
tok/tokens.go, line 35 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
What happens when users use a hash or term index (instead of exact) with the @lang directive? Does sorting work as expected then?
We support @lang only for sortable tokenizers. Hash and term indexes are not sortable.
tok/tokens.go, line 36 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
When would this return an error, and why do we use en as the default? Please add a comment explaining why we do this.
Done.
tok/tokens.go, line 38 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
enLangTag should be a variable that you compute upfront and just use?
if err != nil {
	langTag = enLangTag
}
Done.
worker/sort.go, line 231 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
But, if you know this is the lang tokenizer, you could type-cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.
lt, ok := tokenizer.(*LangTokenizer)
if !ok {
	// do something here
}
lt.Prefix()
Done.
worker/sort.go, line 227 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Could you please make changes to the gql/parser to return an error for queries that use more than one lang? Then we can change order.Langs to order.Lang, which can just be a string.
I am already returning an error if there are multiple languages. But changing order.Langs to order.Lang might affect multiple files. I can do that as part of another PR.
worker/sort.go, line 228 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
language
Done.
worker/sort.go, line 230 at r2 (raw file):
prefix = []byte{tokenizer.Identifier()}
Force-pushed from 3cd489a to 4c31aba.
Once we have the migration story figured out, we can release it.
Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)
worker/sort.go, line 234 at r3 (raw file):
if !ok {
	return resultWithError(errors.Errorf(
		"Invalid tokenizer for language %s.", lang))
Please add the name of the tokenizer to the error as well.
Force-pushed from 04bc23c to 0f4c344.
Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)
worker/sort.go, line 234 at r3 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Please add the name of the tokenizer to the error as well
Done.
Force-pushed from 0f4c344 to f64811f.
Getting there.
Reviewable status: 3 of 13 files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)
tok/tok.go, line 317 at r5 (raw file):
term := make([]byte, 0, len(t.langBase)+4+len(encodedTerm))
term = append(term, IdentDelimiter)
You can keep the same as before.
tok/tok.go, line 326 at r5 (raw file):
}

func (t ExactTokenizer) Identifier() byte { return IdentExact }
This can be dealt with here.
Having just one tokenizer would be great.
worker/groups.go, line 1046 at r5 (raw file):
// SubscribeForUpdates will listen for updates for the given group.
func SubscribeForUpdates(prefixes [][]byte, cb func(kvs *badgerpb.KVList), group uint32) {
	return
HACK?
worker/sort.go, line 281 at r5 (raw file):
token := k.Term
if predicateHasLang && !queryHasLang && len(token) > 1 && token[1] == tok.IdentDelimiter {
I wouldn't base anything on its presence alone, because some term could also have the same character at the same place. But it generally should help.
Force-pushed from 8df547e to 7023ed9.
Force-pushed from e7a1aad to c80307a.
Reviewable status: 3 of 12 files reviewed, 11 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)
tok/tok.go, line 317 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
You can keep the same as before.
Done.
tok/tok.go, line 326 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
This can be dealt with here.
Having just one tokenizer would be great.
Done.
worker/groups.go, line 1046 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
HACK?
Done.
worker/sort.go, line 281 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
I wouldn't base anything on its presence alone, because some term could also have the same character at the same place. But it generally should help.
Done.
Force-pushed from c80307a to 38601e0.
Reviewed 2 of 5 files at r4, 2 of 6 files at r8, 3 of 7 files at r10, 3 of 3 files at r11.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)
posting/index.go, line 452 at r11 (raw file):
// We just need the tokenizer identifier for ExactTokenizer having language.
// It will be same for all the language.
tokenizer = tok.GetLangTokenizer(tokenizer, "en")
GetTokenizerForLang(...)
tok/tokens.go, line 45 at r11 (raw file):
	langTag = enLangTag
}
return ExactTokenizer{langBase: LangBase(lang), cl: collate.New(langTag),
Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.
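For reference, a sketch of the sync.Pool variant being suggested here, in case the per-tokenizer collate.Buffer does turn out to be expensive (the helper name is illustrative, not part of the PR; assumes imports of sync and golang.org/x/text/collate):

// Pool of collation buffers shared across tokenizations, instead of keeping
// one collate.Buffer per tokenizer instance.
var collateBufPool = sync.Pool{
	New: func() interface{} { return &collate.Buffer{} },
}

func collateKey(cl *collate.Collator, s string) []byte {
	buf := collateBufPool.Get().(*collate.Buffer)
	// KeyFromString returns bytes backed by the buffer, so copy them out
	// before resetting the buffer and putting it back in the pool.
	key := append([]byte(nil), cl.KeyFromString(buf, s)...)
	buf.Reset()
	collateBufPool.Put(buf)
	return key
}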
Force-pushed from 38601e0 to 72393ce.
Force-pushed from 72393ce to 5b4e023.
Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)
posting/index.go, line 452 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
GetTokenizerForLang(...)
Done.
tok/tokens.go, line 40 at r1 (raw file):
Previously, arijitAD (Arijit Das) wrote…
GetLangTokenizer is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.
Done.
tok/tokens.go, line 45 at r11 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.
Done.
This fixes the sorting of an indexed field that has the lang directive.
Fixes #4005
This PR changes the data format for predicates that have the lang directive and an exact index.
Users upgrading from an older version of Dgraph must rebuild such indexes.
Example:
name: string @index(exact) @lang .
Since the name predicate has the lang directive and an exact index, its index needs to be deleted and recreated.
Alter the schema as
name: string @lang .
This will delete the old index data. Then alter the schema back to
name: string @index(exact) @lang .
This will recreate the index with the new data format.
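A minimal sketch of that migration using the dgo client (the dgo import paths, the gRPC address, and the error handling are assumptions for illustration; the same two alter steps can also be issued through Ratel or the HTTP /alter endpoint):

package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	ctx := context.Background()

	// Step 1: drop the exact index but keep @lang; this deletes the old
	// index data while leaving the predicate's values intact.
	if err := dg.Alter(ctx, &api.Operation{Schema: `name: string @lang .`}); err != nil {
		log.Fatal(err)
	}

	// Step 2: add the exact index back; the index is rebuilt in the new,
	// language-aware token format.
	if err := dg.Alter(ctx, &api.Operation{Schema: `name: string @index(exact) @lang .`}); err != nil {
		log.Fatal(err)
	}
}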