[BREAKING] Language sorting on Indexed data. #4316

arijitAD · 2019-11-25T13:13:12Z

This will fix the sorting of an indexed field having lang directive.
Fixes #4005
This PR change the data format for predicate that has lang directive and exact index.
Users upgrading from older version of dgraph must rebuild the index.
Example:
name: string @index(exact) @lang .
Since name predicate has lang directive and exact index. This needs to be deleted and recreated again.
Alter the schema as
name: string @lang .
This will delete the old index data and the alter the schema as
name: string @index(exact) @lang .
This will recreate the index with new data format.

This change is

golangcibot · 2019-11-25T13:15:49Z

tok/tok.go

@@ -303,6 +307,34 @@ func (t ExactTokenizer) Identifier() byte { return IdentExact }
 func (t ExactTokenizer) IsSortable() bool { return true }
 func (t ExactTokenizer) IsLossy() bool    { return false }

+// LangTokenizer returns the exact string along with language prefix as a token.
+type LangTokenizer struct {
+	lang string


File is not gofmt-ed with -s (from gofmt)

poonai

Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @arijitAD, @balajijinnah, @manishrjain, and @pawanrawal)

tok/tok.go, line 308 at r1 (raw file):

func (t ExactTokenizer) Identifier() byte { return IdentExact }
func (t ExactTokenizer) IsSortable() bool { return true }
func (t ExactTokenizer) IsLossy() bool    { return false }

is ExactTokenizer used anywhere?

tok/tokens.go, line 40 at r1 (raw file):

			langTag, _ = language.Parse("en")
		}
		return LangTokenizer{lang: lang, cl: collate.New(langTag), buffer: &collate.Buffer{}}

This look little off to me. ExactTokenizer returning LangTokenizer.
May be just change ExactTokenizer if it not used anywhere

worker/sort.go, line 231 at r1 (raw file):

		lang := order.Langs[0]
		tokenizer = tok.GetLangTokenizer(tokenizer, lang)
		prefix = []byte{tokenizer.Identifier()}

Could you make changes So, that identifier it self gives the proper prefix

arijitAD

Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)

tok/tok.go, line 308 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

is ExactTokenizer used anywhere?

Yes. It is used when we don't use lang directive. It could happen that a predicate doesn't have @lang directive initially and it was added later. Langtokenizer is used when we want exact indexing along with language

tok/tok.go, line 312 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

File is not gofmt-ed with -s (from gofmt)

Done.

tok/tokens.go, line 40 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

This look little off to me. ExactTokenizer returning LangTokenizer.
May be just change ExactTokenizer if it not used anywhere

GetLangTokenzier is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.

worker/sort.go, line 231 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

Could you make changes So, that identifier it self gives the proper prefix

We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.

pawanrawal

Looks ok though we need to see how to do other indexes like term/fulltext/hash interact with @lang tag and if they work as expected for sorting by lang.

Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, and @manishrjain)

query/common_test.go, line 239 at r2 (raw file):

name_lang					   : string @lang .
lang_type                      : string @index(exact) .
name_lang_index				   : string @index(exact) @lang .

Fix indentation please. Also can't you use name above? It has both @lang directive and exact index.

query/common_test.go, line 242 at r2 (raw file):

alt_name                       : [string] @index(term, exact, trigram) @count .
alias                          : string @index(exact, term, fulltext) .
alias_lang					   : string @index(exact) @lang .

Fix indentation please.

query/query1_test.go, line 207 at r2 (raw file):

		{
			me(func: uid(0x01)) {
				friend(first: 2, orderdesc: alias_lang@en) {

What does sorting by alias@en return now since there are no values tagged with en language? I guess it should return empty result. Can you please add a case for that?

query/query2_test.go, line 936 at r2 (raw file):

}

func TestLanguageOrderIndexed2(t *testing.T) {

Add a third case please which tries to sort by only name_lang_index, that should return an empty result?

Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set. That should also return empty list.

Also add some test cases with pagination and offset.

query/query4_test.go, line 460 at r2 (raw file):

			"q": [
			  {
				"name": "name"

Do you know why is lower case before upper?

tok/tokens.go, line 35 at r2 (raw file):

		// with a different lang.
		return FullTextTokenizer{lang: lang}
	case ExactTokenizer:

What happens when users use hash or term index (instead of exact) with @lang directive? Does sorting work as expected then?

tok/tokens.go, line 36 at r2 (raw file):

		return FullTextTokenizer{lang: lang}
	case ExactTokenizer:
		langTag, err := language.Parse(lang)

When would this return an error and why do we use en as a default. Please add a comment explaining why we do this.

worker/sort.go, line 227 at r2 (raw file):

	var prefix []byte
	if len(order.Langs) > 0 {

Could you please make changes to the gql/parser to return error for queries with use more than one lang and then we can change order.Langs to order.Lang which can just be a string.

worker/sort.go, line 228 at r2 (raw file):

	var prefix []byte
	if len(order.Langs) > 0 {
		// Only one languge is allowed.

language

manishrjain

Got a few comments. Get @pawanrawal and others to approve it.

Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)

tok/tok.go, line 320 at r2 (raw file):

func (t LangTokenizer) Type() string { return "string" }
func (t LangTokenizer) Tokens(v interface{}) ([]string, error) {
	if term, ok := v.(string); ok {

if , ok : = ; !ok {
  return error
}

tok/tokens.go, line 38 at r2 (raw file):

		langTag, err := language.Parse(lang)
		if err != nil {
			langTag, _ = language.Parse("en")

enLangTag should be variable that you compute upfront and just use?

if err != nil { langTag = enLangTag }

worker/sort.go, line 231 at r1 (raw file):

Previously, arijitAD (Arijit Das) wrote…

We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.

But, if you know this is lang tokenizer, you could type cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.

lt, ok := tokenizer.(*LangTokenizer)
if !ok { 
  // do something here
}
lt.Prefix()

manishrjain

Don't merge for now. We'd need a script to do the migration of data, if a user has a language directive.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)

arijitAD

Reviewable status: 5 of 11 files reviewed, 16 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)

query/common_test.go, line 239 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Fix indentation please. Also can't you use name above? It has both @lang directive and exact index.

It can be used but the name already has lot of data indexed and I want to compare two specific string whose order different for different languages.

query/common_test.go, line 242 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Fix indentation please.

Done.

query/query1_test.go, line 207 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

What does sorting by alias@en return now since there are no values tagged with en language? I guess it should return empty result. Can you please add a case for that?

TestToFastJSONOrderName() this test already doesn't that. It doesn't return empty as data is added to alias as well.

query/query2_test.go, line 936 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Add a third case please which tries to sort by only name_lang_index, that should return an empty result?

Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set. That should also return empty list.

Also add some test cases with pagination and offset.

Done.

query/query4_test.go, line 460 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Do you know why is lower case before upper?

Because it is using the bytewise comparator and ASCII upper case has lower value than the upper case.

tok/tok.go, line 320 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

if , ok : = ; !ok {
  return error
}

Done.

tok/tokens.go, line 35 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

What happens when users use hash or term index (instead of exact) with @lang directive? Does sorting work as expected then?

We support @lang for only sortable tokenizers. Hash and term index are not sortable.

tok/tokens.go, line 36 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

When would this return an error and why do we use en as a default. Please add a comment explaining why we do this.

Done.

tok/tokens.go, line 38 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

enLangTag should be variable that you compute upfront and just use?

if err != nil { langTag = enLangTag }

Done.

worker/sort.go, line 231 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

But, if you know this is lang tokenizer, you could type cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.
lt, ok := tokenizer.(*LangTokenizer)
if !ok { 
  // do something here
}
lt.Prefix()

Done.

worker/sort.go, line 227 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Could you please make changes to the gql/parser to return error for queries with use more than one lang and then we can change order.Langs to order.Lang which can just be a string.

I am already returning an error if there are multiple languages. But changing order.Langs to order.Lang might affect multiple files. I can do that as apart of another PR.

worker/sort.go, line 228 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

language

Done.

worker/sort.go, line 230 at r2 (raw file):

efix = []byte{tokenizer.Identifier()

pawanrawal

once we have the migration story figured out, we can release it.

Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)

worker/sort.go, line 234 at r3 (raw file):

		if !ok {
			return resultWithError(errors.Errorf(
				"Invalid tokenizer for language %s.", lang))

Please add the name of the tokenizer to the error as well

arijitAD

Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)

worker/sort.go, line 234 at r3 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Please add the name of the tokenizer to the error as well

Done.

manishrjain

Getting there.

Reviewable status: 3 of 13 files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)

tok/tok.go, line 317 at r5 (raw file):

	term := make([]byte, 0, len(t.langBase)+4+len(encodedTerm))
	term = append(term, IdentDelimiter)

You can keep the same as before.

tok/tok.go, line 326 at r5 (raw file):

}
func (t ExactTokenizer) Identifier() byte { return IdentExact }

This can be dealt with here.

Having just one tokenizer would be great.

worker/groups.go, line 1046 at r5 (raw file):

// SubscribeForUpdates will listen for updates for the given group.
func SubscribeForUpdates(prefixes [][]byte, cb func(kvs *badgerpb.KVList), group uint32) {
	return

HACK?

worker/sort.go, line 281 at r5 (raw file):

			token := k.Term
			if predicateHasLang && !queryHasLang && len(token) > 1 &&
				token[1] == tok.IdentDelimiter {

I wouldn't base anything upon the presence of it. But, it generally should help. Because, some term could also have the same character at the same place.

arijitAD

Reviewable status: 3 of 12 files reviewed, 11 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)

tok/tok.go, line 317 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

You can keep the same as before.

Done.

tok/tok.go, line 326 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This can be dealt with here.

Having just one tokenizer would be great.

Done.

worker/groups.go, line 1046 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

HACK?

Done.

worker/sort.go, line 281 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I wouldn't base anything upon the presence of it. But, it generally should help. Because, some term could also have the same character at the same place.

Done.

manishrjain

Push this to v2.0.

Reviewed 2 of 5 files at r4, 2 of 6 files at r8, 3 of 7 files at r10, 3 of 3 files at r11.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)

posting/index.go, line 452 at r11 (raw file):

		// We just need the tokenizer identifier for ExactTokenizer having language.
		// It will be same for all the language.
		tokenizer = tok.GetLangTokenizer(tokenizer, "en")

GetTokenizerForLang(...)

tok/tokens.go, line 45 at r11 (raw file):

			langTag = enLangTag
		}
		return ExactTokenizer{langBase: LangBase(lang), cl: collate.New(langTag),

Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.

arijitAD

Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)

posting/index.go, line 452 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

GetTokenizerForLang(...)

Done.

tok/tokens.go, line 40 at r1 (raw file):

Previously, arijitAD (Arijit Das) wrote…

GetLangTokenzier is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.

Done.

tok/tokens.go, line 45 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.

Done.

arijitAD requested review from manishrjain, pawanrawal and a team as code owners November 25, 2019 13:13

arijitAD requested a review from poonai November 25, 2019 13:13

golangcibot reviewed Nov 25, 2019

View reviewed changes

poonai suggested changes Nov 25, 2019

View reviewed changes

arijitAD commented Nov 26, 2019

View reviewed changes

pawanrawal suggested changes Nov 26, 2019

View reviewed changes

manishrjain approved these changes Nov 27, 2019

View reviewed changes

manishrjain reviewed Nov 27, 2019

View reviewed changes

arijitAD commented Dec 9, 2019

View reviewed changes

arijitAD force-pushed the arijitAD/indexed_sorting branch from 3cd489a to 4c31aba Compare December 9, 2019 10:48

pawanrawal approved these changes Dec 9, 2019

View reviewed changes

arijitAD force-pushed the arijitAD/indexed_sorting branch from 04bc23c to 0f4c344 Compare December 9, 2019 17:21

arijitAD commented Dec 9, 2019

View reviewed changes

arijitAD changed the title ~~Langauge sorting on Indexed data.~~ [BREAKING] Langauge sorting on Indexed data. Jan 15, 2020

arijitAD force-pushed the arijitAD/indexed_sorting branch from 0f4c344 to f64811f Compare January 15, 2020 08:05

arijitAD requested a review from martinmr as a code owner January 20, 2020 13:52

manishrjain suggested changes Jan 20, 2020

View reviewed changes

arijitAD force-pushed the arijitAD/indexed_sorting branch from 8df547e to 7023ed9 Compare January 21, 2020 11:57

MichelDiz changed the title ~~[BREAKING] Langauge sorting on Indexed data.~~ [BREAKING] Language sorting on Indexed data. Jan 22, 2020

arijitAD force-pushed the arijitAD/indexed_sorting branch 6 times, most recently from e7a1aad to c80307a Compare January 23, 2020 12:13

arijitAD commented Jan 23, 2020

View reviewed changes

arijitAD force-pushed the arijitAD/indexed_sorting branch from c80307a to 38601e0 Compare January 23, 2020 12:45

manishrjain approved these changes Jan 23, 2020

View reviewed changes

Arijit Das added 4 commits February 13, 2020 18:40

Langauge sorting on Indexed data.

8e0177c

Fixed formmating issues with gofmt.

11e93be

Minor changes and added more test cases.

0fef33d

Lang test.

f541d14

arijitAD force-pushed the arijitAD/indexed_sorting branch from 38601e0 to 72393ce Compare February 13, 2020 13:10

Add support for language in Exact Tokenizer

5b4e023

arijitAD force-pushed the arijitAD/indexed_sorting branch from 72393ce to 5b4e023 Compare February 13, 2020 13:15

Addressed comments.

91bb608

arijitAD commented Feb 13, 2020

View reviewed changes

Arijit Das added 2 commits February 13, 2020 23:55

Fix prefix in language sorting.

6802dfb

Added few more tests.

b58a581

arijitAD merged commit 3dc6b9a into master Feb 13, 2020

arijitAD deleted the arijitAD/indexed_sorting branch February 13, 2020 19:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BREAKING] Language sorting on Indexed data. #4316

[BREAKING] Language sorting on Indexed data. #4316

arijitAD commented Nov 25, 2019 •

edited

Loading

golangcibot Nov 25, 2019

poonai left a comment

arijitAD left a comment

pawanrawal left a comment

manishrjain left a comment

manishrjain left a comment

arijitAD left a comment

pawanrawal left a comment

arijitAD left a comment

manishrjain left a comment

arijitAD left a comment

manishrjain left a comment

arijitAD left a comment

[BREAKING] Language sorting on Indexed data. #4316

[BREAKING] Language sorting on Indexed data. #4316

Conversation

arijitAD commented Nov 25, 2019 • edited Loading

golangcibot Nov 25, 2019

Choose a reason for hiding this comment

poonai left a comment

Choose a reason for hiding this comment

arijitAD left a comment

Choose a reason for hiding this comment

pawanrawal left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

arijitAD left a comment

Choose a reason for hiding this comment

pawanrawal left a comment

Choose a reason for hiding this comment

arijitAD left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

arijitAD left a comment

Choose a reason for hiding this comment

manishrjain left a comment

Choose a reason for hiding this comment

arijitAD left a comment

Choose a reason for hiding this comment

arijitAD commented Nov 25, 2019 •

edited

Loading