Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] Language sorting on Indexed data. #4316

Merged
merged 8 commits into from
Feb 13, 2020
Merged

Conversation

arijitAD
Copy link

@arijitAD arijitAD commented Nov 25, 2019

This will fix the sorting of an indexed field having lang directive.
Fixes #4005
This PR change the data format for predicate that has lang directive and exact index.
Users upgrading from older version of dgraph must rebuild the index.
Example:
name: string @index(exact) @lang .
Since name predicate has lang directive and exact index. This needs to be deleted and recreated again.
Alter the schema as
name: string @lang .
This will delete the old index data and the alter the schema as
name: string @index(exact) @lang .
This will recreate the index with new data format.


This change is Reviewable

@arijitAD arijitAD requested review from manishrjain, pawanrawal and a team as code owners November 25, 2019 13:13
@arijitAD arijitAD requested a review from poonai November 25, 2019 13:13
tok/tok.go Outdated
@@ -303,6 +307,34 @@ func (t ExactTokenizer) Identifier() byte { return IdentExact }
func (t ExactTokenizer) IsSortable() bool { return true }
func (t ExactTokenizer) IsLossy() bool { return false }

// LangTokenizer returns the exact string along with language prefix as a token.
type LangTokenizer struct {
lang string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File is not gofmt-ed with -s (from gofmt)

Copy link
Contributor

@poonai poonai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @arijitAD, @balajijinnah, @manishrjain, and @pawanrawal)


tok/tok.go, line 308 at r1 (raw file):

func (t ExactTokenizer) Identifier() byte { return IdentExact }
func (t ExactTokenizer) IsSortable() bool { return true }
func (t ExactTokenizer) IsLossy() bool    { return false }

is ExactTokenizer used anywhere?


tok/tokens.go, line 40 at r1 (raw file):

			langTag, _ = language.Parse("en")
		}
		return LangTokenizer{lang: lang, cl: collate.New(langTag), buffer: &collate.Buffer{}}

This look little off to me. ExactTokenizer returning LangTokenizer.
May be just change ExactTokenizer if it not used anywhere


worker/sort.go, line 231 at r1 (raw file):

		lang := order.Langs[0]
		tokenizer = tok.GetLangTokenizer(tokenizer, lang)
		prefix = []byte{tokenizer.Identifier()}

Could you make changes So, that identifier it self gives the proper prefix

Copy link
Author

@arijitAD arijitAD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)


tok/tok.go, line 308 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

is ExactTokenizer used anywhere?

Yes. It is used when we don't use lang directive. It could happen that a predicate doesn't have @lang directive initially and it was added later. Langtokenizer is used when we want exact indexing along with language


tok/tok.go, line 312 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

File is not gofmt-ed with -s (from gofmt)

Done.


tok/tokens.go, line 40 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

This look little off to me. ExactTokenizer returning LangTokenizer.
May be just change ExactTokenizer if it not used anywhere

GetLangTokenzier is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.


worker/sort.go, line 231 at r1 (raw file):

Previously, balajijinnah (balaji) wrote…

Could you make changes So, that identifier it self gives the proper prefix

We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.

Copy link
Contributor

@pawanrawal pawanrawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok though we need to see how to do other indexes like term/fulltext/hash interact with @lang tag and if they work as expected for sorting by lang.

Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, and @manishrjain)


query/common_test.go, line 239 at r2 (raw file):

name_lang					   : string @lang .
lang_type                      : string @index(exact) .
name_lang_index				   : string @index(exact) @lang .

Fix indentation please. Also can't you use name above? It has both @lang directive and exact index.


query/common_test.go, line 242 at r2 (raw file):

alt_name                       : [string] @index(term, exact, trigram) @count .
alias                          : string @index(exact, term, fulltext) .
alias_lang					   : string @index(exact) @lang .

Fix indentation please.


query/query1_test.go, line 207 at r2 (raw file):

		{
			me(func: uid(0x01)) {
				friend(first: 2, orderdesc: alias_lang@en) {

What does sorting by alias@en return now since there are no values tagged with en language? I guess it should return empty result. Can you please add a case for that?


query/query2_test.go, line 936 at r2 (raw file):

}

func TestLanguageOrderIndexed2(t *testing.T) {

Add a third case please which tries to sort by only name_lang_index, that should return an empty result?

Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set. That should also return empty list.

Also add some test cases with pagination and offset.


query/query4_test.go, line 460 at r2 (raw file):

			"q": [
			  {
				"name": "name"

Do you know why is lower case before upper?


tok/tokens.go, line 35 at r2 (raw file):

		// with a different lang.
		return FullTextTokenizer{lang: lang}
	case ExactTokenizer:

What happens when users use hash or term index (instead of exact) with @lang directive? Does sorting work as expected then?


tok/tokens.go, line 36 at r2 (raw file):

		return FullTextTokenizer{lang: lang}
	case ExactTokenizer:
		langTag, err := language.Parse(lang)

When would this return an error and why do we use en as a default. Please add a comment explaining why we do this.


worker/sort.go, line 227 at r2 (raw file):

	var prefix []byte
	if len(order.Langs) > 0 {

Could you please make changes to the gql/parser to return error for queries with use more than one lang and then we can change order.Langs to order.Lang which can just be a string.


worker/sort.go, line 228 at r2 (raw file):

	var prefix []byte
	if len(order.Langs) > 0 {
		// Only one languge is allowed.

language

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got a few comments. Get @pawanrawal and others to approve it.

Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)


tok/tok.go, line 320 at r2 (raw file):

func (t LangTokenizer) Type() string { return "string" }
func (t LangTokenizer) Tokens(v interface{}) ([]string, error) {
	if term, ok := v.(string); ok {
if , ok : = ; !ok {
  return error
}

tok/tokens.go, line 38 at r2 (raw file):

		langTag, err := language.Parse(lang)
		if err != nil {
			langTag, _ = language.Parse("en")

enLangTag should be variable that you compute upfront and just use?

if err != nil { langTag = enLangTag }


worker/sort.go, line 231 at r1 (raw file):

Previously, arijitAD (Arijit Das) wrote…

We don't want to change the Tokenizer interface since there are custom Tokenizers that use this interface.

But, if you know this is lang tokenizer, you could type cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.

lt, ok := tokenizer.(*LangTokenizer)
if !ok { 
  // do something here
}
lt.Prefix()

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't merge for now. We'd need a script to do the migration of data, if a user has a language directive.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, and @lang)

Copy link
Author

@arijitAD arijitAD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 11 files reviewed, 16 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)


query/common_test.go, line 239 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Fix indentation please. Also can't you use name above? It has both @lang directive and exact index.

It can be used but the name already has lot of data indexed and I want to compare two specific string whose order different for different languages.


query/common_test.go, line 242 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Fix indentation please.

Done.


query/query1_test.go, line 207 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

What does sorting by alias@en return now since there are no values tagged with en language? I guess it should return empty result. Can you please add a case for that?

TestToFastJSONOrderName() this test already doesn't that. It doesn't return empty as data is added to alias as well.


query/query2_test.go, line 936 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Add a third case please which tries to sort by only name_lang_index, that should return an empty result?

Also a fourth case trying to sort by name_lang_index@hi but with no values for hi in the data set. That should also return empty list.

Also add some test cases with pagination and offset.

Done.


query/query4_test.go, line 460 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Do you know why is lower case before upper?

Because it is using the bytewise comparator and ASCII upper case has lower value than the upper case.


tok/tok.go, line 320 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…
if , ok : = ; !ok {
  return error
}

Done.


tok/tokens.go, line 35 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

What happens when users use hash or term index (instead of exact) with @lang directive? Does sorting work as expected then?

We support @lang for only sortable tokenizers. Hash and term index are not sortable.


tok/tokens.go, line 36 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

When would this return an error and why do we use en as a default. Please add a comment explaining why we do this.

Done.


tok/tokens.go, line 38 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

enLangTag should be variable that you compute upfront and just use?

if err != nil { langTag = enLangTag }

Done.


worker/sort.go, line 231 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

But, if you know this is lang tokenizer, you could type cast it and then call a function specific to that struct. We shouldn't be exposing this prefix business here.

lt, ok := tokenizer.(*LangTokenizer)
if !ok { 
  // do something here
}
lt.Prefix()

Done.


worker/sort.go, line 227 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Could you please make changes to the gql/parser to return error for queries with use more than one lang and then we can change order.Langs to order.Lang which can just be a string.

I am already returning an error if there are multiple languages. But changing order.Langs to order.Lang might affect multiple files. I can do that as apart of another PR.


worker/sort.go, line 228 at r2 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

language

Done.


worker/sort.go, line 230 at r2 (raw file):

efix = []byte{tokenizer.Identifier()

@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 3cd489a to 4c31aba Compare December 9, 2019 10:48
Copy link
Contributor

@pawanrawal pawanrawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: once we have the migration story figured out, we can release it.

Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)


worker/sort.go, line 234 at r3 (raw file):

		if !ok {
			return resultWithError(errors.Errorf(
				"Invalid tokenizer for language %s.", lang))

Please add the name of the tokenizer to the error as well

@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 04bc23c to 0f4c344 Compare December 9, 2019 17:21
Copy link
Author

@arijitAD arijitAD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 11 files reviewed, 8 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, and @pawanrawal)


worker/sort.go, line 234 at r3 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Please add the name of the tokenizer to the error as well

Done.

@arijitAD arijitAD changed the title Langauge sorting on Indexed data. [BREAKING] Langauge sorting on Indexed data. Jan 15, 2020
@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 0f4c344 to f64811f Compare January 15, 2020 08:05
@arijitAD arijitAD requested a review from martinmr as a code owner January 20, 2020 13:52
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there.

Reviewable status: 3 of 13 files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)


tok/tok.go, line 317 at r5 (raw file):

	term := make([]byte, 0, len(t.langBase)+4+len(encodedTerm))
	term = append(term, IdentDelimiter)

You can keep the same as before.


tok/tok.go, line 326 at r5 (raw file):

}
func (t ExactTokenizer) Identifier() byte { return IdentExact }

This can be dealt with here.

Having just one tokenizer would be great.


worker/groups.go, line 1046 at r5 (raw file):

// SubscribeForUpdates will listen for updates for the given group.
func SubscribeForUpdates(prefixes [][]byte, cb func(kvs *badgerpb.KVList), group uint32) {
	return

HACK?


worker/sort.go, line 281 at r5 (raw file):

			token := k.Term
			if predicateHasLang && !queryHasLang && len(token) > 1 &&
				token[1] == tok.IdentDelimiter {

I wouldn't base anything upon the presence of it. But, it generally should help. Because, some term could also have the same character at the same place.

@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 8df547e to 7023ed9 Compare January 21, 2020 11:57
@MichelDiz MichelDiz changed the title [BREAKING] Langauge sorting on Indexed data. [BREAKING] Language sorting on Indexed data. Jan 22, 2020
@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch 6 times, most recently from e7a1aad to c80307a Compare January 23, 2020 12:13
Copy link
Author

@arijitAD arijitAD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 12 files reviewed, 11 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)


tok/tok.go, line 317 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

You can keep the same as before.

Done.


tok/tok.go, line 326 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

This can be dealt with here.

Having just one tokenizer would be great.

Done.


worker/groups.go, line 1046 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

HACK?

Done.


worker/sort.go, line 281 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I wouldn't base anything upon the presence of it. But, it generally should help. Because, some term could also have the same character at the same place.

Done.

@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from c80307a to 38601e0 Compare January 23, 2020 12:45
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Push this to v2.0.

Reviewed 2 of 5 files at r4, 2 of 6 files at r8, 3 of 7 files at r10, 3 of 3 files at r11.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @arijitAD, @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)


posting/index.go, line 452 at r11 (raw file):

		// We just need the tokenizer identifier for ExactTokenizer having language.
		// It will be same for all the language.
		tokenizer = tok.GetLangTokenizer(tokenizer, "en")

GetTokenizerForLang(...)


tok/tokens.go, line 45 at r11 (raw file):

			langTag = enLangTag
		}
		return ExactTokenizer{langBase: LangBase(lang), cl: collate.New(langTag),

Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.

@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 38601e0 to 72393ce Compare February 13, 2020 13:10
@arijitAD arijitAD force-pushed the arijitAD/indexed_sorting branch from 72393ce to 5b4e023 Compare February 13, 2020 13:15
Copy link
Author

@arijitAD arijitAD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on @balajijinnah, @golangcibot, @lang, @manishrjain, @martinmr, and @pawanrawal)


posting/index.go, line 452 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

GetTokenizerForLang(...)

Done.


tok/tokens.go, line 40 at r1 (raw file):

Previously, arijitAD (Arijit Das) wrote…

GetLangTokenzier is expected to return the lang tokenizer corresponding to the language if exact indexing is used. LangTokenizer is also an ExactTokenizer.

Done.


tok/tokens.go, line 45 at r11 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add a Note here that this can potentially get expensive memory-wise, so keep an eye out. If so, then we should convert it to a sync.Pool.

Done.

@arijitAD arijitAD merged commit 3dc6b9a into master Feb 13, 2020
@arijitAD arijitAD deleted the arijitAD/indexed_sorting branch February 13, 2020 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Values are not sorted properly
5 participants