Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Tag index #729

Merged
merged 72 commits into from
Oct 13, 2017
Merged

Tag index #729

merged 72 commits into from
Oct 13, 2017

Conversation

replay
Copy link
Contributor

@replay replay commented Sep 14, 2017

Adds a tag index, together with the functions to query it.
It does not expose any API endpoints yet, those will follow in a separate PR (from branch https://github.com/grafana/metrictank/compare/tag_api_endpoints).

tags[tagSplits[0]] = make(map[string][]string)
}

tags[tagSplits[0]][tagSplits[1]] = append(tags[tagSplits[0]][tagSplits[1]], def.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to init tags[tagSplits[0]][tagSplits[1]] first? this code will be much more readable if you say tagName := tagSplits[0] tagValue := tagSplits[1]

Copy link
Contributor Author

@replay replay Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be necessary to init slices i think, only maps. Ok, i'll name them

return m.idsByTagQuery(tree, query)
}

func (m *MemoryIdx) idsByTagQuery(tags map[string]map[string][]string, query tag.TagQuery) []string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function supposed to be returning the ids that match ANY expression or ALL expressions? Seems like it's ANY right now. I suppose this function is designed to be composable to support ALLs?

Copy link
Contributor

@DanCech DanCech Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@replay was confused, it should be ALL. He's sorting that out now.

From an end-user perspective, ANY matches can be done via a combination of using regular expression matches and/or using multiple calls to seriesByTag and combining the results with group

Copy link
Contributor Author

@replay replay Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@replay replay force-pushed the tag_index branch 3 times, most recently from d7dcf96 to d375fb9 Compare September 15, 2017 10:14
value := exprSplits[0][3]

// always anchor all regular expressions at the beginning
if operator[len(operator)-1] == byte('~') && len(value) > 0 && value[len(value)-1] != byte('^') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value[0] != byte('^')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// special case of empty value
if len(value) == 0 {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a no-op

Copy link
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, still going to add that. the idea is that i'll just reverse the operation.

so key!=~ becomes key=~.* and key=~ becomes key!=~.*

Copy link
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, further down where the matching is done, i could also shortcut it to not even do the matching if the pattern is .* or .+. (for .+ this works assuming that no value that's in the index can be "")

return ids, nil
}
return make(map[string]struct{}), nil
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be return idx[key][value], nil?

Copy link
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return nil, err
}

for v, ids := range values {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question, it seems like this could just be for v, ids := range idx[key] {

return nil, errInvalidQuery
}

exprSplits := expressionRe.FindAllStringSubmatch(expr, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value = ".+"
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do better in handling terms that match the empty string, since we have a requirement that there must be a tag expression that doesn't match the empty string. Because of that requirement, we should be able to make a list of candidate series from the expression that requires a match, then use those as the list of "all" series that should be considered to match if they don't have the tags that we're matching the empty string for. I didn't implement that concept in the graphite tagdbs, but I'm going to go back and see whether I can use it.

Copy link
Contributor Author

@replay replay Sep 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i'm not completely sure i understand you right. but actually i think this already does what you describe:

  • further down queries are divided into "selects" and "filters". any query that's excluding something is a "filter", any query that's selecting by a match is a "select".
  • a query like key= is translated into key!=.+, this is a filter
  • the "select" queries are always executed first, building that list of candidates you mention
  • after the list of candidates is built, the "filters" are applied

we can further speed this up by not even matching the regex .+ because we can't have empty tag values in the index. so every series that has the tag key will match the value .+, so we can shortcut it and skip the regex match.

@replay replay force-pushed the tag_index branch 2 times, most recently from ebc48cf to e30aee4 Compare September 29, 2017 15:11
@Dieterbe
Copy link
Contributor

FTR: prior discussion: #532
feel free to link to any other tickets or discussions

@replay replay changed the title [WIP] (do not merge yet) Tag index Tag index Oct 1, 2017
@replay replay force-pushed the tag_index branch 3 times, most recently from 7402db6 to a4ee0e2 Compare October 1, 2017 10:45
{"key1=~value[0-9]", "key2=~", "key3!=value3"},
{"key2=", "key1=value1"},
}
expecting := []int{3, 1, 4, 1, 0, 1, 2, 0, 1, 2}
Copy link
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make 1 slice of cases, where each case shows its expression and which metrics are expected to match.
just checking the number of matches can hide bugs, is hard for readers, and having 2 structures is a bit confusing


func (c *CasIdx) IdsByTagExpressions(orgId int, expressions []string) ([]string, error) {
return c.MemoryIdx.IdsByTagExpressions(orgId, expressions)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these additions should not be necessary. go compiler should automatically wire these through since MemoryIdx is embedded into the CasIdx struct (that's why Find, List, etc work also on this index)

@@ -42,19 +42,24 @@ var (
// metric idx.metrics_active is the number of currently known metrics in the index
statMetricsActive = stats.NewGauge32("idx.metrics_active")

Enabled bool
Enabled bool
TagSupport bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to export this i think

func (m *MemoryIdx) addTags(def *schema.MetricDefinition) {
tags, ok := m.Tags[def.OrgId]
if !ok {
tags = make(map[string]map[string]map[string]struct{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make(TagIndex) here

tagValue := tagSplits[1]

if _, ok = tags[tagName]; !ok {
tags[tagName] = make(map[string]map[string]struct{})
Copy link
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should give these substructures a type, so that it's easier/clearer to create them

Copy link
Contributor Author

@replay replay Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be good, but I can't come up with a good name :)
The top level structure is currently named TagIndex. I suggest the next one could be TagKey and it's nested values would be called TagValue. So each TagValue would then contain a set of IDs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

TagIndex -> TagValues -> MetricDefinitions

@@ -130,6 +144,69 @@ func (m *MemoryIdx) Update(entry idx.Archive) {
m.Unlock()
}

func (m *MemoryIdx) addTags(def *schema.MetricDefinition) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func name is a bit confusing. we're indexing tags (or "adding tags to the index") for a metricdefinition that already has tags. so maybe call this indexTags or at least at a function comment explaining what this does. thanks :)

}
}

func (m *MemoryIdx) delTags(def *schema.MetricDefinition) {
Copy link
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as above. clarify a bit via function comment, thanks (could be confused for "delete tags from a metric")

@Dieterbe
Copy link
Contributor

Dieterbe commented Oct 1, 2017

my main concern here is the amount of strings used. every string has a pointer internally, and GC workload is directly proportional to amount of live pointers on the heap. and we already have a too-high number of strings and pointers due to legacy reasons (eg all the AggMetric stuff, and other stuff in the current index), which results in a high GC workload, which combined with golang/go#14812 results occassionally in poor latencies.

so, I think we need a way to verify if this concern holds any ground, with minimal time investment.

should be pretty trivial to spin up a docker stack, use fakemetrics for ingest workload, and vegeta for query workload, and compare both cases. I can help with the specifics, i've done this a bunch of times before. see also #440 and https://github.com/grafana/metrictank/blob/7a508e43e9b48af9ecdf1723c91dd75e1b4ca212/docker/docker-cluster/load.sh

idx/idx.go Outdated
@@ -112,4 +112,7 @@ type MetricIndex interface {
Find(int, string, int64) ([]Node, error)
List(int) []Archive
Prune(int, time.Time) ([]Archive, error)
TagList(int, uint32) []string
Tag(int, string) map[string]uint32
IdsByTagExpressions(int, []string) ([]string, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 3 functions are not called by anything, it's not really clear what they are for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding comments

)

func ConfigSetup() {
memoryIdx := flag.NewFlagSet("memory-idx", flag.ExitOnError)
memoryIdx.BoolVar(&Enabled, "enabled", false, "")
memoryIdx.BoolVar(&tagSupport, "tag-support", false, "")
memoryIdx.IntVar(&matchCacheSize, "match-cache-size", 1000, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fill in help message, and add these settings, along with the same help message and default value to metrictank-sample.ini, then run the sync-configs script to apply it to the other files as well. and config-to-doc.sh

Dieterbe and others added 7 commits October 12, 2017 19:23
no need to loop twice
no need to do map lookups if key doesn't even match
only need to track values, not entire k=v pair
* add a corruption case metric+log
* standardize the format of all recoverable internal errors
  makes a simpler, cleaner namespace, on which we can easily alert.
* standardize terms:
  - corrupt for internal datastructures
  - invalid for user data
* no need to export vars
@@ -328,11 +318,19 @@ func (q *TagQuery) filterByMatch(resultSet TagIDs, byId map[string]*idx.Archive,
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just break? There shouldn't be another tag value for this tag key, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm github is not showing anything besides the continue line. are you talking about the case where we found an entry in notMatchingTags ?

Copy link
Contributor Author

@replay replay Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shanson7 you're right. each def should only have one value per key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i gave the inner loop a name for consistency with the outer one: 6fb73bf

@Dieterbe
Copy link
Contributor

@shanson7 this will be merged very soon now, unless you have any more feedback :)
main thing missing is we need to pull in the tag input validation raintank/schema#11

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants