json_query not parsing through all entries in the JSON array #6138

harnitsignalfx · 2019-07-18T16:14:44Z

Hi, I think there is a bug in how JSON array entries are being parsed by JSONQuery. Would really appreciate guidance on whether I've made a configuration error or this is indeed reproducible by other folks.

Relevant telegraf.conf:

[[inputs.tail]]
  files = ["/test.log"]
  watch_method = "poll"
  data_format = "json"
  name_override = "test-key"
  tag_keys = ["first"]
  json_query = "friends"

System info:

Telegraf v 1.11.2
OS: Debian 9

Steps to reproduce:

Follow the example for JSON Query -
https://github.com/influxdata/telegraf/tree/master/plugins/parsers/json#query
So a sample entry in the file would be -

{"friends": [{"first": "Dale", "last": "Murphy", "age": 44},{"first": "Roger", "last": "Craig", "age": 68},{"first": "Jane", "last": "Murphy", "age": 47}]}

Expected behavior:

All entries in the array are parsed (very similar to the posted example)

{"fields":{"age":44},"name":"test-key","tags":{"first":"Dale","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}
{"fields":{"age":68},"name":"test-key","tags":{"first":"Roger","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}
{"fields":{"age":47},"name":"test-key","tags":{"first":"Jane","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}

Actual behavior:

Only the first array entry is being parsed.
{"fields":{"age":44},"name":"test-key","tags":{"first":"Dale","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}

The text was updated successfully, but these errors were encountered:

danielnelson · 2019-07-18T17:36:57Z

This is a bug in the tail plugin caused by some special logic for the csv parser when reading the first line of the file. We'll put together a fix for the next release.

harnitsignalfx · 2019-07-18T17:44:25Z

Appreciate your looking into it. Thanks Daniel!

GeorgeMac · 2019-07-22T10:00:52Z

@danielnelson been playing about with tests on this one and come to realise that there is a little interface design discrepancy going on between the tail plugin, the parser interface and the json implementation.

The parser interface assumers 1 metric per line:

telegraf/plugins/parsers/registry.go

Lines 38 to 58 in 877c423

    
           // Parser is an interface defining functions that a parser plugin must satisfy. 
        
           type Parser interface { 
        
           	// Parse takes a byte buffer separated by newlines 
        
           	// ie, `cpu.usage.idle 90\ncpu.usage.busy 10` 
        
           	// and parses it into telegraf metrics 
        
           	// 
        
           	// Must be thread-safe. 
        
           	Parse(buf []byte) ([]telegraf.Metric, error) 
        
           	// ParseLine takes a single string metric 
        
           	// ie, "cpu.usage.idle 90" 
        
           	// and parses it into a telegraf metric. 
        
           	// 
        
           	// Must be thread-safe. 
        
           	ParseLine(line string) (telegraf.Metric, error) 
        
           	// SetDefaultTags tells the parser to add all of the given tags 
        
           	// to each parsed metric. 
        
           	// NOTE: do _not_ modify the map after you've passed it here!! 
        
           	SetDefaultTags(tags map[string]string) 
        
           }

Parse can take multiple lines and return multiple metrics. Which could potentially be used to circumvent this design constraint. However, as it stands, the tail parsers attempts to parse multiple lines using Parse and after it finds the first line it uses ParseLine from there on out. So a single line like in the above example, which produces > 1 metrics will always discard the rest.

telegraf/plugins/inputs/tail/tail.go

Lines 200 to 213 in 877c423

    
           if firstLine { 
        
           	metrics, err = parser.Parse([]byte(text)) 
        
           	if err == nil { 
        
           		if len(metrics) == 0 { 
        
           			firstLine = false 
        
           			continue 
        
           		} else { 
        
           			m = metrics[0] 
        
           		} 
        
           	} 
        
           	firstLine = false 
        
           } else { 
        
           	m, err = parser.ParseLine(text) 
        
           }

This leads me to think that in the long run perhaps the interface needs to be redesigned in order to acknowledge that a single newline delimited line can produce > 1 metrics.
However, in the near term we could perhaps fall back to using parser.Parse over parser.ParseLine and return all metrics?

danielnelson · 2019-07-22T20:27:42Z

Yes, I'm hoping to remove the ParseLine function from this interface. Since we want to support arbitrary binary formats we shouldn't have a line based method; the containing plugin should handle transport level framing and the parser should understand inter-metric level framing.

First though we need a solution for parser state. The tail plugin is using the ParseLine function to check headers only once per file, and it would be easy to remove this, but we also need to consider when the parser is used by other plugins such as the file plugin.

In the tail case we want to call Parse multiple times but treat the input as a single document, in the file case we want to have each call to Parse be an new document. I think we need to switch to the SetParserFunc method in all plugins so that the plugin can decide when to reset by creating a new instance. Then we can push the state tracking (are we on the first line or not) into the parsers.

GeorgeMac · 2019-07-23T14:29:12Z

I get you. The tail plugin processes line by line. The file plugin processes entire document contents in one. However, both depend upon parser interface.

The parser needs to be able to handle being called either line by line or entire documents and be able to support the effect of a parser changing state when called in a line by line fashion (i.e. csv first line means header is parsed and then that effects all the following metrics).

Currently, the file input plugin uses SetParser which means a single parser is used for multiple documents. Where tail uses SetParserFunc and instantiates a parser per document. We need the file parser to do the same, in order to create a parser per document (file) and ensure that a stateful parser isn't shared between different documents. Which would break in situations like csv parser. Where each document has different headings at the start of each file.

Once we have this we can move the acknowledgement of a header being parsed into the state of the csv parser. Meaning the first call to Parse behaves diferently to the subsequent calls.
Then the tail plugin can do away with calling Parse and then ParseLine. And it can collect all the returned metrics from just repeatedly Parse().

Correct me if I am misunderstanding anything.

danielnelson self-assigned this Jul 18, 2019

danielnelson added the bug unexpected problem or unintended behavior label Jul 18, 2019

danielnelson added this to the 1.11.3 milestone Jul 18, 2019

danielnelson added the area/tail label Jul 18, 2019

danielnelson removed their assignment Jul 18, 2019

GeorgeMac self-assigned this Jul 19, 2019

GeorgeMac mentioned this issue Jul 23, 2019

Allow stateful parsers #6160

Closed

4 tasks

danielnelson modified the milestones: 1.11.3, 1.12.0 Jul 23, 2019

danielnelson assigned danielnelson and unassigned GeorgeMac Aug 20, 2019

danielnelson mentioned this issue Aug 20, 2019

Fix parsing multiple metrics on the first line of tailed file #6289

Merged

3 tasks

danielnelson closed this as completed in #6289 Aug 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json_query not parsing through all entries in the JSON array #6138

json_query not parsing through all entries in the JSON array #6138

harnitsignalfx commented Jul 18, 2019

danielnelson commented Jul 18, 2019 •

edited

Loading

harnitsignalfx commented Jul 18, 2019

GeorgeMac commented Jul 22, 2019

danielnelson commented Jul 22, 2019

GeorgeMac commented Jul 23, 2019 •

edited

Loading

json_query not parsing through all entries in the JSON array #6138

json_query not parsing through all entries in the JSON array #6138

Comments

harnitsignalfx commented Jul 18, 2019

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

danielnelson commented Jul 18, 2019 • edited Loading

harnitsignalfx commented Jul 18, 2019

GeorgeMac commented Jul 22, 2019

danielnelson commented Jul 22, 2019

GeorgeMac commented Jul 23, 2019 • edited Loading

danielnelson commented Jul 18, 2019 •

edited

Loading

GeorgeMac commented Jul 23, 2019 •

edited

Loading