packetbeat nightly151201180656 crashing psql #565

opb1978 · 2015-12-18T22:14:26Z

just updated to packetbeat 1.0.1 and checked if the issue #342 is now fixed in this version.

after running for about 10 minutes I got this error in the log file:

2015-12-18T23:07:11.921545+01:00 somehost /usr/bin/packetbeat[13560]: log.go:114: Stacktrace:
    /go/src/github.com/elastic/beats/libbeat/logp/log.go:114 (0x48c5c6)
    /usr/local/go/src/runtime/asm_amd64.s:437 (0x47d8fe)
    /usr/local/go/src/runtime/panic.go:423 (0x44d4f9)
    /usr/local/go/src/runtime/panic.go:18 (0x44ba39)
    /go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:279 (0x512203)
    /go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:610 (0x5146de)
    /go/src/github.com/elastic/beats/packetbeat/protos/pgsql/pgsql.go:707 (0x51515d)
    /go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:87 (0x521093)
    /go/src/github.com/elastic/beats/packetbeat/protos/tcp/tcp.go:173 (0x5221cd)
    /go/src/github.com/elastic/beats/packetbeat/decoder/decoder.go:136 (0x6c8ad1)
    /go/src/github.com/elastic/beats/packetbeat/sniffer/sniffer.go:352 (0x5337a9)
    /go/src/github.com/elastic/beats/packetbeat/packetbeat.go:212 (0x422f2b)
    /usr/local/go/src/runtime/asm_amd64.s:1696 (0x47fc41)

This still seems to be a problem here.

I can do a capture again if needed!

tsg · 2015-12-18T22:34:30Z

@opb1978 it would be really great if you could!

opb1978 · 2015-12-18T22:43:46Z

no problem, should I send it again to @andrewkroh ? I would like to gpg encrypt the file before sending...

tsg · 2015-12-18T22:47:32Z

Yeah, if you already have his gpg key, then that would be easiest. Thanks!

andrewkroh · 2015-12-18T22:50:09Z

I think I incorrectly tagged #342 with 1.0.1. I don't think the #494 fix was incorporated into 1.0.1, but is instead tagged with 1.1.0.

You could try the nightly to see if the bug is fixed there. We (mostly @urso) developed the fix based on the PCAP that was provided.

opb1978 · 2015-12-18T23:00:45Z

I actually tried the nightly builds once before, but the version string in the nightly builds seems to be wrong and I did not want to repack the Debian package.

Maybe you could have a look into this some time. I will do the repack by hand now:

dpkg: error processing archive packetbeat_nightly.latest_amd64.deb (--install):
parsing file '/var/lib/dpkg/tmp.ci/control' near line 2 package 'packetbeat':
error in 'Version' field string 'nightly151201180656': version number does not start with digit
Errors were encountered while processing:
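
For reference, a minimal sketch of what such a by-hand repack could look like. dpkg rejects the package because the Version field does not start with a digit, so the helper below prefixes it. The function name fix_version and the "0~" prefix are assumptions made for this illustration, not what the Beats packaging does.

```shell
#!/bin/sh
# fix_version: read a Debian control file on stdin and write it to stdout,
# prefixing the Version field with "0~" when it does not start with a digit.
# The "0~" prefix is an arbitrary choice for this sketch.
fix_version() {
    awk '/^Version: [^0-9]/ { sub(/^Version: /, "Version: 0~") } { print }'
}
```

One possible way to use it, assuming dpkg-deb is available: unpack the archive with `dpkg-deb -R packetbeat_nightly.latest_amd64.deb pkg`, run `fix_version < pkg/DEBIAN/control > control.new && mv control.new pkg/DEBIAN/control`, then rebuild with `dpkg-deb -b pkg packetbeat_repacked.deb`. The output file name is hypothetical.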

andrewkroh · 2015-12-18T23:07:08Z

Yeah, sorry, that is still an open issue. elastic/beats-packer#40

opb1978 · 2015-12-21T11:37:05Z

@andrewkroh I did a retest with nightly151201180656 and am still having a psql problem. I will send you a download link for the pcap file.

andrewkroh · 2015-12-21T19:17:28Z

We tried your PCAP and were not able to reproduce using the latest build from master. The nightly build that you used is from 2015-12-01 (based on the filename) and the fix was not introduced until 2015-12-10.

We only store the past 2 weeks of nightly builds, so did you possibly use a version that you had downloaded in the past?
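
To check which build a given package comes from, the date can be read off the version string. The sketch below assumes the string has the form nightlyYYMMDDhhmmss, which matches the 2015-12-01 reading above; the helper names nightly_date and built_before are made up for this example.

```shell
#!/bin/sh
# nightly_date: print the build timestamp encoded in a nightly version string.
# Assumption: the string looks like nightlyYYMMDDhhmmss.
nightly_date() {
    echo "$1" | sed -E 's/^nightly([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/20\1-\2-\3 \4:\5:\6/'
}

# built_before: succeed if the build day is strictly earlier than the
# YYYY-MM-DD date given as second argument. ISO dates sort correctly as
# plain strings, so sort(1) is enough for the comparison.
built_before() {
    build_day=$(nightly_date "$1" | cut -d' ' -f1)
    first=$(printf '%s\n%s\n' "$build_day" "$2" | sort | head -n 1)
    [ "$first" = "$build_day" ] && [ "$build_day" != "$2" ]
}
```

For example, `built_before nightly151201180656 2015-12-10 && echo "build predates the fix"` would flag the build discussed here.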

urso · 2015-12-21T19:37:01Z

@opb1978 please get the most recent nightly. I checked the builds, and those up to 2015-12-10 reproduce the original panic. More recent builds should be fine.

opb1978 · 2015-12-21T20:26:05Z

Sorry for the confusion, I will retry with the latest nightly build. You were right: I had downloaded it earlier and repacked the wrong version. Will update here soon!

opb1978 · 2015-12-22T11:15:16Z

I did some retesting with the nightly build and got some errors again. @andrewkroh I sent you a download link yesterday.

andrewkroh · 2015-12-31T17:31:48Z

Right after I received the latest PCAP, I gave it a try (but I forgot to update this issue). I was not able to reproduce any panics with it.

urso also tried the PCAP and could not reproduce.

andrewkroh · 2015-12-31T17:53:22Z

We were just chatting about this and @urso came up with a theory that we should investigate further. It might explain why we cannot reproduce it from the PCAP while you are seeing an issue in production. If the pgsql transactions grow larger than 10MB, the stream is dropped. If there is some faulty state management at that point (i.e. the state is not reset properly), this could lead to issues.

opb1978 · 2015-12-31T18:11:36Z

Do you need any more tests for tracking down this problem? We could also do some remote testing on our systems.

urso · 2016-01-04T12:17:23Z

Thanks for your help. Unfortunately my theory was wrong.

To track down the issue I need a trace reliably reproducing the issue, so I can minify the trace until I can identify the problem.

One can test a pcap in bash/zsh with:

$ packetbeat -e -N -t -I trace.pcap |& grep panic

We can build a small script creating and testing a dump for some Stacktrace:

#!/bin/sh

IFC=${1:-eth0}
PCAP=${2:-trace.pcap}

check() {
    packetbeat -e -t -N -I "$1" 2>&1 | grep "Stacktrace" | wc -l
}

while true; do
    # the timeout command could be used instead; avoided here to stay
    # backward compatible and support OS X
    tcpdump -i "$IFC" -w "$PCAP" 'tcp port 5432 or tcp port 5431' &
    job=$!
    sleep 60
    kill $job; wait $job

    echo "check for Stacktrace"
    count=$(check "$PCAP")
    if [ "$count" -ge 1 ]; then
       echo "found Stacktrace. Quitting"
       break
    fi
done

This script captures a trace for 60 seconds, then checks whether the trace triggers an error by running packetbeat with -t -N -I trace.pcap. -N and -I guarantee that this packetbeat instance reads packets from the trace file only and does not forward any events to elasticsearch/logstash. You can run the script next to your running packetbeat instance (memory, disk, and CPU will still be used to create the trace). Update the check function and time intervals if required.

opb1978 · 2016-01-04T17:06:52Z

I have been running this script for hours now with no errors. If I start packetbeat normally again, the problem occurs after a few minutes.

I have been capturing on interface "any" because this is how packetbeat would be running. I have put the script into a screen, maybe it will produce the problem after some time.

Just a guess: maybe the problem occurs while transferring to Elasticsearch? As I disabled the normal process (it was causing too many errors and SMS alerts), we could start the replay of the pcap file with transfer to Elasticsearch enabled. I can easily remove these shards again.

urso · 2016-01-04T20:17:54Z

Hmm... the bug seems to be hiding. The problem is, if we run with '-t', we alter timestamps and timing behavior.

So another option would be to modify the script as follows:

1. remove '-t -N' from the check function => transactions are sent to elasticsearch + timing more similar to original capture.
2. increase capture duration: sleep $(($DURATION * 60))
3. use interface 'any' (using any with tcpdump will not set the devices into promiscuous mode) or list all devices with '-i name'

Doing changes 1 and 2, the script will capture traffic for DURATION minutes and afterwards check the pcap for DURATION minutes.
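
Put together, the three changes might look like the sketch below. It is only an illustration of the suggestions above: the DURATION default of 10 minutes, reading IFC, PCAP and DURATION from the environment, and the "run" argument guard are assumptions, not part of the original script.

```shell
#!/bin/sh
# Sketch of the modified capture-and-check loop (assumed defaults below).
IFC=${IFC:-any}              # change 3: capture on all interfaces
PCAP=${PCAP:-trace.pcap}
DURATION=${DURATION:-10}     # capture length in minutes (assumed default)

check() {
    # change 1: no -t and no -N, so the replay keeps the capture timing
    # and events are forwarded to the configured output
    packetbeat -e -I "$1" 2>&1 | grep "Stacktrace" | wc -l
}

capture() {
    tcpdump -i "$IFC" -w "$PCAP" 'tcp port 5432 or tcp port 5431' &
    job=$!
    sleep $((DURATION * 60)) # change 2: longer capture window
    kill "$job"; wait "$job"
}

run() {
    while true; do
        capture
        echo "check for Stacktrace"
        count=$(check "$PCAP")
        if [ "$count" -ge 1 ]; then
            echo "found Stacktrace. Quitting"
            break
        fi
    done
}

# Start the loop only when called with the argument "run".
if [ "${1:-}" = "run" ]; then
    run
fi
```

Invoked as, for example, `DURATION=30 sh capture-check.sh run` (the file name is hypothetical), each iteration then takes roughly twice DURATION minutes: one DURATION for the capture and one for the replay.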

urso · 2016-01-22T20:51:32Z

Good news, I've finally been sent a trace reproducing the error. Put quite some effort into hardening the pgsql parser today. See #825

tsg added bug Packetbeat labels Dec 18, 2015

opb1978 changed the title ~~packetbeat 1.0.1 crashing psql~~ packetbeat nightly151201180656 crashing psql Dec 21, 2015

andrewkroh assigned urso Dec 31, 2015

urso mentioned this issue Jan 22, 2016

harden pgsql parser #825

Merged

andrewkroh closed this as completed in #825 Jan 25, 2016
