Performance regression and seg faults moving from 0.9.5 to 1.0 #551

Closed
joe-keen opened this issue Feb 17, 2016 · 8 comments

@joe-keen

I have a test script that uses a KeyedProducer and a SimpleConsumer to test consumer performance at various fetch sizes, starting at 50K and increasing to 2M in increments of 25K.

In 0.9.5 I see a read speed from the SimpleConsumer of 50K to 60K messages per second, depending on the fetch size. In 1.0, using the same test script and the same objects, I see speeds in the 7K to 10K range, and it segfaults before it reaches the 2M fetch size. I've seen segfaults at fetch sizes as low as 225K.

Attached is the test script I use.
kafka_test.zip
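
(For readers without the attachment: a minimal sketch of the kind of throughput test described above, assuming the old SimpleConsumer / KeyedProducer API. This is not the attached script, and the exact fetch-size parameters it tweaks, plus the topic and group names, are assumptions for illustration.)

```python
import time

from kafka import KafkaClient, KeyedProducer, SimpleConsumer

client = KafkaClient('localhost:9092')

# Pre-load the topic with test messages via the keyed producer.
producer = KeyedProducer(client)
for i in range(100000):
    producer.send_messages('perf-topic', ('key-%d' % (i % 10)).encode('ascii'),
                           b'x' * 100)

# Measure consumer read speed at fetch sizes from 50K to 2M in 25K steps.
for fetch_size in range(50 * 1024, 2 * 1024 * 1024 + 1, 25 * 1024):
    consumer = SimpleConsumer(client, 'perf-group', 'perf-topic',
                              fetch_size_bytes=fetch_size,
                              buffer_size=fetch_size,
                              max_buffer_size=None)
    consumer.seek(0, 0)  # rewind to the head of the topic
    start = time.time()
    messages = consumer.get_messages(count=50000, timeout=30)
    elapsed = time.time() - start
    print('fetch_size=%d: %.0f msgs/sec' % (fetch_size, len(messages) / elapsed))
```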

@dpkp
Owner

dpkp commented Feb 17, 2016

A segfault is very strange given that the module is pure Python. The first thing that jumps out at me is ujson, which is a wrapper around a C library; I would expect that to be the most likely cause of segfaults in your script.

Also note that you do not need to import differently in 0.9.5 v. 1.0.0 -- just import everything from kafka instead of kafka.client / kafka.consumer / kafka.producer.
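
Concretely (an illustrative snippet; both forms refer to the same old-style classes):

```python
# 0.9.5-era style: importing from the submodules.
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer
from kafka.producer import KeyedProducer

# Works the same in 0.9.5 and 1.0.0: import from the top-level package.
from kafka import KafkaClient, SimpleConsumer, KeyedProducer
```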

Thanks for the test script, I'll play around and see what I find.

@joe-keen
Author

The import difference was to be sure we were using the same client that was available in 0.9.5. ujson shouldn't have had any effect; the segfault happened a few minutes after the last use of it. I was testing this in a Vagrant environment, which you can use for your own testing if you want. I was able to reproduce the segfault in multiple VMs built from the same Vagrantfile.

@dpkp
Owner

dpkp commented Feb 18, 2016

The performance regression seems to be due to the SO_RCVBUF handling that was added for KafkaConsumer / KafkaProducer. I can patch that to behave as before when using the old interfaces. That patch gives me 50K messages per sec in my local testing.
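
(For context, a rough sketch of what explicit SO_RCVBUF sizing on a plain socket looks like in Python; this is not the kafka-python code or the patch itself. On Linux, setting SO_RCVBUF explicitly disables the kernel's receive-buffer autotuning, which is one way a fixed default can end up much smaller than what the kernel would otherwise grow the buffer to.)

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Explicitly request a receive buffer size; the kernel may adjust this
# (Linux typically doubles it to account for bookkeeping overhead).
requested = 32768
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Inspect what was actually granted.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print('requested %d bytes, kernel granted %d bytes' % (requested, granted))
```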

I have not been able to reproduce the segfault. I remain skeptical because there is no C code in kafka-python. Any further information you can provide on that would be helpful. Otherwise, I'm not planning to investigate that part any further.

@joe-keen
Author

I'm still able to make it segfault consistently in my Vagrant environment, even with ujson replaced with json. I can't seem to get a viable core file out of it; most of them are truncated for some reason. With my most recent Vagrant build, using Kafka version kafka_2.9.2-0.8.1.1, I was unable to get a segfault initially. I placed the test script I provided in a bash while loop, and after a short while it was segfaulting on every run. That behaviour remains after letting the Vagrant environment sit idle overnight.

The fd leak you fixed yesterday looks like it would be pretty minor if I'm reading it correctly. Do you think it's worth testing the kafka-python client with that fix to see if the segfaults go away?

If you want to try using the same environment you can use it with the following steps:

  1. Clone https://github.com/openstack/monasca-api
  2. cd to monasca-api/devstack
  3. vagrant up
  4. Wait about 20 minutes
  5. Once the Vagrant environment is loaded, Kafka and ZooKeeper are up and running, and your home directory is mounted under /vagrant_home.
  6. To trigger the segfaults, I simply create a virtual environment, install kafka-python 1.0, and run the test script I provided.

@dpkp
Owner

dpkp commented Feb 18, 2016

The fd fix only affects the new classes. You are testing with the old classes, so I don't think it will matter. You might try installing with PR 557 though. I believe that should at least improve the performance you are seeing.

@joe-keen
Author

I captured a stack trace that shows the program segfaulting while trying to construct a new tuple. At the time it segfaulted it had allocated slightly over 9GB of RAM, so it looks like an out-of-memory error. If I can figure out how to get the gdb Python extensions installed in my environment, I'll be able to give more information.

The stack trace for what it's worth:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000537388 in _PyObject_GC_Malloc ()
(gdb) where
#0 0x0000000000537388 in _PyObject_GC_Malloc ()
#1 0x000000000052b59d in PyTuple_New ()
#2 0x000000000049a130 in PyEval_EvalFrameEx ()
#3 0x00000000004a090c in PyEval_EvalCodeEx ()
#4 0x0000000000499a52 in PyEval_EvalFrameEx ()
#5 0x00000000004a090c in PyEval_EvalCodeEx ()
#6 0x000000000049ab45 in PyEval_EvalFrameEx ()
#7 0x00000000004a090c in PyEval_EvalCodeEx ()
#8 0x000000000049ab45 in PyEval_EvalFrameEx ()
#9 0x00000000004a090c in PyEval_EvalCodeEx ()
#10 0x000000000049ab45 in PyEval_EvalFrameEx ()
#11 0x00000000004a090c in PyEval_EvalCodeEx ()
#12 0x000000000049ab45 in PyEval_EvalFrameEx ()
#13 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#14 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#15 0x00000000004a090c in PyEval_EvalCodeEx ()
#16 0x0000000000499a52 in PyEval_EvalFrameEx ()
#17 0x00000000004a090c in PyEval_EvalCodeEx ()
#18 0x000000000049ab45 in PyEval_EvalFrameEx ()
#19 0x00000000004a090c in PyEval_EvalCodeEx ()
#20 0x0000000000499a52 in PyEval_EvalFrameEx ()
#21 0x00000000004a090c in PyEval_EvalCodeEx ()
#22 0x000000000049ab45 in PyEval_EvalFrameEx ()
#23 0x00000000004a090c in PyEval_EvalCodeEx ()
#24 0x0000000000499a52 in PyEval_EvalFrameEx ()
#25 0x00000000004a090c in PyEval_EvalCodeEx ()
#26 0x0000000000499a52 in PyEval_EvalFrameEx ()
#27 0x00000000004a2d2e in ?? ()
#28 0x000000000049990f in PyEval_EvalFrameEx ()
#29 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#30 0x00000000004a1634 in ?? ()
#31 0x000000000044e4a5 in PyRun_FileExFlags ()
#32 0x000000000044ec9f in PyRun_SimpleFileExFlags ()
#33 0x000000000044f904 in Py_Main ()
#34 0x00007ffff7818ec5 in __libc_start_main (main=0x44f9c2 <main>, argc=2, argv=0x7fffffffe5c8,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe5b8) at libc-start.c:287
#35 0x0000000000578c4e in _start ()

It looks like it's running into the Out Of Memory killer, which explains why you can't reproduce it. My Vagrant VM has fairly limited resources.

[ 5285.531613] Out of memory: Kill process 16916 (python) score 759 or sacrifice child
[ 5285.534207] Killed process 16916 (python) total-vm:9823872kB, anon-rss:9717040kB, file-rss:0kB

Running top alongside my test script, I can see it allocate several hundred MB every five seconds or so.
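
(If it helps to quantify that growth from inside the process rather than from top, here is a small standard-library sketch; `consumer` is assumed to be the SimpleConsumer from the test script.)

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def consume_with_memory_report(consumer, batches=100, batch_size=50000):
    """Drain batches from the consumer, printing peak RSS after each one."""
    print('start: peak RSS %.1f MB' % peak_rss_mb())
    for batch in range(batches):
        consumer.get_messages(count=batch_size, timeout=30)
        print('after batch %d: peak RSS %.1f MB' % (batch, peak_rss_mb()))
```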

@dpkp
Owner

dpkp commented Mar 17, 2016

@joe-keen 1.0.2 was just released. If you have time, you might test it out and see if you see the same problems.

@dpkp
Owner

dpkp commented May 22, 2016

I am unable to reproduce -- please reopen if you see similar behavior on the latest release.

@dpkp dpkp closed this as completed May 22, 2016
tanaypf9 pushed a commit to tanaypf9/pf9-requirements that referenced this issue May 20, 2024
kafka-python version 1.0 was just released and has several issues.
There are performance regressions. See,

    dpkp/kafka-python#551

and the interface has changed.

Safest to set the upper bound to < 1.0 until issues are resolved.

Change-Id: I5921d1e7659fd41cff7bd5cddad4db0c015e671b
tanaypf9 pushed a commit to tanaypf9/pf9-requirements that referenced this issue May 20, 2024
Patch Set 1:

Some of the issues I'm concerned about:

dpkp/kafka-python#674

dpkp/kafka-python#686

dpkp/kafka-python#579

dpkp/kafka-python#551

Patch-set: 1