Performance regression and seg faults moving from 0.9.5 to 1.0 #551

Closed
joe-keen opened this issue Feb 17, 2016 · 8 comments

@joe-keen

I have a test script that uses a KeyedProducer and a SimpleConsumer to test consumer performance at various fetch sizes, starting at 50K and increasing to 2M in increments of 25K.

In 0.9.5 I see a read speed from the SimpleConsumer of 50K to 60K messages per second, depending on the fetch size. In 1.0, using the same test script and the same objects, I see speeds in the 7K to 10K range, and it segfaults before it reaches the 2M fetch size. I've seen segfaults at fetch sizes as low as 225K.

Attached is the test script I use.
kafka_test.zip
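
(For readers without the attachment: a minimal sketch of the kind of throughput test described above, assuming the old SimpleConsumer / KeyedProducer API. This is not the attached script, and the exact fetch-size parameters it tweaks, plus the topic and group names, are assumptions for illustration.)

```python
import time

from kafka import KafkaClient, KeyedProducer, SimpleConsumer

client = KafkaClient('localhost:9092')

# Pre-load the topic with test messages via the keyed producer.
producer = KeyedProducer(client)
for i in range(100000):
    producer.send_messages('perf-topic', ('key-%d' % (i % 10)).encode('ascii'),
                           b'x' * 100)

# Measure consumer read speed at fetch sizes from 50K to 2M in 25K steps.
for fetch_size in range(50 * 1024, 2 * 1024 * 1024 + 1, 25 * 1024):
    consumer = SimpleConsumer(client, 'perf-group', 'perf-topic',
                              fetch_size_bytes=fetch_size,
                              buffer_size=fetch_size,
                              max_buffer_size=None)
    consumer.seek(0, 0)  # rewind to the head of the topic
    start = time.time()
    messages = consumer.get_messages(count=50000, timeout=30)
    elapsed = time.time() - start
    print('fetch_size=%d: %.0f msgs/sec' % (fetch_size, len(messages) / elapsed))
```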

@dpkp
Owner

dpkp commented Feb 17, 2016

A segfault is very strange given that the module is pure Python. The first thing that jumps out at me is ujson, which is a wrapper around a C library; I would expect that to be the most likely cause of segfaults in your script.

Also note that you do not need to import differently in 0.9.5 v. 1.0.0 -- just import everything from kafka instead of kafka.client / kafka.consumer / kafka.producer.
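
Concretely (an illustrative snippet; both forms refer to the same old-style classes):

```python
# 0.9.5-era style: importing from the submodules.
from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer
from kafka.producer import KeyedProducer

# Works the same in 0.9.5 and 1.0.0: import from the top-level package.
from kafka import KafkaClient, SimpleConsumer, KeyedProducer
```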

Thanks for the test script, I'll play around and see what I find.

@joe-keen
Author

The import difference was to be sure we were using the same client that was available in 0.9.5. ujson shouldn't have had any effect; the segfault happened a few minutes after the last use of it. I was testing this in a Vagrant environment, which you can use for your own testing if you want. I was able to reproduce the segfault in multiple VMs built from the same Vagrantfile.

@dpkp
Owner

dpkp commented Feb 18, 2016

The performance regression seems to be due to the SO_RCVBUF handling that was added for KafkaConsumer / KafkaProducer. I can patch that to behave as before when using the old interfaces. That patch gives me 50K messages per sec in my local testing.
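
(For context, a rough sketch of what explicit SO_RCVBUF sizing on a plain socket looks like in Python; this is not the kafka-python code or the patch itself. On Linux, setting SO_RCVBUF explicitly disables the kernel's receive-buffer autotuning, which is one way a fixed default can end up much smaller than what the kernel would otherwise grow the buffer to.)

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Explicitly request a receive buffer size; the kernel may adjust this
# (Linux typically doubles it to account for bookkeeping overhead).
requested = 32768
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Inspect what was actually granted.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print('requested %d bytes, kernel granted %d bytes' % (requested, granted))
```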

I have not been able to reproduce the segfault. I remain skeptical because there is no C code in kafka-python. Any further information you can provide on that would be helpful. Otherwise, I'm not planning to investigate that part any further.

@joe-keen
Author

I'm still able to make it segfault consistently in my Vagrant environment, even with ujson replaced with json. I can't seem to get a viable core file out of it; most of them are truncated for some reason. With my most recent Vagrant build, using Kafka version kafka_2.9.2-0.8.1.1, I was unable to get a segfault initially. I placed the test script I provided in a bash while loop, and after a short while it was segfaulting on every run. That behaviour remains after letting the Vagrant environment sit idle overnight.

The fd leak you fixed yesterday looks like it would be pretty minor if I'm reading it correctly. Do you think it's worth testing the kafka-python client with that fix to see if the segfaults go away?

If you want to try using the same environment you can use it with the following steps:

  1. Clone https://github.com/openstack/monasca-api
  2. cd to monasca-api/devstack
  3. vagrant up
  4. Wait about 20 minutes
  5. Once the Vagrant environment is loaded, Kafka and ZooKeeper are up and running, and your home directory is mounted under /vagrant_home.
  6. To trigger the segfaults, I simply create a virtual environment, install kafka-python 1.0, and run the test script I provided.

@dpkp
Owner

dpkp commented Feb 18, 2016

The fd fix only affects the new classes. You are testing with the old classes, so I don't think it will matter. You might try installing with PR 557 though. I believe that should at least improve the performance you are seeing.

@joe-keen
Author

I captured a stack trace that shows the program segfaulting while trying to construct a new tuple. At the time it segfaulted it had allocated slightly over 9GB of RAM, so it looks like an out-of-memory error. If I can figure out how to get the gdb Python extensions installed in my environment, I'll be able to give more information.

The stack trace for what it's worth:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000537388 in _PyObject_GC_Malloc ()
(gdb) where
#0 0x0000000000537388 in _PyObject_GC_Malloc ()
#1 0x000000000052b59d in PyTuple_New ()
#2 0x000000000049a130 in PyEval_EvalFrameEx ()
#3 0x00000000004a090c in PyEval_EvalCodeEx ()
#4 0x0000000000499a52 in PyEval_EvalFrameEx ()
#5 0x00000000004a090c in PyEval_EvalCodeEx ()
#6 0x000000000049ab45 in PyEval_EvalFrameEx ()
#7 0x00000000004a090c in PyEval_EvalCodeEx ()
#8 0x000000000049ab45 in PyEval_EvalFrameEx ()
#9 0x00000000004a090c in PyEval_EvalCodeEx ()
#10 0x000000000049ab45 in PyEval_EvalFrameEx ()
#11 0x00000000004a090c in PyEval_EvalCodeEx ()
#12 0x000000000049ab45 in PyEval_EvalFrameEx ()
#13 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#14 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#15 0x00000000004a090c in PyEval_EvalCodeEx ()
#16 0x0000000000499a52 in PyEval_EvalFrameEx ()
#17 0x00000000004a090c in PyEval_EvalCodeEx ()
#18 0x000000000049ab45 in PyEval_EvalFrameEx ()
#19 0x00000000004a090c in PyEval_EvalCodeEx ()
#20 0x0000000000499a52 in PyEval_EvalFrameEx ()
#21 0x00000000004a090c in PyEval_EvalCodeEx ()
#22 0x000000000049ab45 in PyEval_EvalFrameEx ()
#23 0x00000000004a090c in PyEval_EvalCodeEx ()
#24 0x0000000000499a52 in PyEval_EvalFrameEx ()
#25 0x00000000004a090c in PyEval_EvalCodeEx ()
#26 0x0000000000499a52 in PyEval_EvalFrameEx ()
#27 0x00000000004a2d2e in ?? ()
#28 0x000000000049990f in PyEval_EvalFrameEx ()
#29 0x0000000000499ef2 in PyEval_EvalFrameEx ()
#30 0x00000000004a1634 in ?? ()
#31 0x000000000044e4a5 in PyRun_FileExFlags ()
#32 0x000000000044ec9f in PyRun_SimpleFileExFlags ()
#33 0x000000000044f904 in Py_Main ()
#34 0x00007ffff7818ec5 in __libc_start_main (main=0x44f9c2 <main>, argc=2, argv=0x7fffffffe5c8,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe5b8) at libc-start.c:287
#35 0x0000000000578c4e in _start ()

It looks like it's running into the Out Of Memory killer, which explains why you can't reproduce it. My Vagrant VM has fairly limited resources.

[ 5285.531613] Out of memory: Kill process 16916 (python) score 759 or sacrifice child
[ 5285.534207] Killed process 16916 (python) total-vm:9823872kB, anon-rss:9717040kB, file-rss:0kB

Running top alongside my test script, I can see it allocate several hundred MB every five seconds or so.
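
(If it helps to quantify that growth from inside the process rather than from top, here is a small standard-library sketch; `consumer` is assumed to be the SimpleConsumer from the test script.)

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def consume_with_memory_report(consumer, batches=100, batch_size=50000):
    """Drain batches from the consumer, printing peak RSS after each one."""
    print('start: peak RSS %.1f MB' % peak_rss_mb())
    for batch in range(batches):
        consumer.get_messages(count=batch_size, timeout=30)
        print('after batch %d: peak RSS %.1f MB' % (batch, peak_rss_mb()))
```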

@dpkp
Owner

dpkp commented Mar 17, 2016

@joe-keen 1.0.2 was just released. If you have time, you might test it out and see if you see the same problems.

@dpkp
Owner

dpkp commented May 22, 2016

I am unable to reproduce -- please reopen if you see similar behavior on the latest release.

@dpkp dpkp closed this as completed May 22, 2016
tanaypf9 pushed a commit to tanaypf9/pf9-requirements that referenced this issue May 20, 2024
kafka-python version 1.0 was just released and has several issues.
There are performance regressions. See,

    dpkp/kafka-python#551

and the interface has changed.

Safest to set the upper bound to < 1.0 until issues are resolved.

Change-Id: I5921d1e7659fd41cff7bd5cddad4db0c015e671b
tanaypf9 pushed a commit to tanaypf9/pf9-requirements that referenced this issue May 20, 2024
Patch Set 1:

Some of the issues I'm concerned about:

dpkp/kafka-python#674

dpkp/kafka-python#686

dpkp/kafka-python#579

dpkp/kafka-python#551

Patch-set: 1