
Flamegraph does not match table data and callgraph #212

Closed
josdagaro opened this issue Oct 19, 2017 · 18 comments · Fixed by #246

Comments

@josdagaro

josdagaro commented Oct 19, 2017

The inclusive wall time of the main function is the sum of the time spent in itself plus the time spent in its children. This value is correct in the flamegraph, but the values in the callgraph and table data don't match it.

@josdagaro
Author

I need help understanding this.

@michaelbutler

I think there is a bug in the flamegraph. Notice how on the main Recent runs page it says 150ms wall time; that's the overall time it took for the whole request (you can confirm this with the Chrome network inspector). The callgraph also shows main() taking 150ms wall time, which is also correct. The flamegraph says 454ms, which is not correct. The reason is that it adds the total time (150ms) to the total time of all children, greatly inflating the value.

Bug ticket opened here #207 though I have not had a chance to fix it.

@michaelbutler

Side note: check your server charset settings; it looks like you may be serving ISO-8859-1 instead of UTF-8.

@josdagaro
Author

I'm confused. If the flamegraph is incorrect, how is the total wall time calculated?

@josdagaro
Author

The inclusive wall time of the main function is the sum of all its child functions, right? So why is the sum of the self wall times of its children not equivalent to the inclusive wall time of the main function?

@markstory
Member

So why is the sum of the self wall times of their children not equivalent to the inclusive wall time of the main function?

It should be close, but it won't be exact, as xhprof is not able to capture 100% of every operation; there is always a bit of slippage/loss.

@josdagaro
Author

josdagaro commented Oct 20, 2017

Check this:
[screenshots 1-6]
Can you see that, to me, the most logical graph is the flamegraph? Its sum seems correct, but I do not understand why the time in the callgraph is different:

Flamegraph: CI_DB_driver::simple_query (44,176 ms) + pg_connect (20,067 ms) = CI_DB_postgre_driver::db_connect (64,243 ms)

Callgraph: CI_DB_driver::simple_query (44,176 ms) + pg_connect (20,067 ms) = CI_DB_postgre_driver::db_connect (32,229 ms)

More information:

[screenshot 7]

Please help me understand this.

@michaelbutler

First you need to find out what the true, correct inclusive wall time values really are.

Your data (the index table, which should be the correct thing) is reporting that db_connect took 32ms inclusive wall time; that means it took 32ms if you did something like this:

$t = microtime(true);
$whatever->db_connect();
error_log(microtime(true) - $t);

Most likely that is correct, but please confirm by actually running the snippet above and logging the value with error_log. Next, use the same snippet to get the overall time of your request:

$t = microtime(true);
MyApplication::run(); // whatever the main entry point to your app is that does everything
error_log(microtime(true) - $t);

@markstory
Member

You could also look at the data in MongoDB. If xhprof has recorded the data 'wrong', we can never make the page look right.

@josdagaro
Author

josdagaro commented Oct 23, 2017

I'm very confused. Is it possible for "self wall time" to be greater than "inclusive wall time"?

@michaelbutler

Inclusive wall time will always be greater than or equal to self wall time.

@josdagaro
Author

Oh, right, I misread it; I thought the time was 468,000 but it is just 468. My mistake.
OK, I checked this page from a third party (xhprof) and the results are the same.

@josdagaro
Author

josdagaro commented Oct 23, 2017

Is the wall time sum in the flamegraph the same as in the callgraph?
That is, does the callgraph sum wall time in the same way as the flamegraph?

@markstory
Member

Is the wall time sum in the flamegraph the same as in the callgraph?

It should be. The callgraph and flamegraph use the same data under the hood.

@Krinkle
Contributor

Krinkle commented Sep 16, 2018

@markstory wrote at #216 (comment):

Last I remember there were different display issues with these changes. I haven't had much time to dedicate to this project as of late though.

@markstory Yes, the display issues are made apparent by #216; however, the issue is pre-existing.

In my opinion, the current flame graph implementation is almost useless and highly deceptive. The only reason it doesn't look broken is that various bugs cancel each other out. But a close look at any part of the stack exposes its information as nonsensical.

I looked at this earlier when we talked in #216, but after another very confusing debugging round today, I realised the problem is much bigger than I thought. Below is an example:

Actual call tree

(Unbeknownst to xhprof)

  • main() - [wt: 1000ms]
    • 1x smallStuff() [wt: 100ms]
      • 1x process('small') [wt: 100ms]
        • 10x processOne() [wt: 100ms, 10ms each]
    • 1x bigStuff() [wt: 900ms]
      • 1x process('big') [wt: 900ms]
        • 90x processOne() [wt: 900ms, 10ms each]

stacks

From xhprof (unsampled)

main() => smallStuff()    [ c: 1, wt: 100ms ]
main() => bigStuff()      [ c: 1, wt: 900ms ]
smallStuff() => process() [ c: 1, wt: 100ms ]
bigStuff() => process()   [ c: 1, wt: 900ms ]
process() => processOne() [ c: 100, wt: 1000ms ]
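The pairwise records above are essentially the shape of XHProf's raw output: an array keyed by "parent==>child" strings, with a call count (ct) and inclusive wall time (wt) per pair. A minimal sketch (Python used for illustration; the real data is a PHP array returned by xhprof_disable()):

```python
# Hypothetical raw XHProf output for the call tree above: one record
# per parent==>child pair, with call count ("ct") and inclusive wall
# time ("wt"). There is no record of full call stacks, only adjacent
# pairs.
xhprof_data = {
    "main()": {"ct": 1, "wt": 1000},
    "main()==>smallStuff()": {"ct": 1, "wt": 100},
    "main()==>bigStuff()": {"ct": 1, "wt": 900},
    "smallStuff()==>process()": {"ct": 1, "wt": 100},
    "bigStuff()==>process()": {"ct": 1, "wt": 900},
    "process()==>processOne()": {"ct": 100, "wt": 1000},
}

# All 100 processOne() calls are merged into a single record, so it is
# impossible to tell how many came via smallStuff() vs bigStuff().
total_processOne_calls = xhprof_data["process()==>processOne()"]["ct"]
```

Note how nothing in this data distinguishes the 10 processOne() calls under smallStuff() from the 90 under bigStuff().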

flamegraph-data

XHGui currently converts the above as follows:

            'data' => array(
                'name' => 'main()',
                'value' => 1000,
                'children' => array(
                    array(
                        'name' => 'smallStuff()',
                        'value' => 100,
                        'children' => array(
                            array(
                                'name' => 'process()',
                                'value' => 1000, # bigger than parent
                                'children' => array(
                                    array(
                                        'name' => 'processOne()',
                                        'value' => 1000,
                                    ),
                                ),
                            ),
                        ),
                    ),
                    array(
                        'name' => 'bigStuff()',
                        'value' => 900,
                        'children' => array(
                            array(
                                'name' => 'process()',
                                'value' => 1000 # bigger than parent
                            ),
                        ),
                    ),
                ),
            ),
            'sort' => array(
                'main()' => 0,
                'smallStuff()' => 1,
                'bigStuff()' => 4,
                'process()' => 2,
                'processOne()' => 3,
            )

flamegraph-visual

  • main() – 2000ms
    • smallStuff() – 1000ms
      • process() – 1000ms
        • processOne() – 1000ms
    • bigStuff() – 100ms
      • process() – 1000ms

[screenshot: flamegraph as rendered, 2018-09-16]

I originally thought that the only problem was that D3-flamegraph treats each node's value as self-time (instead of inclusive time, per #207). But that is not exactly the case.

If D3-flamegraph were treating each node's value as self-time, then smallStuff/process would be computed as 2000ms (1000ms self plus 1000ms from its children), and smallStuff as 3000ms, etc.

In actuality, D3-flamegraph throws away 90% of all timing data in favour of summing only the leaf nodes. This means that time spent in any method anywhere in the system is not represented in the graph unless that method is a leaf node.

Of course, the problem is not that D3-flamegraph uses the wrong formula for aggregating self-time. The problem is that it tries to aggregate at all, because the XHProf data is not self-time; it is already aggregated.
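To make the leaf-only summation concrete, here is a small Python sketch of what the old D3-flamegraph effectively computes over the exported tree above (an illustration of the behaviour, not XHGui's or D3's actual code):

```python
# The tree XHGui exports (values as shown above, in ms).
tree = {
    "name": "main()", "value": 1000, "children": [
        {"name": "smallStuff()", "value": 100, "children": [
            {"name": "process()", "value": 1000, "children": [
                {"name": "processOne()", "value": 1000},
            ]},
        ]},
        {"name": "bigStuff()", "value": 900, "children": [
            {"name": "process()", "value": 1000},
        ]},
    ],
}

def displayed_value(node):
    """What the old D3-flamegraph renders: for a leaf, its own 'value';
    for an inner node, the sum of its descendant leaves only."""
    children = node.get("children")
    if not children:
        return node["value"]
    return sum(displayed_value(c) for c in children)

print(displayed_value(tree))  # 2000 -- double the real 1000ms for main()
```

The 'value' stored on every non-leaf node (including the correct 1000ms on main()) is simply ignored.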

Problems with old D3-flamegraph:

  • It discards all time spent in any method, except for leaf nodes.
  • It replaces the time for each method with the sum of the descendant leaf nodes.
  • The time spent in main() is often 10x or 1000x larger than it should be.
  • The overall proportions are incorrect. Slow code can appear fast, and fast code can appear slow. If the problem were only that it was doing self-time aggregation (and doing it correctly), it would at least have correct overall proportions (although the values would be exaggerated).

Problems with flamegraph-data export:

  • Whenever a duplicate parent>child pair is seen, only the first encounter (seemingly at random) gets a child stack; the rest have no children.
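That last problem is visible in the exported tree above: process() appears under both smallStuff() and bigStuff(), but only the first occurrence received its processOne() child. A Python sketch of this flawed expansion (build_tree and the deterministic "first encountered wins" rule are my modelling assumptions; the real code's choice appears random):

```python
# parent -> child pairs, as recorded by XHProf (counts omitted).
pairs = [
    ("main()", "smallStuff()"),
    ("main()", "bigStuff()"),
    ("smallStuff()", "process()"),
    ("bigStuff()", "process()"),
    ("process()", "processOne()"),
]

def build_tree(root, pairs):
    """Flawed expansion: a function's children are attached only to the
    first encountered occurrence of that function; repeats get none."""
    seen = set()
    def expand(name):
        node = {"name": name}
        if name in seen:          # repeated function: no children attached
            return node
        seen.add(name)
        kids = [c for (p, c) in pairs if p == name]
        if kids:
            node["children"] = [expand(c) for c in kids]
        return node
    return expand(root)

tree = build_tree("main()", pairs)
small_process = tree["children"][0]["children"][0]  # process() under smallStuff()
big_process = tree["children"][1]["children"][0]    # process() under bigStuff()
# Only the first process() got its processOne() child; the second is bare,
# matching the exported data shown earlier.
```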

@markstory
Member

markstory commented Sep 17, 2018

I'm OK with removing the flamegraph output entirely. If it's not helpful and can't be made helpful, there is no point in keeping it.

@Krinkle
Contributor

Krinkle commented Sep 17, 2018

@markstory Yeah, I'm sad to see it go, but in the short-to-medium term we should probably look into the callgraph visualisation instead.

The callgraph is a bit harder to navigate, but it is true to the data and highly accurate. It is also useful in ways that flamegraphs can't be, such as visually emphasising methods that are called often and take a lot of time but are called from different sub-trees. The callgraph shows that combined time really well.

Would you accept a PR that removes the component?

@markstory
Member

Would you accept a PR that removes the component?

Yes 👍

Krinkle added a commit to Krinkle/xhgui that referenced this issue Sep 23, 2018
This unfortunately never worked correctly due to a fundamental
limitation of the XHProf data format: it only records metadata
per parent-child method combination, so it cannot be used to
build a complete call tree.

The feature was added in pull request perftools#177. More
information about this limitation is described in detail at
perftools#219, perftools#216 and perftools#212.

A brief summary:

* The visualisation showed callstacks that did not actually exist,
  and was missing callstacks that did exist. (Due to assuming
  that every combination of a parent-child pair is valid, and
  due to it randomly assigning repeated calls to the first
  encountered parent.)

* The visualisation discarded all timing values from XHProf,
  except for the timing of leaf nodes (methods without children),
  which were then added up recursively. The end result was a
  visually well-balanced tree, but with timing values that were
  not related to the actual performance (up to 100x inflated), and
  the proportions were incorrect as well, making some code look
  fast instead of slow, and vice versa.

These are inherent problems that cannot be solved because the
information logically required to make a flamegraph (call stacks)
is not collected by XHProf.

This closes perftools#216, perftools#212, perftools#211, perftools#207.
This fixes perftools#212.
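The fundamental limitation can be made concrete: two different call trees can collapse to identical parent==>child records, so no converter can recover the real tree from the profile. A hypothetical Python sketch (pair_profile is an illustrative helper, not XHProf code):

```python
def pair_profile(tree, parent=None, acc=None):
    """Aggregate a call tree into XHProf-style parent==>child call counts."""
    if acc is None:
        acc = {}
    name, children = tree
    key = name if parent is None else f"{parent}==>{name}"
    acc[key] = acc.get(key, 0) + 1
    for child in children:
        pair_profile(child, name, acc)
    return acc

# Tree A: d() is reached via a()->c(); b()'s c() has no children.
tree_a = ("main", [("a", [("c", [("d", [])])]), ("b", [("c", [])])])
# Tree B: d() is reached via b()->c() instead.
tree_b = ("main", [("a", [("c", [])]), ("b", [("c", [("d", [])])])])

# Both trees collapse to the same pairwise records, so the full call
# stacks (needed for a flamegraph) cannot be recovered.
print(pair_profile(tree_a) == pair_profile(tree_b))  # True
```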