Creating invoice + making payment is slower and slower #1506

I am benchmarking the simplest case, where Alice pays Bob. It seems it takes more and more time. I guess auto clean invoice would solve the issue... but still, it should not be linear.

This is 100 iterations, where each iteration is 16 runs of 3 concurrent payments.

Comments
It could be the number of payments on Alice, not just the number of invoices on Bob.
Ok, so to check this, I created 3 different Alices instead of 3 concurrent payments by the same Alice. There also seems to be some contention in the case of 3 payments by the same Alice at the same time, costing around 100 ms. There is indeed an improvement, but I am surprised it is not a 3x improvement: I would have expected the slope to get 3 times shallower.
Slope: Alice pays Bob: 2495
Possibly the remaining part is Bob getting overloaded too, hence not the three-times improvement. Invoice payment lookup should be O(log n), but payment lookup is probably much less efficient (not as certain; I am more familiar with the invoice side than the payment side, despite practically writing the current …). If you are doing multiple payments in parallel: (see lines 75 to 92 in 11ca729)
Is there a way to purge this data?
The linear list above is purged when a payment resolves: everything waiting for the payment to resolve is removed from the list, leaving only the payments that have not resolved yet. However, there is still a database table on-disk that contains payments, and it is also looked up when a payment resolves (succeeds or fails). The database table has an index on the …. We would do better with concrete profiling information, though; would it be possible to provide that?
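To make the shape of that concrete, here is a minimal sketch of such a purge-on-resolve list; the names and layout are illustrative, not the actual c-lightning code:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative only: an in-memory list of in-flight payments that is
 * scanned linearly and purged whenever a payment resolves. */
struct waiting_payment {
	struct waiting_payment *next;
	unsigned char payment_hash[32];
	void (*cb)(void *arg, int success);
	void *arg;
};

static struct waiting_payment *waiting_head;

/* O(n) in the number of unresolved payments: every resolution walks
 * the whole list, unlinking (purging) the entries that match. */
static void resolve_payment(const unsigned char hash[32], int success)
{
	struct waiting_payment **p = &waiting_head;

	while (*p) {
		if (memcmp((*p)->payment_hash, hash, 32) == 0) {
			struct waiting_payment *w = *p;
			*p = w->next;	/* purge: only unresolved stay */
			w->cb(w->arg, success);
			free(w);
		} else {
			p = &(*p)->next;
		}
	}
}
```

With many concurrent payments, each resolution pays for a full scan, so total work grows roughly quadratically with the number of in-flight payments.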
What do you mean by "waiting for payment to resolve"? I am not only creating invoices here, I am also paying them.
I am creating a project to easily benchmark some scenarios: https://github.com/dgarage/LightningBenchmarks (it will also support other implementations), though GCC profiling would give a better idea about the particular code causing real issues. I can include it in the list of generated artifacts.
Payments are resolved if they are known to have succeeded, or known to have failed or timed out. Note that in C-lightning, a "payment" is an object on the payer that tracks an individual payment attempt (an attempt at getting the preimage for a payment hash in exchange for an amount), while an "invoice" is an object on the payee that has a preimage and its hash, and which is resolved when it releases the preimage in exchange for receiving money. So when I say "waiting for payment to resolve", I mean an object on the payer that is waiting for the routed payment to return a success or failure.
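As a rough illustration of that payer-side lifecycle (names are illustrative, not necessarily the actual identifiers):

```c
/* Sketch of the payer-side payment lifecycle described above. */
enum payment_status {
	PAYMENT_PENDING,	/* routed; awaiting success or failure */
	PAYMENT_COMPLETE,	/* preimage received: resolved */
	PAYMENT_FAILED		/* failed or timed out: resolved */
};

/* "Waiting for payment to resolve" means exactly this predicate. */
static int payment_resolved(enum payment_status s)
{
	return s != PAYMENT_PENDING;
}
```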
Yes, but what I don't understand is that this list of awaiting payments should be purged when …
I probably need to audit the database tables then... maybe we forgot to index something we are using as a key in that table. We are also using raw 32-byte blobs as keys in the database table, so maybe SQLite does not handle those well...
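One quick way to verify the indexing suspicion is SQLite's EXPLAIN QUERY PLAN; a sketch below, where the `payments` table and `payment_hash` column names are assumptions for illustration:

```c
#include <sqlite3.h>
#include <stdio.h>

/* Prints SQLite's plan for the lookup: "SEARCH ... USING INDEX"
 * means an indexed lookup, "SCAN TABLE" means a linear scan. */
static void explain_payment_lookup(sqlite3 *db)
{
	const char *sql = "EXPLAIN QUERY PLAN "
			  "SELECT * FROM payments WHERE payment_hash = ?";
	sqlite3_stmt *stmt;

	if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
		return;
	while (sqlite3_step(stmt) == SQLITE_ROW)
		printf("%s\n", (const char *)sqlite3_column_text(stmt, 3));
	sqlite3_finalize(stmt);
}
```

SQLite handles blob keys fine as long as the column is actually indexed; without an index, every 32-byte-blob lookup is a full table scan.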
@NicolasDorier I've just done a similar rudimentary test with …. So, as @ZmnSCPxj says, this may likely be a problem with indexing or something similar that increases search time as the number of transactions grows.
@cdecker suggested I use perftrace to find out where the issue is. I will do it next week. I never used it though, so if someone can give me pointers it would help me a lot.
@NicolasDorier quick question: does Bob perform any …?
no |
Ok, so I managed to get more info in the stack traces. Sadly, I don't think it helps.
If the issue is related to IO, would we see it in those graphs? I would assume we won't, as the calls are asynchronous.
Looks like a good part is logging. Hmm. That and …
I am wondering if this is the full picture. Because when you make a request to the DB, I don't think you block the thread, so it would not show on the flamegraph, but it would still make the call slow.
We are using sqlite3, which to my understanding has its code execute in the same process as the caller. All our db calls block rather than have a result callback.
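For reference, this is what such a blocking call looks like with the sqlite3 C API; a sketch, with table/column names assumed as before:

```c
#include <sqlite3.h>

/* Blocks in the calling thread until sqlite3 is done: there is no
 * callback and no background thread, so time spent here belongs to
 * the caller's stack in a flamegraph. */
static int payment_exists(sqlite3 *db, const unsigned char hash[32])
{
	sqlite3_stmt *stmt;
	int found = 0;

	if (sqlite3_prepare_v2(db,
			"SELECT 1 FROM payments WHERE payment_hash = ?",
			-1, &stmt, NULL) != SQLITE_OK)
		return 0;
	sqlite3_bind_blob(stmt, 1, hash, 32, SQLITE_STATIC);
	found = (sqlite3_step(stmt) == SQLITE_ROW);	/* synchronous */
	sqlite3_finalize(stmt);
	return found;
}
```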
Trying to get more data points... running the script for longer and taking flamegraphs along the way... hopefully we can also see what is taking more and more time.
I am also concerned about why I don't see any call trace in …
I let it run for like 10-15 minutes. At the beginning (Alice1 and Bob1), 7 concurrent payments took 700 ms; at the end (Alice3 and Bob3) they took 3 seconds. By opening the traces in different browser tabs and switching via CTRL+TAB, you can see the evolution over time for Bob1, Bob2, Bob3, or Alice. I am disappointed. While … Also, why can't I see any calls to the RPC methods?
The logging subsystem was never this taxing in any of my traces; that's surprising. Is the same true if you redirect to …? What bothers me is that the …
Yes, that surprises me as well. I will turn off logging and run again. (It was debug.)
I am using …
Ok so, using …: my CPU and IO are mainly unused. Log stuff is still taking lots of time. I am running … I will also remove … Those are the last traces... and they are pretty boring, almost identical to the previous ones, with a bit less time on logging. Will continue tomorrow...
Ok. So I can't really remove …. However, I refactored my code to limit the number of times I connect to a socket.
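lightningd serves JSON-RPC over a UNIX-domain socket, so a connection can be opened once and reused for every request instead of paying the connect cost per call. A rough sketch of the connect-once pattern (the socket path is illustrative):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Open one connection to lightningd's JSON-RPC socket and keep the
 * fd for the whole benchmark run instead of reconnecting per call. */
static int rpc_connect(const char *path /* e.g. ".../lightning-rpc" */)
{
	struct sockaddr_un addr;
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```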
I still present my findings here: the linear issue outlined is still unsolved. I let Alice and Bob mind their own business for 30 minutes while taking a flamegraph every 5 minutes. My findings are: log functions are suspiciously taking way more time than they should, even though I run with …
I could not pinpoint any big issues aside from #1518. Also, a huge amount of time is spent in …. I have still not identified the linear time issue. Here are the latest flamegraphs I ran.
I am using alpine3.7 with some hacks for glibc compat.
However, still a consistent increase... I tried to run those tests on a separate VM in case it was my PC's issue, but no luck: same result.
We create a number of child processes: … Perhaps some library is using threads or processes under the hood. But in any case, it seems we will not get any information from the …. The fact that concurrency=1 still has a slight slope implies some of the time is spent in some linear search of all payments or invoices.
I suddenly remembered that we also …
If so, why do you call …?
I am starting to wonder if the reason I don't find anything relevant increasing in the flamegraph is that the slowdown is coming from inside the …. (Also tried to run a flamegraph on an idle Alice: no __clone.)
@ZmnSCPxj I inserted a printf to see all bitcoin-cli commands. This is not the problem; no command is sent during the run.
I throw in the towel. Let me know if I can try anything else to find out more info. Feel free to close this issue, unless you have an idea where I can dig more about this.
Okay. I will try to find some time to analyze your gathered data in detail. This makes me wonder if there are some threads going on in the background in some library we are using; I begin to suspect sqlite3. Hmm.
Is there a way to hook …?
Tried to take a look with htop while it is running, hoping to find more. I don't think I can see the threads correctly with …
I just don't understand... I added a breakpoint through gdb on …
It is possible that the perftrace gets confused with regard to OS calls. OS calls might be using a completely separate stack, for example, so it appears to root from the original …. An interesting tidbit: (see lightning/ccan/ccan/timer/timer.c, line 198 in 8d64145)
Another thing fatter in … So it seems to me that a good part of the delay on Alice is indeed logging, due to the …. Indeed, under …
So it seems we can gain some nice improvement by filtering logs at the child level instead of reporting back to …. I noticed with …
Left column: Alice; right column: Bob. Vertical axis: time (one capture every 5 minutes). Full resolution at https://aois.blob.core.windows.net/public/CLightningBenchmarkResults/traces.jpg
Raw data on all the tests I did.
(See lightning/channeld/full_channel.c, lines 95 to 124 in 7f43771.)
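A hedged reading of what such a hot spot looks like: a per-HTLC debug dump executed on every channel update, so with n in-flight payments each update does O(n) work and the whole run O(n²). The names below are illustrative, not the linked code:

```c
#include <stdio.h>

struct htlc_view {
	unsigned long long id;
	unsigned long long msat;
	int state;
};

/* Illustrative: if a trace like this fires on every HTLC change,
 * logging cost grows linearly with the number of in-flight HTLCs. */
static void dump_htlcs(const struct htlc_view *h, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fprintf(stderr, "htlc %llu: %llu msat, state %d\n",
			h[i].id, h[i].msat, h[i].state);
}
```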
Thank you @Saibato, that is probably indeed the cause of the linear increase of processing time proportional to the number of payments done in parallel. Probably the best thing to do in response is to propagate the loglevel to the daemons somehow, and to suppress the call to …
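A minimal sketch of that suppression idea, assuming a log level handed down from the parent daemon; the identifiers here are illustrative, not c-lightning's actual status API:

```c
#include <stdarg.h>
#include <stdio.h>

enum log_level { LOG_IO, LOG_DBG, LOG_INFO, LOG_UNUSUAL, LOG_BROKEN };

/* Set once at startup from a value propagated by the parent. */
static enum log_level child_log_level = LOG_INFO;

/* Early-out before formatting: below-threshold traces are never
 * built, and never written back over the status pipe. */
static void child_trace(enum log_level lvl, const char *fmt, ...)
{
	va_list ap;

	if (lvl < child_log_level)
		return;
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);	/* stand-in for the status pipe */
	va_end(ap);
}
```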
Filtering at subdaemon level is not a good idea, actually. |
Will try today, sorry for the delay!
@Saibato it worked. I just commented out …
Closing this as the problem is identified. Moving conversation to #1530 |
Just reporting the landscape now: seems better! Still can't get past 20 tx per second, but at least it is stable. It does not seem to be bottlenecked by IO. Will play more with it, to see if there are some low-hanging fruits.