[optimization] Use bytes slice pool for allocations #278

Closed
4 tasks
Wondertan opened this issue Apr 20, 2021 · 10 comments

Comments

@Wondertan
Member

Context

We allocate plain byte slices extensively throughout the project repos and, after some work with them, discard the allocated slices for the GC to clean up. In some hot paths, like the whole block processing flow, this puts additional pressure on both the allocator and the GC. Fortunately, there is a relatively simple way to avoid that and reduce allocation stress in general: reuse fixed-size buffers through sync.Pool as the basic primitive. We can also rely on existing libraries with simple APIs rather than reinventing the wheel.
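
Roughly, the primitive looks like the minimal sketch below, using sync.Pool directly with an assumed fixed buffer size (the libraries mentioned under Options wrap the same idea in a friendlier API):

```go
// bufpool is a minimal illustration of the idea, not a proposal for the
// final API. The buffer size is an assumption for the sketch.
package bufpool

import "sync"

// bufSize is an illustrative fixed buffer size (e.g. a share/sample size).
const bufSize = 256

var pool = sync.Pool{
	// New is called only when the pool has no free buffer to hand out.
	New: func() interface{} {
		b := make([]byte, bufSize)
		return &b
	},
}

// Get hands out a reusable fixed-size buffer.
func Get() *[]byte {
	return pool.Get().(*[]byte)
}

// Put returns a buffer to the pool so the next Get reuses it instead of
// allocating a fresh slice that the GC later has to collect.
func Put(b *[]byte) {
	pool.Put(b)
}
```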

Expected Gains

  • Stop-the-world GC pauses take much less CPU time.
  • Much lower RAM usage in long-running executions; in some cases the reduction shows up even in e2e tests.
  • Less pressure on the allocator, as mentioned above.

I would like to provide real numbers, but this is very application-specific, so for us they could be quite different. In my previous project, htop showed almost a 2x RAM usage reduction after applying this in multiple places.

Options:

Has fewer stars and is less used within the Go community, but has a smarter API. We are actually already using it indirectly through IPFS/libp2p, so using it directly would allow us to share allocated buffers between IPFS and LL code, which is generally good. I would stick with this one.

From the gopher who created fasthttp. BTW, he is from Kyiv and I know him personally. He is obsessed with optimizations.

Places to be altered

  • Rsmt2d
  • NMT
  • Un/marshalling of core data types
  • Reactor's msg un/marshalling

pls add more

@Wondertan
Member Author

The issue is related to all the sublibs we have in LL's org, but AFAIK we don't have an established process for creating such root issues, so I decided to keep it in core.

@Wondertan
Member Author

@liamsi, pls attach it to the proper Project. Don't know what to choose

@liamsi
Member

liamsi commented Apr 20, 2021

All this sounds good, but as it's mostly about optimizations, I've attached this to Testnet. We could even do some of it after mainnet.

AFAIK we don't have an established process for creating such root issues, so I decided to keep it in core.

For similar cases that span multiple repositories, I've used cards in the projects (e.g. see the first entry in https://github.com/orgs/lazyledger/projects/4 or in https://github.com/orgs/lazyledger/projects/6).

In some hot paths, like whole block processing flow, this surely causes additional pressure on both allocator and GC.

IMO, the first step should be to actually measure this via a profiler and benchmarks to see if that actually has any impact, especially compared to overall system latency. I'd expect the main bottleneck will be IO / network comms (it always is), and the gains from low-level optimizations like the ones mentioned are often rather minuscule in the grand scheme of things. We also need to measure network latencies via our ipld experiments (or via the test framework Protocol Labs developed). I think measuring latencies for retrieving block data as well as sampling DA proofs has much higher priority (e.g. see: https://github.com/lazyledger/ipld-plugin-experiments/issues).
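
For concreteness, a measurement could start as simple as a Go benchmark with allocation reporting plus a memory profile; `EncodeBlockData` and `makeTestBlockData` below are hypothetical stand-ins for whichever hot path we decide to profile:

```go
package blockbench

import "testing"

// EncodeBlockData and makeTestBlockData are hypothetical placeholders for
// the real hot-path function and its test fixture.
func BenchmarkEncodeBlockData(b *testing.B) {
	data := makeTestBlockData()
	b.ReportAllocs() // report allocs/op and B/op alongside ns/op
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		EncodeBlockData(data)
	}
}
```

Running it with `go test -bench=BenchmarkEncodeBlockData -benchmem -memprofile mem.out` and inspecting `go tool pprof mem.out` would show whether slice allocations actually dominate.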

@liamsi
Member

liamsi commented Apr 20, 2021

rsmt2d is a bit different in this regard, as we are currently refactoring it to streamline the API and make it production-ready. So if you have good ideas there, you can sync with @adlerjohn, who has recently added benchmarks.

@Wondertan
Member Author

For the new rsmt2d API, we can add Close/Destruct/Free-style semantics so a user can free (return to the pool) resources once processing has ended.
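
Roughly something like this; the type, field, and pool here are assumptions for illustration, not the actual rsmt2d API:

```go
package rsmt2dsketch

import "sync"

// sharePool and ExtendedDataSquare are stand-ins for illustration only.
var sharePool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 256) // assumed fixed share size
		return &b
	},
}

type ExtendedDataSquare struct {
	shares []*[]byte // fixed-size share buffers taken from sharePool
}

// Close returns every share buffer to the pool. The square must not be
// used afterwards, since the buffers may be handed out to other callers.
func (eds *ExtendedDataSquare) Close() {
	for i, s := range eds.shares {
		if s != nil {
			sharePool.Put(s)
			eds.shares[i] = nil // drop the reference so misuse fails loudly
		}
	}
	eds.shares = nil
}
```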

@Wondertan
Member Author

All this sounds good, but as it's mostly about optimizations

Totally agree, but it wouldn't hurt to file an issue for the future.

IMO, the first step should be to actually measure this via a profiler and benchmarks to see if that actually has any impact.

But measuring the difference would require us to implement this first, so it's hard to know the exact gains without doing the actual work here.

I'd expect the main bottleneck will be IO / network comms (it always is)

This is not about bottlenecks but about better resource management within the application, with predictable gains. Imagine how many times we allocate fixed-size slices for samples on every block, when we could allocate them once, park them in a sync.Pool after processing, and reuse them when a new block arrives. And there are other cases like this. Unfortunately, we cannot fully manage memory in Go ourselves (btw, I would love to see a toggleable GC in Go), but we are smart enough to overcome some of the disadvantages of GC.
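
The reuse cycle I mean, as a sketch; sampleSize and processSample are illustrative placeholders:

```go
package sampling

import "sync"

// sampleSize is an assumed fixed sample size, for illustration only.
const sampleSize = 512

var samplePool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, sampleSize)
		return &b
	},
}

// handleBlockSamples processes the samples of one block using pooled
// buffers: each buffer is taken from the pool, filled, processed, and
// returned, so the next block reuses the same memory instead of forcing
// fresh allocations that the GC has to clean up.
func handleBlockSamples(raw [][]byte) {
	for _, sample := range raw {
		buf := samplePool.Get().(*[]byte)
		copy(*buf, sample)
		processSample(*buf) // placeholder for the real sampling logic
		samplePool.Put(buf)
	}
}

func processSample([]byte) {} // stub so the sketch is self-contained
```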

@liamsi
Member

liamsi commented Apr 20, 2021

But measuring the difference would require us to implement this first, so it's hard to know the exact gains without doing the actual work here.

Yeah, you are right about measuring the difference, of course. I'm just saying that we should first profile how much time is spent in GC and how many allocations (of slices) happen.

It's certainly good that you opened an issue about it!

@tzdybal
Member

tzdybal commented Apr 22, 2021

"Premature optimization is the root of all evil." This is definitely a good place to find some CPU cycles and memory, but I totally support @liamsi on this - we can probably do it even after mainnet.

@liamsi
Member

liamsi commented Dec 8, 2021

Is this still relevant? I think @Wondertan went and did this in node anyways?
Moved over to node.

@liamsi liamsi transferred this issue from celestiaorg/celestia-core Dec 8, 2021
@renaynay renaynay changed the title Use bytes slice pool for allocations [optimization] Use bytes slice pool for allocations Apr 26, 2022
@renaynay renaynay added the enhancement New feature or request label Apr 26, 2022
@Bidon15
Member

Bidon15 commented Jul 12, 2022

Grooming 12/07/2022:

This could be a part of a later epic describing the RAM optimization roadmap

@celestiaorg celestiaorg locked and limited conversation to collaborators Jan 22, 2024
@ramin ramin converted this issue into discussion #3127 Jan 22, 2024
