
doc: design document for the border router redesign #4339

Merged
merged 8 commits on Jun 1, 2023

Conversation

rohrerj
Contributor

@rohrerj rohrerj commented Apr 20, 2023

This pull request contains the design document for the border router redesign. See proposal #4334.



Contributor

@matzf matzf left a comment


Thanks for the design document! Sorry for the many comments; the idea is that the more things we discuss here in detail, the less we'll need to iterate during the implementation.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @rohrerj)


doc/BorderRouter.md line 1 at r1 (raw file):

# Border Router Redesign

Format nitpicks:

Could you use the TEMPLATE.rst, please? For context, see the preview of the still not finalized #4325.
With

Regarding the title, I'd prefer something slightly more specific.


doc/BorderRouter.md line 37 at r1 (raw file):

Then the receiver enqueues the packet to that processing routine.
If the queue of that processing routine is full, the packet will be dropped.
Each receiver has a pool of preallocated packet buffers that they can use to store the packets they

It's not clear to me that having a separate "pool" for each receiver is really beneficial. The size of this pool should be just big enough to ensure that the reader is never starved for packet buffers. For this, it needs to contain as many packets as can be in-flight through the other components plus a read batch size.
If we use a separate pool for each reader, each one of these needs to be big enough. There is also a slight complication in the housekeeping, as the packet buffers need to be returned to the correct pool.
With a shared pool, the total number of pre-allocated packets can be significantly smaller.

The individual pools could turn out advantageous if we want to consider memory locality, in particular on NUMA systems. However, as fixing goroutines/threads to cores is not currently in scope, this is not applicable.

One more consideration in favor of a shared pool: it becomes quite simple to keep batches of packet buffers directly in the pool; the reader can pop B packets in one go from the pool; the processor and forwarder's queues necessarily consist of single packets though. The forwarder can easily re-batch the packets before adding them back to the pool. While this is also possible with separate pools, it would need more state in the writer and generally even more pre-allocated packet buffers.
My gut feeling is that this batching would easily make up for the reader contention on the pool.

Code quote:

Each receiver has a pool of preallocated packet buffers that they can use to store the packets they
receive.
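
To make the shared-pool idea above concrete, here is a minimal sketch in Go, assuming the pool is a buffered channel and that readers take buffers in batches; the type names and the buffer size are illustrative, not part of the design document.

    package main

    const bufSize = 9000 // assumed maximum packet buffer size, for illustration only

    type bufferPool struct {
        free chan []byte
    }

    func newBufferPool(n int) *bufferPool {
        p := &bufferPool{free: make(chan []byte, n)}
        for i := 0; i < n; i++ {
            p.free <- make([]byte, bufSize)
        }
        return p
    }

    // getBatch hands out up to len(dst) free buffers to a reader in one go.
    func (p *bufferPool) getBatch(dst [][]byte) int {
        n := 0
        for n < len(dst) {
            select {
            case b := <-p.free:
                dst[n] = b
                n++
            default: // pool temporarily empty; the reader works with what it got
                return n
            }
        }
        return n
    }

    // put returns a buffer to the shared pool once the forwarder has written it out.
    func (p *bufferPool) put(b []byte) {
        p.free <- b[:cap(b)]
    }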

doc/BorderRouter.md line 39 at r1 (raw file):

Each receiver has a pool of preallocated packet buffers that they can use to store the packets they
receive.
This can be implemented as a go channel of packets where the receiver reads a certain amount of packets,

For the pool, it seems that a stack (LIFO) could work better than a channel/queue (FIFO). If we overestimate the pool size (which is likely), a queue forces packet buffers to "cool down" between uses, instead of keeping a smaller subset of the buffers "warm" in the caches.
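
For comparison, a LIFO free list could look roughly like the following sketch; it only illustrates the cache-warmth argument, and the names are made up.

    package main

    import "sync"

    type bufferStack struct {
        mu   sync.Mutex
        bufs [][]byte
    }

    // get pops the most recently returned buffer, which is the one most likely
    // to still be warm in the caches.
    func (s *bufferStack) get() []byte {
        s.mu.Lock()
        defer s.mu.Unlock()
        if len(s.bufs) == 0 {
            return nil // caller must handle an empty pool
        }
        b := s.bufs[len(s.bufs)-1]
        s.bufs = s.bufs[:len(s.bufs)-1]
        return b
    }

    // put pushes a buffer back on top of the stack.
    func (s *bufferStack) put(b []byte) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.bufs = append(s.bufs, b)
    }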


doc/BorderRouter.md line 70 at r1 (raw file):

### Mapping of processing routines

To prevent any packet reordering on the fast-path, we map the tuple of source and flowID, see the

Is "source" both ISD-AS and local address?

A flow is defined by source, destination and flow ID. Why not include the destination address here as well? As the destination address comes before the source in SCION packets, this would not increase the size of the packet "prefix" inspected before the actual processing.

Also, I would probably include the DT/DL/ST/SL field in the hash to avoid any potential for type confusion (I can't think of a way how this would be exploited, but the additional cost of including these 8 bits seems small enough to just add it out of caution).

Btw. how do we extract the information for the hash input from the packet? The slayers.SCION layer does more than what we need and I guess that every bit of processing overhead in the reader routines hurts.


doc/BorderRouter.md line 72 at r1 (raw file):

To prevent any packet reordering on the fast-path, we map the tuple of source and flowID, see the
[SCION header documentation](https://github.com/scionproto/scion/blob/master/doc/protocols/scion-header.rst),
to a fixed processing routine using a hash function together with a random value which is generated

Which hash function?

💯 for the random value.


doc/BorderRouter.md line 75 at r1 (raw file):

on startup to prevent pre-computations of the exact mapping.

### Slow path

Not really "slow" path, but still sort of related; should/could we separate the processing of received BFD messages into a separate queue to really not drop these even if the router is very busy? Or is this a bad idea? I can't quite make up my mind. 😜

Btw, sending BFD messages currently just does a WriteTo on the socket (bfdSend.Send). This will probably not be good for the throughput of the forwarder routines. Perhaps the BFD senders should have access to the packet buffer pool and, instead of calling WriteTo, insert the packet into the queue of the correct forwarder routine.
FWIW, I think this can be addressed in a separate follow up step.
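
A hypothetical sketch of what enqueuing BFD messages into a forwarder queue (instead of a direct WriteTo) could look like; bfdSender, the pool and the queue here are illustrative stand-ins, not the router's actual types.

    package main

    type packet struct {
        buf []byte
        n   int // length of the serialized message in buf
    }

    type bfdSender struct {
        pool  chan []byte // shared packet buffer pool
        queue chan packet // queue of the forwarder routine that owns the socket
    }

    // send serializes a BFD message into a pooled buffer and enqueues it,
    // dropping the message (and recycling the buffer) if the queue is full.
    func (s *bfdSender) send(serialize func([]byte) int) {
        buf := <-s.pool
        n := serialize(buf)
        select {
        case s.queue <- packet{buf: buf, n: n}:
        default:
            s.pool <- buf
        }
    }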


doc/BorderRouter.md line 78 at r1 (raw file):

During processing, packets that have to follow the slow path are identified and forwarded to the
slow-path processing routines.

This is not clear enough for me. How does this hand-off work? I can imagine that the fast-path would bail out with some processingResult (or error) indicating that the packet needs to be resubmitted to the slow path. This could include some details on the problem (e.g. including SCMP error type and pointer), so that the slow-path doesn't need to repeat all the processing. We need to be careful though to share only data that can be copied into the queue, without causing allocations.
I think there would then be two variants of the scionPacketProcessor, one fast and one slow.

Please describe in more detail how this mechanism would look.
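
One possible shape of this hand-off, sketched under the assumptions above (a small, copyable result plus the packet buffer); the names and fields are illustrative only.

    package main

    // slowPathRequest carries only value-typed context plus the packet buffer,
    // so that handing it over causes no allocations.
    type slowPathRequest struct {
        buf      []byte // ownership of the buffer moves to the slow path
        scmpType uint8  // SCMP error type to generate
        scmpCode uint8  // SCMP error code
        pointer  uint16 // offset into the offending packet, if applicable
    }

    // handOff enqueues a packet for the slow path; if the slow path is
    // overloaded, the packet is dropped and its buffer returned to the pool.
    func handOff(slowQ chan slowPathRequest, req slowPathRequest, pool chan []byte) {
        select {
        case slowQ <- req:
        default:
            pool <- req.buf
        }
    }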


doc/BorderRouter.md line 102 at r1 (raw file):

An optimal value should be derivable from the maximum latency of processing and forwarding a packet,
the speed of the network interfaces and the number of available CPU cores.
In the future we might add the possibility to automatically estimate this number.

As commented above, I'd rather go with a shared pool. The pool needs to be just big enough to ensure that the readers are not starved for packet buffers. We can quite easily ensure this by computing the max number of packets in-flight through the system. Roughly:

pool_size := numReaders * readBatch + numProcessors * (processorQueueSize + 1)  + numWriters * (writerQueueSize + writeBatchSize) // + slow path  + bfd processors + ... ?

Possibly it could make sense to allow an upper bound for this on memory constrained systems -- however, even with rather large numbers (e.g. 100 interfaces, 32 processors) I only get fifty megabytes for this. That's not insignificant of course, but I'd argue that this is acceptable for the types of system that will be used to run such a setup.
I think we can remove this knob for now.
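
For reference, the formula above transcribed into Go; the slow-path and BFD terms are left out, as in the comment, and all names are placeholders.

    package main

    func poolSize(numReaders, readBatch,
        numProcessors, processorQueueSize,
        numWriters, writerQueueSize, writeBatchSize int) int {

        return numReaders*readBatch +
            numProcessors*(processorQueueSize+1) +
            numWriters*(writerQueueSize+writeBatchSize)
        // + slow path + bfd processors + ... (omitted here)
    }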


doc/BorderRouter.md line 109 at r1 (raw file):

to process packets in parallel.
An optimal value should be derivable from the maximum latency of processing and forwarding a packet,
the speed of the network interfaces and the number of available CPU cores.

Does it ever make sense to have more processing routines than the number of cores? These processors don't do I/O (other than interact with their input/output queues) or any other blocking operations. If there are more, they'd just be "swapping".


doc/BorderRouter.md line 116 at r1 (raw file):

process the packets on the slow-path.
An optimal value could be a percentage of the number of processing routines or even a fixed number.
In the future we might add the possibility to automatically estimate this number.

I think a sensible default would be 1.


doc/BorderRouter.md line 122 at r1 (raw file):

Since each processing routine has a queue of packets to process and all packets not fitting in the queue
are dropped, one has to specify a reasonable queue size.
A default value could be 1024.

Recall that this is a userspace program and that we already have buffers on the underlying OS network stack both on the receive and transmit sides. We don't need to, and shouldn't, add more than minimal internal buffering.

The internal buffers should be big enough to accommodate bursts caused by the internal structure, but not bigger. For the processing routines, the queue should be able to hold a read batch, and the sum of the queue sizes of the processing goroutines should be big enough to contain the read batches from all the reader routines.
The writer routines' queue should only be big enough for a single batch write.

I would not expose this as a configuration option. Operators should use the knob of the OS network stack first.


doc/BorderRouter.md line 137 at r1 (raw file):

The CPU affinity by locking the goroutines to threads and CPU cores can later be studied and would become
much easier by applying this design because we have more goroutines and hence more flexibility how we
want to fix them to the cores.

That doesn't sound right. With this new design we have so many goroutines that using a separate thread for each one (and that's the only way to lock goroutines to threads and cores) seems excessive.


doc/BorderRouter.md line 174 at r1 (raw file):

* Restructure the router/dataplane.go file to have a reading, processing and forwarding functionality

* Add buffer reuse support

These two should be kept together, the buffer recycling is a vital part of the concept.
Even including this buffer recycling, I think that the implementation of this step should only affect Dataplane.Run (plus some new helper functionality).

Perhaps one candidate for moving to a separate PR could be the optimization of extracting the flow information for the hash input in the reader routine. That is, use slayers.SCION in a first version for this, and optimize it later with a shipload of tests to ensure that the parsing logic is correct.

Code quote:

* Restructure the router/dataplane.go file to have a reading, processing and forwarding functionality

* Add buffer reuse support

Contributor Author

@rohrerj rohrerj left a comment


Reviewable status: 0 of 4 files reviewed, 13 unresolved discussions (waiting on @matzf)


doc/BorderRouter.md line 1 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Format nitpicks:

Could you use the TEMPLATE.rst, please? For context, see the preview of the still not finalized #4325.
With

Regarding the title, I'd prefer something slightly more specific.

I moved everything to doc/BorderRouter.rst which uses this template.


doc/BorderRouter.md line 37 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

It's not clear to me that having a separate "pool" for each receiver is really beneficial. The size of this pool should be just big enough to ensure that the reader is never starved for packet buffers. For this, it needs to contain as many packets as can be in-flight through the other components plus a read batch size.
If we use a separate pool for each reader, each one of these needs to be big enough. There is also a slight complication in the housekeeping, as the packet buffers need to be returned to the correct pool.
With a shared pool, the total number of pre-allocated packets can be significantly smaller.

The individual pools could turn out advantageous if we want to consider memory locality, in particular on NUMA systems. However, as fixing goroutines/threads to cores is not currently in scope, this is not applicable.

One more consideration in favor of a shared pool: it becomes quite simple to keep batches of packet buffers directly in the pool; the reader can pop B packets in one go from the pool; the processor and forwarder's queues necessarily consist of single packets though. The forwarder can easily re-batch the packets before adding them back to the pool. While this is also possible with separate pools, it would need more state in the writer and generally even more pre-allocated packet buffers.
My gut feeling is that this batching would easily make up for the reader contention on the pool.

You are right. I changed the design document such that all receivers use the same pool.


doc/BorderRouter.md line 39 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

For the pool, it seems that a stack (LIFO) could work better than a channel/queue (FIFO). If we overestimate the pool size (which is likely), a queue forces packet buffers to "cool down" between uses, instead of keeping a smaller subset of the buffers "warm" in the caches.

I think using a queue would also be a good option, at least for the basic implementation, because with a LIFO all goroutines would always need to access the "first" element and hence would all contend for the same lock.


doc/BorderRouter.md line 70 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Is "source" both ISD-AS and local address?

A flow is defined by source, destination and flow ID. Why not include the destination address here as well? As the destination address comes before the source in SCION packets, this would not increase the size of the packet "prefix" inspected before the actual processing.

Also, I would probably include the DT/DL/ST/SL field in the hash to avoid any potential for type confusion (I can't think of a way how this would be exploited, but the additional cost of including these 8 bits seems small enough to just add it out of caution).

Btw. how do we extract the information for the hash input from the packet? The slayers.SCION layer does more than what we need and I guess that every bit of processing overhead in the reader routines hurts.

I changed the text to make it explicit that the ISD-AS and local address are used. And I think it is not required to include the DT/DL/ST/SL fields because they are already used to extract the local address.
My idea would be to not use the slayers.SCION parser, because it parses a lot more than we need at the receiver, but to just extract the bytes that we need.
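
A rough sketch of what extracting only the hash input could look like, assuming the field layout from the SCION header documentation (12-byte common header, then DstISD/DstAS/SrcISD/SrcAS and the variable-length host addresses); the offsets are illustrative and would need the tests mentioned later in the review to confirm they match slayers.SCION.

    package main

    // flowKey returns the bytes to feed into the hash (20-bit FlowID plus the
    // full address header), or nil if the packet is too short.
    func flowKey(pkt []byte) []byte {
        if len(pkt) < 12 {
            return nil
        }
        // Byte 9 holds DT/DL/ST/SL; DL and SL encode the host address
        // lengths as (value+1)*4 bytes.
        dl := int(pkt[9]>>4) & 0x3
        sl := int(pkt[9]) & 0x3
        // Address header: DstISD(2) + DstAS(6) + SrcISD(2) + SrcAS(6),
        // followed by the destination and source host addresses.
        addrLen := 16 + (dl+1)*4 + (sl+1)*4
        if len(pkt) < 12+addrLen {
            return nil
        }
        key := make([]byte, 0, 3+addrLen)
        key = append(key, pkt[1]&0x0F, pkt[2], pkt[3]) // 20-bit FlowID
        key = append(key, pkt[12:12+addrLen]...)
        // In the real reader the bytes would be hashed in place to avoid
        // this allocation.
        return key
    }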


doc/BorderRouter.md line 72 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Which hash function?

💯 for the random value.

My idea is to use the fnv-1a hash function because it is very fast. But I am open for other suggestions.
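
A minimal sketch of this mapping with hash/fnv from the Go standard library; feeding the per-startup random value first also addresses the "sticky zero" concern raised later in the review. The function and parameter names are illustrative.

    package main

    import (
        "encoding/binary"
        "hash/fnv"
    )

    // procID maps a flow key (FlowID plus address header bytes) to one of
    // numProcessors processing routines, seeded with a per-startup random value.
    func procID(seed uint32, flowKey []byte, numProcessors int) int {
        h := fnv.New32a()
        var s [4]byte
        binary.BigEndian.PutUint32(s[:], seed)
        h.Write(s[:])    // random value first
        h.Write(flowKey) // then the flow identification bytes
        return int(h.Sum32() % uint32(numProcessors))
    }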


doc/BorderRouter.md line 75 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Not really "slow" path, but still sort of related; should/could we separate the processing of received BFD messages into a separate queue to really not drop these even if the router is very busy? Or is this a bad idea? I can't quite make up my mind. 😜

Btw, currently sending BFD messages just WriteTos to the socket (bfdSend.Send). This will probably not be good for the throughput in the forwarder routines. Perhaps they should have access to the packet buffer pool and instead of WriteToing, insert this into the queue of the correct forwarder routine.
FWIW, I think this can be addressed in a separate follow up step.

I will think about whether I can do this as part of the slow-path change or whether it will become future work.


doc/BorderRouter.md line 78 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

This is not clear enough for me. How does this hand-off work? I can imagine that the fast-path would bail out with some processingResult (or error) indicating that the packet needs to be resubmitted to the slow path. This could include some details on the problem (e.g. including SCMP error type and pointer), so that the slow-path doesn't need to repeat all the processing. We need to be careful though to share only data that can be copied into the queue, without causing allocations.
I think there would then be two variants of the scionPacketProcessor, one fast and one slow.

Please describe in more detail how this mechanism would look.

I added a rough idea but as soon as the basic implementation is done it should be possible to explain this in more detail.


doc/BorderRouter.md line 102 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

As commented above, I'd rather go with a shared pool. The pool needs to be just big enough to ensure that the readers are not starved for packet buffers. We can quite easily ensure this by computing the max number of packets in-flight through the system. Roughly:

pool_size := numReaders * readBatch + numProcessors * (processorQueueSize + 1)  + numWriters * (writerQueueSize + writeBatchSize) // + slow path  + bfd processors + ... ?

Possibly it could make sense to allow an upper bound for this on memory constrained systems -- however, even with rather large numbers (e.g. 100 interfaces, 32 processors) I only get fifty megabytes for this. That's not insignificant of course, but I'd argue that this is acceptable for the types of system that will be used to run such a setup.
I think we can remove this knob for now.

Done.


doc/BorderRouter.md line 109 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Does it ever make sense to have more processing routines than the number of cores? These processors don't do I/O (other than interact with their input/output queues) or any other blocking operations. If there are more, they'd just be "swapping".

No, it does not make sense to use more processing routines than the number of cores.


doc/BorderRouter.md line 116 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

I think a sensible default would be 1.

Done.


doc/BorderRouter.md line 122 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Recall that this is a userspace program and that we already have buffers on the underlying OS network stack both on the receive and transmit sides. We don't need to, and shouldn't, add more than minimal internal buffering.

The internal buffers should be big enough to accommodate bursts caused by the internal structure, but not bigger. For the processing routines, the queue should be able to hold a read batch, and the sum of the queue sizes of the processing goroutines should be big enough to contain the read batches from all the reader routines.
The writer routines' queue should only be big enough for a single batch write.

I would not expose this as a configuration option. Operators should use the knob of the OS network stack first.

I removed this part from the configuration section.


doc/BorderRouter.md line 137 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

That doesn't sound right. With this new design we have so many goroutines that using a separate thread for each one (and that's the only way to lock goroutines to threads and cores) seems excessive.

I removed the second part of the sentence.


doc/BorderRouter.md line 174 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

These two should be kept together, the buffer recycling is a vital part of the concept.
Even including this buffer recycling, I think that the implementation of this step should only affect Dataplane.Run (plus some new helper functionality).

Perhaps one candidate for moving to a separate PR could be the optimization of extracting the flow information for the hash input in the reader routine. That is, use slayers.SCION in a first version for this, and optimize it later with a shipload of tests to ensure that the parsing logic is correct.

I combined those two.

Contributor

@matzf matzf left a comment


Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @rohrerj)


doc/BorderRouter.rst line 95 at r2 (raw file):

slow-path processing routines.
To do so, we hand over the current buffer to the slow-path routine together with all the information
that the processing routine already computed that are required by the slow-path processing routine.

Definitely not all the information. The pre-allocated layer and path helper objects in the scionPacketProcessor must not be shared between goroutines. Otherwise we'd need to keep re-allocating new memory for these (not acceptable), or need more infrastructure to recycle these explicitly (e.g. just like the packet buffer recycling). The latter seems excessively complicated. I think that it's preferable to only pass on minimal context information to the slow path (e.g. just the SCMP error codes) and redo some of the parsing if necessary.


doc/BorderRouter.rst line 120 at r2 (raw file):

.. code-block:: text

    pool_size := numReaders * readBatch + numProcessors * (processorQueueSize + 1) + numWriters * (writerQueueSize + writeBatchSize)

IMO this should not be configurable.
Keep the content (just move it up to the "design" section) and also describe how the queue sizes are to be set.
The read/write batch sizes should perhaps be platform specific values, as the batch syscalls are only implemented for linux.


doc/BorderRouter.md line 39 at r1 (raw file):

Previously, rohrerj wrote…

I think using a queue would also be a good option, at least for the basic implementation, because with a LIFO all goroutines would always need to access the "first" element and hence would all contend for the same lock.

I'm not sure that contention would really be significantly worse with a stack than with a queue. But anyway ok to use a queue for an initial implementation.


doc/BorderRouter.md line 70 at r1 (raw file):
Ok.

... triple of source (ISD-AS + local address), destination (ISD-AS + local address) and flowID, ...

"Local address" is a somewhat confusing term here. Use the field names from the address header spec/documentation (DstISD, DstAS, DstHostAddr, ...) or just say "the full :ref:`address-header`.".

Ok to extract the relevant bytes directly. Can you please note this somewhere in the document and also note that this should have a set of tests to ensure that the behaviour is identical to the slayers.SCION parsing.


doc/BorderRouter.md line 72 at r1 (raw file):

Previously, rohrerj wrote…

My idea is to use the fnv-1a hash function because it is very fast. But I am open for other suggestions.

Interesting. I was a bit surprised that we use a non-cryptographic hash function; I guess it's that the output range is so small (just the number of processing routines, so at the very most a few hundred) that having a cryptographic hash wouldn't make it significantly harder to find collisions. Is this roughly right, or is there a different justification for using a non-cryptographic hash?

The "sticky zero" property of fnv seems slightly worrisome to me. At the very least, we ensure to put the random value at the beginning of the hash input, to ensure that it isn't ignored due to a sticky zero.


doc/BorderRouter.md line 78 at r1 (raw file):

Previously, rohrerj wrote…

I added a rough idea but as soon as the basic implementation is done it should be possible to explain this in more detail.

AFAIU, the basic implementation does not affect the scionPacketProcessor at all -- perhaps this should be discussed explicitly, e.g. in the implementation section.

So IMO it should be possible to explain this in more detail already now.


doc/BorderRouter.md line 109 at r1 (raw file):

Previously, rohrerj wrote…

No, it does not make sense to use more processing routines than the number of cores.

Maybe then runtime.GOMAXPROCS would be a sensible way to determine the number of these processing routines. This defaults to the number of cores, and AFAIU already takes into account any CPU affinity masks etc. This would almost always be the right value. And if an operator still needs to tweak it, we don't even need a separate knob, because it can be set with the GOMAXPROCS environment variable.
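
A sketch of this suggestion; GOMAXPROCS(0) only queries the current value without changing it.

    package main

    import "runtime"

    // numProcessingRoutines derives the processor count from GOMAXPROCS, which
    // defaults to the number of usable cores and honors the GOMAXPROCS
    // environment variable.
    func numProcessingRoutines() int {
        return runtime.GOMAXPROCS(0)
    }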

Contributor Author

@rohrerj rohrerj left a comment


:lgtm:

Reviewable status: 3 of 4 files reviewed, 6 unresolved discussions (waiting on @matzf)


doc/BorderRouter.rst line 95 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Definitely not all the information. The pre-allocated layer and path helper objects in the scionPacketProcessor must not be shared between goroutines. Otherwise we'd need to keep re-allocating new memory for these (not acceptable), or need more infrastructure to recycle these explicitly (e.g. just like the packet buffer recycling). The latter seems excessively complicated. I think that it's preferable to only pass on minimal context information to the slow path (e.g. just the SCMP error codes) and redo some of the parsing if necessary.

I changed the text so that only the buffer is passed on together with the error codes, and noted that if more information is needed, certain computations have to be redone.


doc/BorderRouter.rst line 120 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

IMO this should not be configurable.
Keep the content (just move it up to the "design" section) and also describe how the queue sizes are to be set.
The read/write batch sizes should perhaps be platform specific values, as the batch syscalls are only implemented for linux.

I moved the pool size out of the configuration chapter but created configuration entries for the read / write batch and processing routine / forwarder channel size.


doc/BorderRouter.md line 70 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Ok.

... triple of source (ISD-AS + local address), destination (ISD-AS + local address) and flowID, ...

"Local address" is a somewhat confusing term here. Use the field names from the address header spec/documentation (DstISD, DstAS, DstHostAddr, ...) or just say "the full :ref:`address-header`.".

Ok to extract the relevant bytes directly. Can you please note this somewhere in the document and also note that this should have a set of tests to ensure that the behaviour is identical to the slayers.SCION parsing.

I changed the naming to the full address header.


doc/BorderRouter.md line 72 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Interesting. I was a bit surprised that we use a non-cryptographic hash function; I guess it's that the output range is so small (just the number of processing routines, so at the very most a few hundred) that having a cryptographic hash wouldn't make it significantly harder to find collisions. Is this roughly right, or is there a different justification for using a non-cryptographic hash?

The "sticky zero" property of fnv seems slightly worrisome to me. At the very least, we ensure to put the random value at the beginning of the hash input, to ensure that it isn't ignored due to a sticky zero.

Exactly. I also added a note regarding the sticky zero property.


doc/BorderRouter.md line 78 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

AFAIU, the basic implementation does not affect the scionPacketProcessor at all -- perhaps this should be discussed explicitly, e.g. in the implementation section.

So IMO it should be possible to explain this in more detail already now.

I added some more details regarding the slow-path processing routines.


doc/BorderRouter.md line 109 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Maybe then runtime.GOMAXPROCS would be a sensible way to determine the number of these processing routines. This defaults to the number of cores, and AFAIU already takes into account any CPU affinity masks etc. This would almost always be the right value. And if an operator still needs to tweak it, we don't even need a separate knob, because it can be set with the GOMAXPROCS environment variable.

I think adding a configuration entry makes more sense because otherwise we would have two problems:

  1. we don't take the number of CPU cores into account for the receivers, forwarders and slow-path processing routines, and
  2. one instance of a border router might spawn too many goroutines if other services also run on the same system. This could be solved by manually setting that environment variable, but a config entry would work too.

I personally would prefer to use a config entry.

Contributor

@matzf matzf left a comment


Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @rohrerj)


doc/BorderRouter.rst line 120 at r2 (raw file):
Ok.
The queue size configuration has been removed in the previous iteration. The queue sizes should be determined automatically. Citing my own previous comment:

The internal buffers should be big enough to accommodate bursts caused by the internal structure, but not bigger. For the processing routines, the queue should be able to hold a read batch, and the sum of the queue sizes of the processing goroutines should be big enough to contain the read batches from all the reader routines.
The writer routines' queue should only be big enough for a single batch write.

Expressed differently:

processorQueueSize := max( ceil(numReaders * readBatch / numProcessors) , readBatch )
writerQueueSize := writeBatchSize

Maybe mention this in the document (just before the pool_size definition would seem like a good place).

Adding a configuration option for the batch size seems fine. As it's closely related to the UDP read and write buffer, perhaps we should also add a setting for these (currently hard-coded to 1MB, subject to system rmem_max).
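
The two sizing rules above, transcribed into Go for clarity (illustrative names, integer ceiling division):

    package main

    func queueSizes(numReaders, numProcessors, readBatch, writeBatchSize int) (procQ, writerQ int) {
        procQ = (numReaders*readBatch + numProcessors - 1) / numProcessors // ceil
        if procQ < readBatch {
            procQ = readBatch
        }
        writerQ = writeBatchSize
        return procQ, writerQ
    }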


doc/BorderRouter.rst line 159 at r3 (raw file):

from / to a network socket and how many packets can be enqueued at the processing routines and the
forwarders before packets are getting dropped.
A default value for both queue size and batch size would be 64.

The only limiting factor for the batch size is the memory overhead for preallocating all these packet buffers. Otherwise, having a bigger batch size seems better. I would thus choose a somewhat aggressive default, e.g. 256, so that it only needs to be tuned to fit constrained systems.


doc/BorderRouter.md line 109 at r1 (raw file):

Previously, rohrerj wrote…

I think adding a configuration entry makes more sense because otherwise we would have two problems:

  1. we don't take the number of CPU cores into account for the receivers, forwarders and slow-path processing routines, and
  2. one instance of a border router might spawn too many goroutines if other services also run on the same system. This could be solved by manually setting that environment variable, but a config entry would work too.

I personally would prefer to use a config entry.

The advantage of GOMAXPROCS is exactly that it is the correct value for the operator to set if other services are also running on the same system. Or the other way around, if the operator isolates different services by restricting their CPU masks, GOMAXPROCS will already give us the right number.

The reader/writer goroutines are (almost) pure IO, so these should not be counted. Ideally they'd be running on the same threads as the processing goroutine for the packets, to exploit any locality (but we cannot influence the scheduling much).

The slow path goroutines I'm not sure about. If these get a separate thread to run on, there is very little rate limiting going on.

Contributor Author

@rohrerj rohrerj left a comment


Reviewable status: 0 of 5 files reviewed, 3 unresolved discussions (waiting on @matzf)


doc/BorderRouter.md line 109 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

The advantage of GOMAXPROCS is exactly that it is the correct value for the operator to set if other services are also running on the same system. Or the other way around, if the operator isolates different services by restricting their CPU masks, GOMAXPROCS will already give us the right number.

The reader/writer goroutines are (almost) pure IO, so these should not be counted. Ideally they'd be running on the same threads as the processing goroutine for the packets, to exploit any locality (but we cannot influence the scheduling much).

The slow path goroutines I'm not sure about. If these get a separate thread to run on, there is very little rate limiting going on.

I changed the part in the configuration section to just use the value from the environment variable. So it can be configured by setting the environment variable.


doc/BorderRouter.rst line 120 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Ok.
The queue size configuration has been removed in the previous iteration. The queue sizes should be determined automatically. Citing my own previous comment:

The internal buffers should be big enough to accommodate bursts caused by the internal structure, but not bigger. For the processing routines, the queue should be able to hold a read batch, and the sum of the queue sizes of the processing goroutines should be big enough to contain the read batches from all the reader routines.
The writer routines' queue should only be big enough for a single batch write.

Expressed differently:

processorQueueSize := max( ceil(numReaders * readBatch / numProcessors) , readBatch )
writerQueueSize := writeBatchSize

Maybe mention this in the document (just before the pool_size definition would seem like a good place).

Adding a configuration option for the batch size seems fine. As it's closely related to the UDP read and write buffer, perhaps we should also add a setting for these (currently hard-coded to 1MB, subject to system rmem_max).

I added the processorQueueSize formula to the design document. I also added a configuration option for the UDP read and write buffer in the configuration section.


doc/BorderRouter.rst line 159 at r3 (raw file):

Previously, matzf (Matthias Frei) wrote…

The only limiting factor for the batch size is the memory overhead for preallocating all these packet buffers. Otherwise, having a bigger batch size seems better. I would thus choose a somewhat aggressive default, e.g. 256, so that it only needs to be tuned to fit constrained systems.

Done.

Contributor

@matzf matzf left a comment


Reviewed 5 of 5 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @rohrerj)


doc/dev/design/BorderRouter.rst line 125 at r4 (raw file):

.. code-block:: text
    

Nit: remove the trailing whitespace.

Contributor

@matzf matzf left a comment


:lgtm:

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @rohrerj)

@matzf matzf merged commit 5ba8660 into scionproto:master Jun 1, 2023
matzf pushed a commit that referenced this pull request Jul 18, 2023
Restructure the border router's packet processing loop into receiver,
processing and forwarder goroutines.

This is implementation part one of three described in the design
document (github.com//pull/4339, doc/dev/design/BorderRouter.rst).
matzf pushed a commit that referenced this pull request Aug 4, 2023
Implement configuration for the border router processing structure
added in #4351.

This is implementation part two of three described in the design
document (#4339, doc/dev/design/BorderRouter.rst).
matzf pushed a commit that referenced this pull request Sep 28, 2023
Add a low-priority slow path for special case packet handling, in
particular SCMP errors and traceroutes, for the asynchronous packet
processing pipeline in the router added in #4351.

This is implementation part three of three described in the design
document (#4339, doc/dev/design/BorderRouter.rst).

Closes #4334
jiceatscion pushed a commit to jiceatscion/scion that referenced this pull request Sep 28, 2023
Add a low-priority slow path for special case packet handling, in
particular SCMP errors and traceroutes, for the asynchronous packet
processing pipeline in the router added in scionproto#4351.

This is implementation part three of three described in the design
document (scionproto#4339, doc/dev/design/BorderRouter.rst).

Closes scionproto#4334