feat(config): allow separate config for peer and incoming span queue (#916)

## Which problem is this PR solving?

- #826 

## Short description of the changes

This PR adds two new configuration options, `Collection.PeerQueueSize`
and `Collection.IncomingQueueSize`. These give users more control over
the size of each span queue based on their Refinery setup.

Before this change, both queue sizes were defined as
3 * `Collection.CacheCapacity`. Since the queue size should always be
larger than the trace cache, I decided to set the minimum value for both
queue sizes to 3 * `Collection.CacheCapacity`. I'm not sure whether there is a
use case for a smaller `PeerQueueSize` when Refinery is not running in a
cluster; I would love to get some feedback on that.

fix #826
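
For illustration, a fragment of a v2 config using the new options might look like the following (the numeric values are made up for this example):

```yaml
General:
  ConfigurationVersion: 2
Collection:
  CacheCapacity: 10000
  # Both queues are clamped to a minimum of 3 * CacheCapacity
  # (30000 here); smaller values are raised to that minimum.
  IncomingQueueSize: 60000
  PeerQueueSize: 90000
```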
VinozzZ authored Nov 30, 2023
1 parent ba71dac commit b975a87
Showing 8 changed files with 248 additions and 48 deletions.
4 changes: 2 additions & 2 deletions collect/collect.go
@@ -117,8 +117,8 @@ func (i *InMemCollector) Start() error {
return err
}

-	i.incoming = make(chan *types.Span, imcConfig.CacheCapacity*3)
-	i.fromPeer = make(chan *types.Span, imcConfig.CacheCapacity*3)
+	i.incoming = make(chan *types.Span, imcConfig.GetIncomingQueueSize())
+	i.fromPeer = make(chan *types.Span, imcConfig.GetPeerQueueSize())
i.Metrics.Store("INCOMING_CAP", float64(cap(i.incoming)))
i.Metrics.Store("PEER_CAP", float64(cap(i.fromPeer)))
i.reload = make(chan struct{}, 1)
58 changes: 49 additions & 9 deletions config.md
@@ -1,7 +1,7 @@
# Honeycomb Refinery Configuration Documentation

This is the documentation for the configuration file for Honeycomb's Refinery.
-It was automatically generated on 2023-11-29 at 18:47:22 UTC.
+It was automatically generated on 2023-11-30 at 19:50:23 UTC.

## The Config file

@@ -10,7 +10,9 @@ The file is split into sections; each section is a group of related configuratio
Each section has a name, and the name is used to refer to the section in other parts of the config file.

## Sample

This is a sample config file:

```yaml
General:
ConfigurationVersion: 2
@@ -25,6 +27,7 @@ OTelMetrics:
The remainder of this document describes the sections within the file and the fields in each.
## Table of Contents
- [General Configuration](#general-configuration)
- [Network Configuration](#network-configuration)
- [Access Key Configuration](#access-key-configuration)
@@ -46,9 +49,11 @@ The remainder of this document describes the sections within the file and the fi
- [gRPC Server Parameters](#grpc-server-parameters)
- [Sample Cache](#sample-cache)
- [Stress Relief](#stress-relief)
## General Configuration
`General` contains general configuration options that apply to the entire Refinery process.

### `ConfigurationVersion`

ConfigurationVersion is the file format of this particular configuration file.
@@ -99,6 +104,7 @@ If the config file is being loaded from a URL, it may be wise to increase this v
## Network Configuration

`Network` contains network configuration options.

### `ListenAddr`

ListenAddr is the address where Refinery listens for incoming requests.
@@ -165,6 +171,7 @@ If `false`, then all traffic is accepted and `ReceiveKeys` is ignored.
## Refinery Telemetry

`RefineryTelemetry` contains configuration information for the telemetry that Refinery uses to record its own operation.

### `AddRuleReasonToTrace`

AddRuleReasonToTrace controls whether to decorate traces with Refinery rule evaluation results.
@@ -214,6 +221,7 @@ If `true`, then Refinery will add the following tag to all traces: - `meta.refin
## Traces

`Traces` contains configuration for how traces are managed.

### `SendDelay`

SendDelay is the duration to wait before sending a trace.
@@ -278,6 +286,7 @@ Decreasing this will check the trace cache for timeouts more frequently.
## Debugging

`Debugging` contains configuration values used when setting up and debugging Refinery.

### `DebugServiceAddr`

DebugServiceAddr is the IP and port where the debug service runs.
@@ -332,6 +341,7 @@ In addition, `SampleRate` will be set to the incoming rate for all traces, and t
## Refinery Logger

`Logger` contains configuration for logging.

### `Type`

Type is the type of logger to use.
@@ -362,6 +372,7 @@ Level is the logging level above which Refinery should send a log to the logger.

`HoneycombLogger` contains configuration for logging to Honeycomb.
Only used if `Logger.Type` is "honeycomb".

### `APIHost`

APIHost is the URL of the Honeycomb API where Refinery sends its logs.
@@ -419,6 +430,7 @@ The sampling algorithm attempts to make sure that the average throughput approxi

`StdoutLogger` contains configuration for logging to `stdout`.
Only used if `Logger.Type` is "stdout".

### `Structured`

Structured controls whether to use structured logging.
@@ -452,6 +464,7 @@ The sampling algorithm attempts to make sure that the average throughput approxi
## Prometheus Metrics

`PrometheusMetrics` contains configuration for Refinery's internally-generated metrics as made available through Prometheus.

### `Enabled`

Enabled controls whether to expose Refinery metrics over the `PrometheusListenAddr` port.
@@ -605,6 +618,7 @@ In rare circumstances, compression costs may outweigh the benefits, in which cas
## Peer Management

`PeerManagement` controls how the Refinery cluster communicates between peers.

### `Type`

Type is the type of peer management to use.
@@ -771,26 +785,52 @@ This is not recommended for production use since a burst of traffic could cause
CacheCapacity is the number of traces to keep in the cache's circular buffer.

The collection cache is used to collect all spans into a trace as well as remember the sampling decision for any spans that might come in after the trace has been marked "complete" (either by timing out or seeing the root span).
-The number of traces in the cache should be many multiples (100x to 1000x) of the total number of concurrently active traces (trace throughput * trace duration).
+The number of traces in the cache should be many multiples (100x to 1000x) of the total number of concurrently active traces (trace throughput \* trace duration).

- Eligible for live reload.
- Type: `int`
- Default: `10000`

### `PeerQueueSize`

PeerQueueSize is the maximum number of in-flight spans redirected from other peers that are stored in the peer span queue.

The peer span queue buffers spans redirected from other peers before they are processed.
If this queue fills up, then subsequent spans will be dropped.
The right size for this queue depends on the number of peers in the cluster: with N peers sharing traffic evenly, roughly (N-1)/N of all spans arrive via redirection from other peers.
Its value should be at least three times `CacheCapacity`; smaller values are raised to that minimum.

- Not eligible for live reload.
- Type: `int`
- Default: `30000`

### `IncomingQueueSize`

IncomingQueueSize is the number of in-flight spans to keep in the incoming span queue.

The incoming span queue is used to buffer spans before they are processed.
If this queue fills up, then subsequent spans will be dropped.
Its value should be at least three times `CacheCapacity`; smaller values are raised to that minimum.

- Not eligible for live reload.
- Type: `int`
- Default: `30000`

### `AvailableMemory`

AvailableMemory is the amount of system memory available to the Refinery process.

This value will typically be set through an environment variable controlled by the container or deploy script.
If this value is zero or not set, then `MaxMemory` cannot be used to calculate the maximum allocation and `MaxAlloc` will be used instead.
If set, then this must be a memory size.
-64-bit values are supported.
-Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are also supported.
-If set, `Collections.MaxAlloc` must not be defined.
+Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are supported.
+Fractional values with a suffix are supported.
+If `AvailableMemory` is set, `Collections.MaxAlloc` must not be defined.

- Eligible for live reload.
- Type: `memorysize`
-- Example: `4Gb`
+- Example: `4.5Gb`
- Environment variable: `REFINERY_AVAILABLE_MEMORY`
- Command line switch: `--available-memory`

@@ -813,8 +853,8 @@ Useful values for this setting are generally in the range of 70-90.
MaxAlloc is the maximum number of bytes that should be allocated by the Collector.

If set, then this must be a memory size.
-64-bit values are supported.
-Sizes with standard unit suffixes (`MB`, `GiB`, etc) and Kubernetes units (`M`, `Gi`, etc) are also supported.
+Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are supported.
+Fractional values with a suffix are supported.
See `MaxMemory` for more details.
If set, `Collections.AvailableMemory` must not be defined.

@@ -852,6 +892,7 @@ If this happens, then you should increase this buffer size.
## Specialized Configuration

`Specialized` contains special-purpose configuration options that are not typically needed.

### `EnvironmentCacheTTL`

EnvironmentCacheTTL is the duration for which environment information is cached.
@@ -1117,4 +1158,3 @@ If this duration is `0`, then Refinery will not start in stressed mode, which wi
- Eligible for live reload.
- Type: `duration`
- Default: `3s`

48 changes: 48 additions & 0 deletions config/config_test.go
@@ -452,6 +452,54 @@ func TestMaxAlloc(t *testing.T) {
assert.Equal(t, expected, inMemConfig.MaxAlloc)
}

func TestPeerAndIncomingQueueSize(t *testing.T) {
testcases := []struct {
name string
configYAML string
expectedForPeer int
expectedForIncoming int
}{
{
name: "default",
configYAML: makeYAML("General.ConfigurationVersion", 2, "Collection.CacheCapacity", 1000),
expectedForPeer: 3000,
expectedForIncoming: 3000,
},
{
name: "PeerQueueSize is set",
configYAML: makeYAML("General.ConfigurationVersion", 2, "Collection.CacheCapacity", 1000, "Collection.PeerQueueSize", 4000),
expectedForPeer: 4000,
expectedForIncoming: 3000,
},
{
name: "IncomingQueueSize is set",
configYAML: makeYAML("General.ConfigurationVersion", 2, "Collection.CacheCapacity", 1000, "Collection.IncomingQueueSize", 4000),
expectedForPeer: 3000,
expectedForIncoming: 4000,
},
{
name: "below the minimum",
configYAML: makeYAML("General.ConfigurationVersion", 2, "Collection.CacheCapacity", 1000, "Collection.PeerQueueSize", 2000, "Collection.IncomingQueueSize", 2000),
expectedForPeer: 3000,
expectedForIncoming: 3000,
},
}

for _, tc := range testcases {
rm := makeYAML("ConfigVersion", 2)
config, rules := createTempConfigs(t, tc.configYAML, rm)
defer os.Remove(rules)
defer os.Remove(config)
c, err := getConfig([]string{"--no-validate", "--config", config, "--rules_config", rules})
assert.NoError(t, err)

inMemConfig, err := c.GetCollectionConfig()
assert.NoError(t, err)
assert.Equal(t, tc.expectedForPeer, inMemConfig.GetPeerQueueSize())
assert.Equal(t, tc.expectedForIncoming, inMemConfig.GetIncomingQueueSize())
}
}

func TestAvailableMemoryCmdLine(t *testing.T) {
cm := makeYAML("General.ConfigurationVersion", 2, "Collection.CacheCapacity", 1000, "Collection.AvailableMemory", 2_000_000_000)
rm := makeYAML("ConfigVersion", 2)
24 changes: 24 additions & 0 deletions config/file_config.go
@@ -173,18 +173,42 @@ type RedisPeerManagementConfig struct {
type CollectionConfig struct {
// CacheCapacity must be less than math.MaxInt32
CacheCapacity int `yaml:"CacheCapacity" default:"10_000"`
PeerQueueSize int `yaml:"PeerQueueSize"`
IncomingQueueSize int `yaml:"IncomingQueueSize"`
AvailableMemory MemorySize `yaml:"AvailableMemory" cmdenv:"AvailableMemory"`
MaxMemoryPercentage int `yaml:"MaxMemoryPercentage" default:"75"`
MaxAlloc MemorySize `yaml:"MaxAlloc"`
}

// GetMaxAlloc returns the maximum amount of memory to use for the cache.
// If AvailableMemory is set, it uses that and MaxMemoryPercentage to calculate the limit.
func (c CollectionConfig) GetMaxAlloc() MemorySize {
if c.AvailableMemory == 0 || c.MaxMemoryPercentage == 0 {
return c.MaxAlloc
}
return c.AvailableMemory * MemorySize(c.MaxMemoryPercentage) / 100
}

// GetPeerQueueSize returns the capacity of the in-memory channel for spans from peers.
// If PeerQueueSize is not set, it uses 3x the cache capacity.
// The minimum value is 3x the cache capacity.
func (c CollectionConfig) GetPeerQueueSize() int {
if c.PeerQueueSize == 0 || c.PeerQueueSize < c.CacheCapacity*3 {
return c.CacheCapacity * 3
}
return c.PeerQueueSize
}

// GetIncomingQueueSize returns the capacity of the in-memory channel for incoming spans.
// If IncomingQueueSize is not set, it uses 3x the cache capacity.
// The minimum value is 3x the cache capacity.
func (c CollectionConfig) GetIncomingQueueSize() int {
if c.IncomingQueueSize == 0 || c.IncomingQueueSize < c.CacheCapacity*3 {
return c.CacheCapacity * 3
}
return c.IncomingQueueSize
}

type BufferSizeConfig struct {
UpstreamBufferSize int `yaml:"UpstreamBufferSize" default:"10_000"`
PeerBufferSize int `yaml:"PeerBufferSize" default:"100_000"`
Expand Down
24 changes: 24 additions & 0 deletions config/metadata/configMeta.yaml
@@ -982,6 +982,30 @@ groups:
many multiples (100x to 1000x) of the total number of concurrently
active traces (trace throughput * trace duration).
- name: PeerQueueSize
type: int
default: 30_000
valuetype: nondefault
reload: false
summary: is the maximum number of in-flight spans redirected from other peers that are stored in the peer span queue.
description: >
The peer span queue buffers spans redirected from other peers before they are processed.
If this queue fills up, then subsequent spans will be dropped. The right size for this queue
depends on the number of peers in the cluster: with N peers sharing traffic evenly, roughly
(N-1)/N of all spans arrive via redirection from other peers.
Its value should be at least three times the CacheCapacity; smaller values are raised to that minimum.
- name: IncomingQueueSize
type: int
default: 30_000
valuetype: nondefault
reload: false
summary: is the number of in-flight spans to keep in the incoming span queue.
description: >
The incoming span queue is used to buffer spans before they are processed.
If this queue fills up, then subsequent spans will be dropped.
Its value should be at least three times the CacheCapacity; smaller values are raised to that minimum.
- name: AvailableMemory
type: memorysize
valuetype: memorysize
38 changes: 32 additions & 6 deletions refinery_config.md
@@ -762,20 +762,46 @@ The number of traces in the cache should be many multiples (100x to 1000x) of th
- Type: `int`
- Default: `10000`

### `PeerQueueSize`

`PeerQueueSize` is the maximum number of in-flight spans redirected from other peers that are stored in the peer span queue.

The peer span queue buffers spans redirected from other peers before they are processed.
If this queue fills up, then subsequent spans will be dropped.
The right size for this queue depends on the number of peers in the cluster: with N peers sharing traffic evenly, roughly (N-1)/N of all spans arrive via redirection from other peers.
Its value should be at least three times `CacheCapacity`; smaller values are raised to that minimum.

- Not eligible for live reload.
- Type: `int`
- Default: `30000`

### `IncomingQueueSize`

`IncomingQueueSize` is the number of in-flight spans to keep in the incoming span queue.

The incoming span queue is used to buffer spans before they are processed.
If this queue fills up, then subsequent spans will be dropped.
Its value should be at least three times `CacheCapacity`; smaller values are raised to that minimum.

- Not eligible for live reload.
- Type: `int`
- Default: `30000`

### `AvailableMemory`

`AvailableMemory` is the amount of system memory available to the Refinery process.

This value will typically be set through an environment variable controlled by the container or deploy script.
If this value is zero or not set, then `MaxMemory` cannot be used to calculate the maximum allocation and `MaxAlloc` will be used instead.
If set, then this must be a memory size.
-64-bit values are supported.
-Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are also supported.
-If set, `Collections.MaxAlloc` must not be defined.
+Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are supported.
+Fractional values with a suffix are supported.
+If `AvailableMemory` is set, `Collections.MaxAlloc` must not be defined.

- Eligible for live reload.
- Type: `memorysize`
-- Example: `4Gb`
+- Example: `4.5Gb`
- Environment variable: `REFINERY_AVAILABLE_MEMORY`
- Command line switch: `--available-memory`

@@ -798,8 +824,8 @@ Useful values for this setting are generally in the range of 70-90.
`MaxAlloc` is the maximum number of bytes that should be allocated by the Collector.

If set, then this must be a memory size.
-64-bit values are supported.
-Sizes with standard unit suffixes (`MB`, `GiB`, etc) and Kubernetes units (`M`, `Gi`, etc) are also supported.
+Sizes with standard unit suffixes (`MB`, `GiB`, etc.) and Kubernetes units (`M`, `Gi`, etc.) are supported.
+Fractional values with a suffix are supported.
See `MaxMemory` for more details.
If set, `Collections.AvailableMemory` must not be defined.
