update README.md and doc/design.md
Signed-off-by: Ryotaro Banno <ryotaro.banno@gmail.com>
ushitora-anqou committed Jan 29, 2024
1 parent b36b2a6 commit a1d7d69
Showing 2 changed files with 112 additions and 2 deletions.
34 changes: 32 additions & 2 deletions README.md
@@ -43,12 +43,42 @@ IO latency of read.
TYPE: gauge

### `pie_create_probe_total`
The number of attempts that the creation of the Pod object and the creation of the container.
The number of attempts of the creation of the Pod object and the creation of the container.

TYPE: counter

### `pie_performance_probe_total`
The number of attempts that the creation of the Pod object and the creation of the container.
The number of attempts of performing the IO benchmarks.

TYPE: counter

### `pie_io_write_latency_on_mount_probe_seconds`

_Experimental metrics._ IO latency of write, benchmarked on mount-probe Pods.

TYPE: gauge

### `pie_io_read_latency_on_mount_probe_seconds`

_Experimental metrics._ IO latency of read, benchmarked on mount-probe Pods.

TYPE: gauge

### `pie_mount_probe_total`

_Experimental metrics._ The number of attempts of the creation of the mount-probe Pod object and the creation of the container.

TYPE: counter

### `pie_performance_on_mount_probe_total`

_Experimental metrics._ The number of attempts of performing the IO benchmarks on mount-probe Pods.

TYPE: counter

### `pie_provision_probe_total`

_Experimental metrics._ The number of attempts of the creation of the provision-probe Pod object and the creation of the container.

TYPE: counter

80 changes: 80 additions & 0 deletions docs/design.md
@@ -82,3 +82,83 @@ Then, if the PV cannot be created due to some problems, the metric would not be
you would not realize that there are some problems.

Therefore, if the PV is not created within a certain time, `create_probe_total` counter with `on_time=false` is incremented so that you can notice the problem even when the PV creation is completely stopped.

### Experimental Architecture using provision-probe and mount-probe

The current probe checks that both a new provisioning of a PV and its mounting succeed on every Node.
This check is stricter than necessary: mounting an already provisioned PV should succeed on every Node, but for new provisioning it is enough that it succeeds on at least one Node.

To address the above issue, the new architecture has the following two types of probes:
- provision-probe, which checks that a new provision succeeds; and
- mount-probe, which checks that a PV (possibly already provisioned) can be successfully mounted on each Node.

```mermaid
flowchart TB
Prometheus[Prometheus, <br>VictoriaMetrics] -->|scrape| controller
controller[controller]
controller -->|create| cronjobA[CronJob] -->|create| probeAA
controller -->|create| cronjobB[CronJob] -->|create| probeAB
controller -->|create| cronjobC[CronJob] -->|create| probeBA
controller -->|create| cronjobD[CronJob] -->|create| probeBB
controller -->|create| cronjobE[CronJob] -->|create| probeProvisionA
controller -->|create| cronjobF[CronJob] -->|create| probeProvisionB
probeAA -->|use| volumeA[(PersistentVolume)]
probeAB -->|use| volumeB[(PersistentVolume)]
probeBA -->|use| volumeC[(PersistentVolume)]
probeBB -->|use| volumeD[(PersistentVolume)]
probeProvisionA -->|use| volumeE[(Generic Ephemeral Volume)]
probeProvisionB -->|use| volumeF[(Generic Ephemeral Volume)]
probeAA -->|post metrics| controller
probeAB -->|post metrics| controller
probeBA -->|post metrics| controller
probeBB -->|post metrics| controller
probeProvisionA -->|post metrics| controller
probeProvisionB -->|post metrics| controller
subgraph NodeA
probeAA[mount-probe]
probeAB[mount-probe]
end
subgraph NodeB
probeBA[mount-probe]
probeBB[mount-probe]
end
probeProvisionA[provision-probe]
probeProvisionB[provision-probe]
volumeA -.-|related| storageclassA[StorageClass A]
volumeB -.-|related| storageclassB[StorageClass B]
volumeC -.-|related| storageclassA[StorageClass A]
volumeD -.-|related| storageclassB[StorageClass B]
volumeE -.-|related| storageclassA[StorageClass A]
volumeF -.-|related| storageclassB[StorageClass B]
%% This is a workaround to make volumeA and volumeB closer.
subgraph volumeAB [ ]
volumeA
volumeB
end
style volumeAB fill-opacity:0,stroke-width:0px
%% This is a workaround to make volumeC and volumeD closer.
subgraph volumeCD [ ]
volumeC
volumeD
end
style volumeCD fill-opacity:0,stroke-width:0px
```
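
The diagram shows one mount-probe CronJob per (Node, StorageClass) pair and one provision-probe CronJob per StorageClass. The following Go sketch enumerates that set for a given cluster. It is only an illustration: `ProbeSpec` and `planProbes` are hypothetical names, and the real controller creates Kubernetes objects rather than plain structs.

```go
package main

import "fmt"

// ProbeSpec is a hypothetical description of one probe CronJob.
type ProbeSpec struct {
	Kind         string // "provision-probe" or "mount-probe"
	StorageClass string
	Node         string // empty for provision-probe, which is not tied to a Node
}

// planProbes lists the CronJobs implied by the diagram: one provision-probe
// per StorageClass and one mount-probe per (Node, StorageClass) pair.
func planProbes(nodes, storageClasses []string) []ProbeSpec {
	var specs []ProbeSpec
	for _, sc := range storageClasses {
		specs = append(specs, ProbeSpec{Kind: "provision-probe", StorageClass: sc})
		for _, node := range nodes {
			specs = append(specs, ProbeSpec{Kind: "mount-probe", StorageClass: sc, Node: node})
		}
	}
	return specs
}

func main() {
	specs := planProbes([]string{"NodeA", "NodeB"}, []string{"A", "B"})
	for _, s := range specs {
		fmt.Printf("%s storageClass=%s node=%q\n", s.Kind, s.StorageClass, s.Node)
	}
}
```

For the two Nodes and two StorageClasses in the diagram this yields four mount-probes and two provision-probes, matching the six CronJobs drawn.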

Each probe works as follows:
- provision-probe:
  1. The controller creates a provision-probe CronJob for each StorageClass.
  2. The CronJob periodically creates a provision-probe Pod.
  3. The Pod requests the creation of a Generic Ephemeral Volume via the related StorageClass.
  4. The controller monitors the Pod creation events and measures how long it takes to create the Pod.
     (This indirectly measures the time required for provisioning the volume.) Then it exposes the result as Prometheus metrics.
  5. Once the provision-probe Pod is created, it immediately exits normally.
- mount-probe:
  1. The controller creates a mount-probe CronJob and a PVC for each Node and StorageClass.
  2. The CronJob periodically creates a mount-probe Pod.
  3. If the PVC is not yet bound, the Pod requests to provision a PV via the related StorageClass. Then, the Pod mounts the PV.
  4. The controller monitors the Pod creation events and measures how long it takes to create the Pod.
     (This indirectly measures the time required for mounting the volume.) Then it exposes the result as Prometheus metrics.
  5. Once the Pod is created, it tries to read and write data from and to the PV, and measures the I/O latency. Then it posts the result to the controller and exits normally.
  6. When the controller receives the request from the mount-probe Pod, it exposes the result as Prometheus metrics.
