Pod log forwarding to logit.io has been enabled within each cluster.
Filebeat runs on each node and monitors for pods with the annotation "logit.io/send: true". Once identified, logs are sent to the cluster BEATS_URL, which is stored in the corresponding cluster key vault.
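For illustration, a minimal sketch of the annotation on a Deployment's pod template; the name, labels and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                   # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        logit.io/send: "true"    # filebeat picks this up and ships the pod's logs to BEATS_URL
    spec:
      containers:
        - name: my-app
          image: my-app:latest   # placeholder
```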
Services that use terraform-modules can enable logit.io logging by adding "enable_logit: true" to their app environments.
Enabling logit via the terraform module also disables sending logs to the Log Analytics workspace for that environment, as the module will also add the annotation "fluentbit.io/exclude: true".
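A minimal sketch of the corresponding environment configuration entry; the surrounding file and keys depend on the service, so only the flag itself is shown:

```yaml
# Illustrative app environment config consumed by the terraform application module
enable_logit: true   # ships pod logs to logit.io and adds fluentbit.io/exclude: true
```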
The account "Teacher Services UK" was created by Digital tools support, with the help of the Teacher services finance team to input the payment details.
The UK region must be selected at account creation time as all the ELK stacks created in the account will be in this region, and this cannot be changed later.
Digital tools support adds the users to the account. Request using the service portal.
We created 3 subscriptions for logs, one for each Azure subscription:
- TEACHER SERVICES CLOUD DEVELOPMENT:
    - For testing with dev clusters
    - Plan: cheap plan for testing
- TEACHER SERVICES CLOUD TEST:
    - For the platform-test and test clusters
    - Plan: enough daily volume for the apps on the test cluster, but low retention
- TEACHER SERVICES CLOUD PRODUCTION:
    - For the production cluster
    - Plan: enough daily volume for the apps on the production cluster and 30 days retention
To create a new stack:
- Discuss the cost with the Teacher services finance team and if required get approval from the deputy director
- Members of the Administrators team can create a stack
- Select ADD STACK
- Select the plan
- Choose monthly billing
- Rename to: Teacher Services Cloud <Environment>
- Add to plan: Logs
- Set daily volume and retention
- Click ADD SUBSCRIPTION
- Configure logstash inputs
- Copy the beats-SSL endpoint and remove any other input
- Add the beats-SSL endpoint as keyvault secret "BEATS-URL" to the corresponding AKS cluster keyvault
- Delete extra indices
- Run terraform-kubernetes-apply for the cluster or clusters
- Annotate pods with logit.io/send: "true" to ship their logs. Use the enable_logit variable for applications deployed with the application module.
- Refresh the index pattern
We have enabled Logit stack alerts and notifications (elastalert). Each stack has a monitor for too many logs per hour and for no logs in 30 minutes. When triggered, an email alert is sent to the TS Infra team email address, and we should investigate why logs are excessive or missing. It re-alerts every 3 hours until the issue is resolved.
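For reference, a hedged elastalert-style sketch of the "no logs in 30 minutes" monitor; the rule name, index pattern and email address are placeholders, and the real rules live in the Logit stack alert settings:

```yaml
name: no-logs-in-30-minutes      # placeholder rule name
type: flatline                   # fires when fewer than `threshold` events arrive in `timeframe`
index: "*-*"                     # placeholder index pattern
threshold: 1
timeframe:
  minutes: 30
realert:
  hours: 3                       # matches the 3-hourly re-alert behaviour
alert:
  - email
email:
  - "ts-infra-team@example.com"  # placeholder for the TS Infra team address
```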
Filebeat sends logs to logstash as JSON so they can be decoded to create fields in Elasticsearch and queried with Kibana.
We also ask all the applications deployed to the cluster to log using JSON output. The filebeat log contains a `message` field that we decode using the logstash pipeline, and the new fields are stored under the `app` key.
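As an illustration (the log line and values are made up), a JSON log such as the one in the comment below ends up as nested fields under `app`, which the pipeline then renames to ECS fields:

```yaml
# Hypothetical filebeat `message` value emitted by an app logging JSON:
#   {"name":"my-app","level":"info","payload":{"method":"GET","path":"/healthcheck","status":200}}
# After the json filter runs with target => "app", the event gains:
app:
  name: my-app
  level: info
  payload:
    method: GET        # renamed to [http][request][method] by the ECS mutate filters below
    path: /healthcheck # renamed to [url][path]
    status: 200        # renamed to [http][response][status_code]
```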
The logstash pipeline is stored here and must be kept up-to-date on all the stacks. It decodes the ingress controller logs so we can observe the HTTP traffic details.
Standard ECS fields are used as much as possible. This allows a single point of reference, correlation between different event types and reuse of queries and dashboards.
filter {
### Ingress controller logs ###
if [kubernetes][deployment][name] == "ingress-nginx-controller" {
# Container standard out stream
if [stream] == "stdout" {
# Decode message field
grok {
match => { "message" => ["%{IPORHOST:[source][ip]} - %{DATA:[url][username]} \[%{HTTPDATE:[ingress][time]}\] \"%{WORD:[http][request][method]} %{DATA:[url][original]} HTTP/%{NUMBER:[http][version]}\" %{NUMBER:[http][response][status_code]} %{NUMBER:[http][response][body][bytes]} \"%{DATA:[http][request][referrer]}\" \"%{DATA:[ingress][agent]}\" %{NUMBER:[http][request][bytes]} %{NUMBER:[ingress][request_time]} \[%{DATA:[ingress][proxy][upstream][name]}\] \[%{DATA:[ingress][proxy][alternative_upstream_name]}\] %{NOTSPACE:[ingress][upstream][addr]} %{NUMBER:[ingress][upstream][response][length]} %{NUMBER:[ingress][upstream][response][time]} %{NUMBER:[ingress][upstream][status]} %{NOTSPACE:[http][request][id]}"] }
# Debug: Comment this line to keep the original message
remove_field => "message"
}
# Use time from ingress access log as log @timestamp
date {
match => [ "[ingress][time]", "dd/MMM/YYYY:H:m:s Z" ]
remove_field => "[ingress][time]"
}
# Parse User agent into ECS fields
useragent {
source => "[ingress][agent]"
ecs_compatibility => "v8"
remove_field => "[ingress][agent]"
}
# Use geoip to find location of IP address
# If the field ends with [ip], the filter will use the parent (here [source]) as a target
geoip {
source => "[source][ip]"
ecs_compatibility => "v8"
}
# Strip query strings as there may be personal data
mutate {
gsub => ["[url][original]", "\?.*", "?<QUERY STRING STRIPPED>"]
}
mutate {
gsub => ["[http][request][referrer]", "\?.*", "?<QUERY STRING STRIPPED>"]
}
}
# Container standard error stream
else if [stream] == "stderr" {
# Decode message field
grok {
match => { "message" => ["%{DATA:[ingress][time]} \[%{DATA:[log][level]}\] %{NUMBER:[ingress][pid]}#%{NUMBER:[ingress][tid]}: (\*%{NUMBER:[ingress][connection_id]} )?%{GREEDYDATA:[ingress][message]}"] }
# Debug: Comment this line to keep the original message
remove_field => "message"
}
# Use time from ingress error log as log @timestamp
date {
match => [ "[ingress][time]", "YYYY/MM/dd H:m:s" ]
remove_field => "[ingress][time]"
}
# Recreate message field
mutate {
rename => { "[ingress][message]" => "message" }
}
}
}
### Other logs ###
# If message looks like json, decode it and store under the app key
else if [message] =~ /^{.*}/ {
json {
source => "message"
target => "app"
# Debug: Comment this line to keep the original message
remove_field => ["message"]
}
# Remove stack trace for 404 errors in rails apps, as it is large and adds no value
if [app][exception][name] == "ActionController::RoutingError" {
mutate {
remove_field => "[app][exception][stack_trace]"
}
}
# Encode HTTP params as json string to avoid indexing thousands of fields
json_encode {
source => "[app][payload][params]"
target => "[app][payload][params_json]"
# Debug: Comment this line to keep the original object
remove_field => "[app][payload][params]"
}
# current_user_id may be a number or a UUID. Enforce string type. Used by ECF and NPQ
if [app][payload][current_user_id] {
mutate {
convert => { "[app][payload][current_user_id]" => "string" }
}
}
# Standardise field names with ECS: https://www.elastic.co/guide/en/ecs/current/index.html
## Ruby apps log mutate start
mutate {
rename => { "[app][payload][status]" => "[http][response][status_code]" }
}
mutate {
rename => { "[app][payload][method]" => "[http][request][method]" }
}
mutate {
rename => { "[app][payload][format]" => "[http][response][mime_type]" }
}
mutate {
rename => { "[app][payload][path]" => "[url][path]" }
}
## Ruby apps log mutate end
## .Net apps log mutate start
mutate {
rename => { "[app][Method]" => "[http][request][method]" }
}
mutate {
rename => { "[app][StatusCode]" => "[http][response][status_code]" }
}
mutate {
rename => { "[app][RequestId]" => "[http][request][id]" }
}
mutate {
rename => { "[app][RequestPath]" => "[url][path]" }
}
## .Net apps log mutate end
}
}
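To make the access-log grok concrete, here is a made-up nginx ingress log line (wrapped over two comment lines below) and a selection of the fields the pipeline derives from it; the values are illustrative only:

```yaml
# Made-up access log line from the ingress-nginx controller (stdout stream):
#   203.0.113.7 - - [01/Feb/2024:10:15:30 +0000] "GET /healthcheck HTTP/1.1" 200 612
#   "-" "curl/7.88.1" 75 0.004 [default-my-app-3000] [] 10.1.2.3:3000 612 0.004 200 3f1a2b
# A selection of the resulting fields after the grok, date and useragent filters:
source:
  ip: 203.0.113.7
http:
  version: "1.1"
  request: { method: GET, bytes: 75, id: 3f1a2b }
  response: { status_code: 200, body: { bytes: 612 } }
url:
  original: /healthcheck
ingress:
  request_time: 0.004
  proxy: { upstream: { name: default-my-app-3000 } }
  upstream: { addr: "10.1.2.3:3000", status: 200, response: { length: 612, time: 0.004 } }
user_agent:
  name: curl                                # from the useragent filter applied to [ingress][agent]
"@timestamp": "2024-02-01T10:15:30.000Z"    # set by the date filter from the access log time
```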
The following tool is useful for debugging Grok expressions: https://grokconstructor.appspot.com/
- Multiple Upstream Responses
- URI Too Long
When logs are ingested and contain new fields, it may be necessary to refresh the index pattern, as non-indexed fields cannot be queried. You can see that a field is not indexed if there is a warning sign next to it in the log.
- Go to Kibana (from the dashboard, click LAUNCH LOGS)
- From the left menu select Dashboards Management
- Select Index patterns
- Select *-*
- Click the "Refresh field list" icon
- The number of fields should change
We get an alert when one or more of our services sends more data to a logit stack than the plan allows. Usually this is because one of the services is generating an excessive number of log messages. To determine which service:
- Go to Kibana (from the dashboard, click LAUNCH LOGS) for the affected logit stack.
- In the left-hand panel of the search page, find the field kubernetes.deployment.name.
- Click the magnifying glass icon next to the field; a popup will show the top 5 values.
- Message the developers of the service with the top value and ask them to investigate why it is generating an excessive number of messages.
An index mapping is created based on the field types of all the Elasticsearch indices (there is one per day). If a field has a different type in 2 different indices, it creates a mapping conflict and logs may be rejected. Rejected logs are stored in the dead letter queue.
To see which fields are in conflict:
- In Kibana, open the left menu
- Select Dashboards Management
- Select Index patterns
- Select *-*
- There will be a warning message. For more details, select conflict in the dropdown menu and it will show which fields are in conflict.
- For each one, you can see which index has each type. The first log of the day determines the type of the field for the whole day.
To fix a conflict, make sure all the logs send the right field types, then delete the indices with the wrong type, or contact Logit.io support to reindex the logs.
Logs collected by filebeat are stored in a daily index filebeat-<date>. Other indices may be created with different fields and may cause mapping conflicts.
- In Kibana, select Dev Tools in the left menu
- List indices: GET /_cat/indices/
- Delete the opensearch-sap-log-types-config index: DELETE /.opensearch-sap-log-types-config
- Delete logstash indices: DELETE /logstash*