Base structure for further work on ECS (elastic#1)

This PR is intended to provide a foundation for all upcoming work on the Elastic Common Schema. It contains some example schemas, a script to generate the docs from these example schemas, and a basic README.md. All follow-up work will happen in PRs, and work can be tracked in GitHub issues.
robgil · Nov 9, 2017 · add5c46
1 parent 5b8fa6d
commit add5c46

Showing 10 changed files with 311 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/Makefile b/Makefile
@@ -0,0 +1,2 @@
+generate:
+	python generate.py
diff --git a/README.md b/README.md
@@ -1,2 +1,54 @@
-# ecs
-Elastic Common Schema
+**WARNING: THIS IS WORK IN PROGRESS**
+
+# Elastic Common Schema (ECS)
+
+These are the definitions of the Elastic Common Schema (ECS). The schemas are stored as `.yml` files in the `schemas` directory. Each namespace of the schema has its own file. These files can be used to generate the docs by running `make generate` and, later, to create an Elasticsearch template or Kibana index pattern. The format is mostly based on the `fields.yml` structure from Beats.
+
+## Rules
+
+The rules for naming and creating ECS fields will be documented here.
+
+## Docs
+
+The generated ECS documentation can be found [here](./schema.md).
+
+## Fields
+
+`fields.yml` files describe the Elastic Common Schema in a structured way. From these files, an Elasticsearch index template, a Kibana index pattern, or the documentation output can be generated automatically.
+
+The structure of each file looks as follows:
+
+```
+- namespace: agent
+  title: Agent fields
+  level: 2
+  description: >
+    The agent fields contain all the data about the agent/client/shipper that collected / generated the events.
+
+    As an example, in the case of Beats for logs, `agent.name` is `filebeat`.
+  fields:
+    - name: version
+      type: keyword
+      description: >
+        Agent version.
+      example: 6.0.0-rc2
+      phase: 0
+```
+
+Each namespace has its own file to keep the files themselves small. Each namespace contains a `fields` list with all of its fields. `title` and `description` describe the namespace. `level` is used purely for sorting in the documentation output.
+
+Each field under `fields` starts with its `name`. The `type` is the [Elasticsearch field type](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html). `description` adds details about the field itself. `example` provides an example value. `phase` indicates which phase the field is currently in (see Phases below). If `phase` is left out, it defaults to 0.
+
+## Phases
+
+The goal of the phase value for each field is to indicate whether a field is already part of the standard or not. Different phases exist to make it easy to contribute new fields while still being able to iterate on them. The phases are defined as follows:
+
+* 0 (alpha): The field is new, and it is up for discussion whether it should be added. The field might be removed again at any time.
+* 1 (beta): It's clear that there is value in having the field in ECS, and discussions about naming, namespaces, etc. have started. It's unlikely that the field will be removed again, but the naming might change at any time.
+* 2 (rc): The field has been accepted and is unlikely to change. It is now tested in the field.
+* 3 (GA): The field is part of ECS and breaking changes to it happen only on major releases.
+
+## Links
+
+* Foundation: https://docs.google.com/spreadsheets/d/1RUS-nwMLaU4U9YistexTCG3EcHzeY-L2p3V0iX0G9h0/edit#gid=1862780542
+* Beats draft: https://docs.google.com/document/d/1pmzli3x33AQbhyqxpd024ggi3r9KuAfGBFDS4vYHWw0/edit
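
The phase numbers and the default described in the README above can be sketched as a small lookup. This is only an illustration: `PHASE_LABELS` and `phase_label` are hypothetical names that do not exist in this commit, and rejecting unknown phase values is an assumption, since the README does not say how they should be handled.

```python
# Hypothetical helper, not part of this commit: maps the phase numbers
# described in the README to their labels.
PHASE_LABELS = {0: "alpha", 1: "beta", 2: "rc", 3: "GA"}


def phase_label(field):
    """Return the phase label for a field dict, defaulting to phase 0."""
    phase = field.get("phase", 0)
    if phase not in PHASE_LABELS:
        # Assumption: unknown phases are treated as an error
        raise ValueError("unknown phase: {!r}".format(phase))
    return PHASE_LABELS[phase]


# A field without a `phase` key falls back to phase 0
print(phase_label({"name": "version", "type": "keyword"}))
print(phase_label({"name": "id", "type": "keyword", "phase": 1}))
```
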
diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,18 @@
+TODO
+
+* Decide on schema name (Elastic Common Schema)
+* Decide on naming rules
+* Get in all schema fields
+* Document example of field structure
+* Document verified fields
+* How can we link solutions and format
+  * Format should not know about solutions, but link it back
+  * Create solutions pages?
+* Introduce phase key:
+  * 0: proposal for new field (default)
+  * 1: accepted as necessary field, figuring out naming
+  * 2: accepted as field, testing in field, same should stay
+  * 3: verified ecs field. all breaking changes should only happen in major versions
+* Goal of Phases: Make it easy to contribute new ideas and then iterate on top of it
+* Describe fields.yml in README
+* Level is only used for sorting, no logical meaning
diff --git a/generate.py b/generate.py
@@ -0,0 +1,82 @@
+import os
+
+import yaml
+
+if __name__ == "__main__":
+
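+    # Read all schema files and concatenate them into one YAML document
+    # (sorted so the output does not depend on the directory listing order)
+    content = ""
+    for file in sorted(os.listdir("schemas")):
+        with open(os.path.join("schemas", file)) as f:
+            content = content + f.read() + "\n"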
+
+    # Load all fields into object
+    fields = yaml.safe_load(content)
+    sortedNamespaces = sorted(fields, key=lambda field: field["level"])
+
+    # Create markdown schema output file
+    output = open("schema.md", 'w')
+
+    for namespace in sortedNamespaces:
+        output.write("# " + namespace["title"] + "\n\n")
+
+        # Replace each newline with two, as a single newline does not start a new paragraph in Markdown
+        output.write(namespace["description"].replace("\n", "\n\n") + "\n")
+
+        titles = ["Field", "Description", "Type", "Phase", "Example"]
+
+        for title in titles:
+            output.write("| {}  ".format(title))
+        output.write("|\n")
+
+        for title in titles:
+            output.write("|---")
+        output.write("|\n")
+
+        # Sort fields for easier readability
+        namespaceFields = sorted(namespace["fields"], key=lambda field: field["name"])
+
+        # Print fields into a table
+        for field in namespaceFields:
+            description = ""
+            if 'description' in field.keys():
+                # Remove all spaces and newlines from beginning and end
+                description = field["description"].strip()
+
+                # Replace newlines with HTML representation as otherwise newlines don't work in Markdown
+                description = description.replace("\n", "<br/>")
+
+            example = ""
+            if 'example' in field.keys():
+                # Convert to a string first, as YAML may parse unquoted values as numbers or dates
+                example = str(field["example"]).strip()
+
+            # Named field_type to avoid shadowing the built-in `type`
+            field_type = ""
+            if 'type' in field:
+                # Remove all spaces and newlines from beginning and end
+                field_type = field["type"].strip()
+
+            field_name = field["name"]
+
+            # Prefix if not base namespace
+            if namespace["namespace"] != "base":
+                field_name = namespace["namespace"] + "." + field_name
+
+            # Verified and accepted fields are bold
+            if field.get("verified"):
+                field_name = "**" + field_name + "**"
+
+            # The phase defaults to 0 if the field does not set it
+            phase = 0
+            if 'phase' in field:
+                phase = field["phase"]
+
+            # Write one Markdown table row per field
+            row = [field_name, description, field_type, phase, example]
+            output.write("| {}  | {}  | {}  | {}  | {}  |\n".format(*row))
+
+        output.write("\n\n")
+
+    output.close()
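
The per-field logic of `generate.py` above can also be written as a pure function, which makes it possible to check a row without reading `schemas/` or writing `schema.md`. This is a minimal sketch under that assumption, not the script's actual structure: `render_row` is a hypothetical name, and it takes already-parsed dicts so it needs no YAML library.

```python
def render_row(namespace, field):
    """Build one Markdown table row for a field dict (hypothetical helper)."""
    # Newlines inside a table cell would break the row, so use <br/> instead
    description = str(field.get("description", "")).strip().replace("\n", "<br/>")
    example = str(field.get("example", "")).strip()
    field_type = str(field.get("type", "")).strip()

    # Fields outside the base namespace are prefixed with the namespace
    name = field["name"]
    if namespace != "base":
        name = namespace + "." + name

    # Verified fields are rendered in bold
    if field.get("verified"):
        name = "**" + name + "**"

    # The phase defaults to 0 when the field does not set it
    phase = field.get("phase", 0)
    return "| {}  | {}  | {}  | {}  | {}  |".format(
        name, description, field_type, phase, example)


row = render_row("agent", {
    "name": "version",
    "type": "keyword",
    "description": "Agent version.\n",
    "example": "6.0.0-rc2",
})
print(row)
```

For the `agent.version` field this produces the same row that appears in the generated `schema.md`.
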
diff --git a/schema.md b/schema.md
@@ -0,0 +1,52 @@
+# Base
+
+The base namespace contains all fields which are on the top level without a namespace. These are fields which are common across all types of events.
+
+
+| Field  | Description  | Type  | Phase  | Example  |
+|---|---|---|---|---|
+| id  | Unique id to describe the event.  | keyword  | 1  | 8a4f500d  |
+| timestamp  | Timestamp when the event was created.<br/>For log events this is expected to be when the event was generated and not when it was read.  | date  | 1  | 2016-05-23T08:05:34.853Z  |
+
+
+# Agent fields
+
+The agent fields contain all the data about the agent/client/shipper that collected / generated the events.
+
+As an example, in the case of Beats for logs, `agent.name` is `filebeat`.
+
+
+| Field  | Description  | Type  | Phase  | Example  |
+|---|---|---|---|---|
+| agent.id  | Unique identifier of this agent if one exists.<br/>In the case of beats this would be beat.id.  | keyword  | 0  | 8a4f500d  |
+| agent.name  | Agent name.<br/>Name of the agent.  | keyword  | 0  | filebeat  |
+| agent.version  | Agent version.  | keyword  | 0  | 6.0.0-rc2  |
+
+
+# Host fields
+
+All fields related to a host. A host can be a physical machine, a virtual machine, or a Docker container.
+
+Normally the host information is related to the machine on which the event was generated / collected, but it can also be used differently if needed.
+
+
+| Field  | Description  | Type  | Phase  | Example  |
+|---|---|---|---|---|
+| host.id  | Unique host id.<br/>As hostname is not always unique, this often can be configured by the user. An example here is the current usage of `beat.name`.  | keyword  | 1  |   |
+| host.name  | Name of the host  | keyword  | 1  |   |
+| host.timezone  | Timezone of the host  | date  | 1  |   |
+
+
+# Elasticsearch fields
+
+Common fields for Elasticsearch metrics and logs
+
+
+| Field  | Description  | Type  | Phase  | Example  |
+|---|---|---|---|---|
+| elasticsearch.cluster.id  | Elasticsearch cluster id  | keyword  | 1  |   |
+| elasticsearch.cluster.name  | Elasticsearch cluster name  | keyword  | 1  |   |
+| elasticsearch.node.name  | Elasticsearch node name  | keyword  | 1  |   |
+| elasticsearch.node.version  | Elasticsearch node version  | keyword  | 1  |   |
+
+
diff --git a/schemas/agent.yml b/schemas/agent.yml
@@ -0,0 +1,28 @@
+- namespace: agent
+  title: Agent fields
+  level: 2
+  description: >
+    The agent fields contain all the data about the agent/client/shipper that collected / generated the events.
+
+    As an example, in the case of Beats for logs, `agent.name` is `filebeat`.
+  fields:
+    - name: version
+      type: keyword
+      description: >
+        Agent version.
+
+      example: 6.0.0-rc2
+    - name: name
+      type: keyword
+      description: >
+        Agent name.
+
+        Name of the agent.
+      example: filebeat
+    - name: id
+      type: keyword
+      description: >
+        Unique identifier of this agent if one exists.
+
+        In the case of beats this would be beat.id.
+      example: 8a4f500d
diff --git a/schemas/base.yml b/schemas/base.yml
@@ -0,0 +1,21 @@
+- namespace: base
+  title: Base
+  level: 1
+  description: >
+    The base namespace contains all fields which are on the top level without a namespace.
+    These are fields which are common across all types of events.
+  fields:
+    - name: id
+      type: keyword
+      description: >
+        Unique id to describe the event.
+      example: 8a4f500d
+      phase: 1
+    - name: timestamp
+      type: date
+      phase: 1
+      example: "2016-05-23T08:05:34.853Z"
+      description: >
+        Timestamp when the event was created.
+
+        For log events this is expected to be when the event was generated and not when it was read.
diff --git a/schemas/elasticsearch.yml b/schemas/elasticsearch.yml
@@ -0,0 +1,26 @@
+- namespace: elasticsearch
+  title: Elasticsearch fields
+  level: 3
+  description: >
+    Common fields for Elasticsearch metrics and logs
+  fields:
+    - name: cluster.id
+      type: keyword
+      description: >
+        Elasticsearch cluster id
+      phase: 1
+    - name: cluster.name
+      type: keyword
+      description: >
+        Elasticsearch cluster name
+      phase: 1
+    - name: node.version
+      type: keyword
+      description: >
+        Elasticsearch node version
+      phase: 1
+    - name: node.name
+      type: keyword
+      description: >
+        Elasticsearch node name
+      phase: 1
diff --git a/schemas/host.yml b/schemas/host.yml
@@ -0,0 +1,27 @@
+- namespace: host
+  title: Host fields
+  level: 2
+  description: >
+    All fields related to a host. A host can be a physical machine, a virtual machine, or a Docker container.
+
+    Normally the host information is related to the machine on which the event was generated / collected, but it can also
+    be used differently if needed.
+  fields:
+    - name: timezone
+      type: date
+      description: >
+        Timezone of the host
+      phase: 1
+    - name: name
+      type: keyword
+      description: >
+        Name of the host
+      phase: 1
+    - name: id
+      type: keyword
+      phase: 1
+      description: >
+        Unique host id.
+
+        As hostname is not always unique, this often can be configured by the user.
+        An example here is the current usage of `beat.name`.
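
Both `agent.yml` and `host.yml` use `level: 2`. Python's `sorted` is stable, so namespaces with the same level keep the order in which their files were read. One way to make the order explicit is a tie-break on the namespace name. The sketch below assumes that alphabetical order within a level is acceptable, which the commit does not specify, and it uses minimal stand-in dicts rather than the parsed schema files.

```python
# Minimal stand-ins for the parsed schema files in this commit
namespaces = [
    {"namespace": "host", "level": 2},
    {"namespace": "elasticsearch", "level": 3},
    {"namespace": "agent", "level": 2},
    {"namespace": "base", "level": 1},
]

# Sort by level first, then by namespace name to break ties
ordered = sorted(namespaces, key=lambda ns: (ns["level"], ns["namespace"]))

print([ns["namespace"] for ns in ordered])
```

With this key, the namespaces come out in the same order as the sections in `schema.md`: base, agent, host, elasticsearch.
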