VAST 2020.12.16

@dominiklohmann dominiklohmann released this 16 Dec 20:54
1380a1b

We're happy to announce the monthly release 2020.12.16 of VAST. To ship a release right before Christmas, we skipped the VAST release in November in favor of a now 50% larger release, packed with a bunch of features and fixes.

Big thanks go out to Andreas Herz and Sascha Steinbiss from DCSO for providing invaluable feature feedback, performance numbers, and moving VAST closer towards general availability in Debian buster.

Learnings from Production-grade Deployments

For the months of November and December, we focused on bringing VAST into a production-ready state. The meat of the changes covers performance improvements, stability improvements, and deployment streamlining.

FlatBuffers Table Slices

Table slices, VAST's internal representation of batches of events, have received a major overhaul. We previously refactored the definition of persistent state as FlatBuffers, and in this release we continue to push the "builder pattern": this means that we create the data in a well-defined binary layout via FlatBuffers to later enjoy the benefits of direct memory-mapping at query time. VAST's store currently defines a Feather-like on-disk format, concatenating table slices into segments. At query time, we memory-map them and have random access to the data, thanks to Arrow and FlatBuffers. Additionally, this release enables versioning of table slice encodings: we can now update existing or add new encodings without introducing breaking changes.

Data Model Streamlining

The port type is no longer a first-class type. The new way to represent transport-layer ports relies on the basic type count instead. In the schema, VAST ships with a new alias type port = count to keep existing schema definitions intact. This marks the first step in an effort to add more semantics to types via composition and aliasing. In the medium term, we plan to roll out more domain-specific aliases to improve domain-specific reasoning.

However, this is a breaking change because the on-disk format and Arrow data representation changed. Queries with :port type extractors no longer work. Similarly, the syntax 53/udp no longer exists; use count syntax 53 instead. Since most port occurrences do not carry a known transport-layer type, and the type information typically exists in a separate field, removing port as a native type streamlines the data model. In the future, you will again be able to query all fields of type port once type aliases can be queried using the :T syntax.
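
Saved queries that still use the old port literals can be rewritten mechanically. The following Python sketch is not part of VAST: the helper name rewrite_port_literals, the field name dest_port, and the list of protocol suffixes are assumptions made for illustration. It only handles literals of the form 53/udp; queries that rely on :port type extractors need manual attention.

```python
import re

# Assumption: old-style port literals look like "<number>/<protocol>".
# The protocol list is illustrative and not taken from the VAST grammar.
_PORT_LITERAL = re.compile(r"\b(\d{1,5})/(?:tcp|udp|icmp)\b")


def rewrite_port_literals(query: str) -> str:
    """Replace old-style port literals such as 53/udp with plain counts."""
    return _PORT_LITERAL.sub(r"\1", query)


# Hypothetical query using a Suricata-style field name.
print(rewrite_port_literals("dest_port == 53/udp || dest_port == 443/tcp"))
```

Subnet literals such as 10.0.0.0/8 are left untouched because the pattern requires a protocol name after the slash.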

The type registry now correctly handles changes in schemas that are not backwards compatible, i.e., renamed fields or changed types of existing fields, and warns when detecting such a change.

Import processes now always use the most recent version of a type that is available, and no longer require the server process to restart so that new versions of types are picked up. This makes for a much smoother experience in the presence of schema evolution.

Index Stability and Performance

For this release, we focused on ironing out issues that came to light during our tests in preparation for upcoming large-scale deployments. In these tests, we ran VAST for several days or weeks, importing tens of thousands of events per second of Suricata data. As expected, we discovered several issues after pushing past the limits we can test continuously during development.

We observed excessive memory usage, growing up to hundreds of gigabytes of RAM for databases in the multi-terabyte range. Not only was providing a machine with this amount of memory a challenge, but this also caused near hour-long restart times, since the meta index has to be rebuilt on every restart.

The overall memory usage of bloom filters in meta index synopses was reduced by introducing an additional buffering step and rewriting our bloom filters with optimal parameters when finalizing partitions. As it turns out, most of our bloom filters were very pessimistically sized, so this reduced the startup time and cut overall memory usage by up to 95%!
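
To give a feel for why sizing matters, here is a minimal Python sketch of the textbook formulas for bloom filter parameters. The release notes do not say which formulas or false-positive rate VAST uses, so the function name, the 1% rate, and the element counts below are assumptions for illustration only.

```python
import math
from typing import Tuple


def optimal_bloom_parameters(n: int, p: float) -> Tuple[int, int]:
    """Return (m, k): bits and hash functions for n items at false-positive rate p."""
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    # Standard results: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


# Hypothetical numbers: a filter sized up front for one million values
# versus one sized for the 50,000 values a partition actually contained.
pessimistic_bits, _ = optimal_bloom_parameters(1_000_000, 0.01)
actual_bits, _ = optimal_bloom_parameters(50_000, 0.01)
print(f"pessimistic: {pessimistic_bits} bits, resized: {actual_bits} bits")
print(f"savings: {1 - actual_bits / pessimistic_bits:.0%}")
```

Because the required number of bits grows linearly with the number of elements, a filter that is resized once the true element count is known shrinks in proportion to the original overestimate.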

On the export side, the index again proved to be a source of trouble: running too many parallel queries could crash the server process. False positives from the meta index bloom filter caused index workers to stop working on queries, which made the index deadlock. Queries for a limited number of events did not always drop further results when the query finished early, leaving behind zombie index workers that slowed down the whole system.

In addition to fixing all of the above, we also introduced a new string synopsis in the meta index and reworked the logic to pre-select the number of partitions, making string (in)equality queries up to 30x faster.

Taxonomies Update

The previous release introduced concepts. This release rounds off the taxonomy specification with models. Taxonomies offer a unified access layer to represent domain knowledge. Concepts abstract away the naming differences of different data formats, and models now make it possible to define domain-specific entities that are tuples, such as a network connection. For example, you can now query for network connections like this:

net.connection == <1.2.3.4, _, 4.3.2.1, _, _>

The model net.connection defines a 5-tuple of a given format. The query translates into a product of concepts, each of which resolves to the format-specific fields in a recursive process.
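
The resolution process can be pictured with a small Python sketch. This is not VAST's implementation: the concept names, the field mappings, and the choice to render the product as a conjunction of per-concept disjunctions are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical taxonomy; these names are not VAST's shipped definitions.
CONCEPTS: Dict[str, List[str]] = {
    "net.src.ip": ["suricata.flow.src_ip", "zeek.conn.id.orig_h"],
    "net.src.port": ["suricata.flow.src_port", "zeek.conn.id.orig_p"],
    "net.dst.ip": ["suricata.flow.dest_ip", "zeek.conn.id.resp_h"],
    "net.dst.port": ["suricata.flow.dest_port", "zeek.conn.id.resp_p"],
    "net.proto": ["suricata.flow.proto", "zeek.conn.proto"],
}
MODELS: Dict[str, List[str]] = {
    "net.connection": ["net.src.ip", "net.src.port", "net.dst.ip",
                       "net.dst.port", "net.proto"],
}


def resolve(name: str, seen: Tuple[str, ...] = ()) -> List[List[str]]:
    """Expand a model or concept into one list of concrete fields per tuple position."""
    if name in seen:
        raise ValueError(f"cyclic definition involving {name}")
    if name in MODELS:
        positions: List[List[str]] = []
        for part in MODELS[name]:  # recurse into each component
            positions.extend(resolve(part, seen + (name,)))
        return positions
    if name in CONCEPTS:
        return [CONCEPTS[name]]
    raise KeyError(f"unknown model or concept: {name}")


def expand(model: str, values: List[str]) -> str:
    """Turn a model tuple into a query over concrete fields; "_" means no constraint."""
    positions = resolve(model)
    if len(positions) != len(values):
        raise ValueError("tuple arity does not match the model")
    conjuncts = []
    for fields, value in zip(positions, values):
        if value == "_":
            continue
        conjuncts.append("(" + " || ".join(f"{f} == {value}" for f in fields) + ")")
    return " && ".join(conjuncts)


print(expand("net.connection", ["1.2.3.4", "_", "4.3.2.1", "_", "_"]))
```

Each tuple position maps to one concept, and each concept fans out to the fields of the individual data formats, so a single model query covers every format that defines the concepts.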

To implement this change, we needed a new “meta query” capability: we added a new attribute extractor called #field that matches on the name of a field. For example, #field == "src_ip" returns all events whose layout contains a record field named src_ip. The model resolution process uses this extractor internally, but it is now available for general use as well.
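
The changelog describes #field as using suffix matching for field names at the layout level. A rough Python sketch of that idea follows; it assumes that matching works on whole dot-separated name components and that a layout can be treated as a list of flattened field names. Both are assumptions for illustration, and the function and layout names are hypothetical.

```python
from typing import List


def field_matches(field_name: str, pattern: str) -> bool:
    """True if the dot-separated pattern is a suffix of the dot-separated field name."""
    name_parts = field_name.split(".")
    pattern_parts = pattern.split(".")
    # Compare the trailing components only; a longer pattern never matches.
    return name_parts[-len(pattern_parts):] == pattern_parts


def layout_has_field(layout: List[str], pattern: str) -> bool:
    """True if any flattened field name in the layout matches the pattern."""
    return any(field_matches(name, pattern) for name in layout)


# Hypothetical layout with flattened field names.
flow_layout = ["suricata.flow.src_ip", "suricata.flow.dest_ip", "suricata.flow.proto"]
print(layout_has_field(flow_layout, "src_ip"))
print(layout_has_field(flow_layout, "flow.src_ip"))
print(layout_has_field(flow_layout, "ip"))
```

Matching on whole components means that a pattern such as ip does not accidentally select src_ip, while a more qualified pattern such as flow.src_ip still works.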

To simplify taxonomy introspection of a running VAST instance, the new vast dump [concepts|models] command prints a list of registered concepts and models. See vast dump help for more information.

Changelog Highlights

As always, you can find the full technical scoop in our changelog.

⚡️ Breaking Changes

  • The on-disk format for table slices now supports versioning of table slice encodings. This breaking change makes it so that adding further encodings or adding new versions of existing encodings is possible without breaking again in the future. #1143 #1157 #1160 #1165
  • CAF-encoded table slices no longer exist. As such, the option vast.import.batch-encoding now only supports arrow and msgpack as arguments. #1142
  • The port type is no longer a first-class type. The new way to represent transport-layer ports relies on count instead. In the schema, VAST ships with a new alias type port = count to keep existing schema definitions intact. However, this is a breaking change because the on-disk format and Arrow data representation changed. Queries with :port type extractors no longer work. Similarly, the syntax 53/udp no longer exists; use count syntax 53 instead. Since most port occurrences do not carry a known transport-layer type, and the type information typically exists in a separate field, removing port as a native type streamlines the data model. #1187
  • The build configuration of VAST received a major overhaul. Inclusion of libvast in other projects via add_subdirectory(path/to/vast) is now easily possible. The names of all build options were aligned, and the new build summary shows all available options. #1175

⚠️ Changes

  • Installed schema definitions now reside in <datadir>/vast/schema/types, taxonomy definitions in <datadir>/vast/schema/taxonomy, and concept definitions in <datadir>/vast/schema/concepts, as opposed to them all being in the schema directory directly. When overriding an existing installation, you may have to delete the old schema definitions by hand. #1194
  • The Suricata schemas received an overhaul: there now exist vlan and in_iface fields in all types. In addition, VAST ships with new types for ikev2, nfs, snmp, tftp, rdp, sip and dcerpc. The tls type gets support for the additional sni and session_resumed fields. #1237 #1176 #1180 #1186 @satta
  • VAST does not produce metrics by default any more. The option --disable-metrics has been renamed to --enable-metrics accordingly. #1137
  • VAST now listens on port 42000 instead of letting the operating system choose the port if the option vast.endpoint specifies an endpoint without a port. To restore the old behavior, set the port to 0 explicitly. #1170
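
The new default for vast.endpoint can be illustrated with a short Python sketch. It is not VAST's parser: the function name parse_endpoint is hypothetical, and the sketch assumes a plain host[:port] string without IPv6 literals.

```python
from typing import Tuple

DEFAULT_PORT = 42000  # default named in the release notes


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split "host[:port]" into (host, port), falling back to the default port."""
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        # No port given: use the fixed default instead of an OS-chosen port.
        return endpoint, DEFAULT_PORT
    # Port 0 is passed through, which asks the operating system to choose.
    return host, int(port)


print(parse_endpoint("localhost"))
print(parse_endpoint("localhost:0"))
print(parse_endpoint("10.0.0.1:5158"))
```

Passing port 0 explicitly keeps the previous behavior available for setups that rely on an ephemeral port.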

🧬 Experimental Features

  • The expression language gained support for the #field meta extractor. It is the complement for #type and uses suffix matching for field names at the layout level. #1228
  • The query language now supports models. Models combine a list of concepts into a semantic unit that can be fulfilled by an event. An event fulfills a model if its type contains a field for every concept in the model. Turn to the documentation for more information. #1185 #1228
  • VAST now ships with its own taxonomy and basic concept definitions for Suricata, Zeek, and Sysmon. #1135 #1150

🎁 Features

  • The storage required for indexing IP addresses has been optimized. This should result in significantly reduced memory usage over time, as well as faster restart times and reduced disk space requirements. #1172 #1200 #1216
  • Low-selectivity string (in)equality queries now run up to 30x faster, thanks to more intelligent selection of relevant index partitions. #1214
  • The new dump command prints configuration and schema-related information. The initial implementation allows for printing all registered concepts as JSON via vast dump concepts. The new flag vast.dump.yaml or vast dump --yaml switches to YAML output. #1196
  • The new option vast.client-log-file enables client-side logging. By default, VAST only writes log files for the server process. #1132

🐞 Bug Fixes

  • vast import no longer stalls when it doesn't receive any data for more than 10 seconds. #1136
  • The vast status command does not collect status information from sources and sinks any longer. They were often too busy to respond, leading to a long delay before the command completed. #1234
  • The index no longer crashes when too many parallel queries are running. #1210
  • The index now correctly drops further results when queries finish early, thus improving the performance of queries for a limited number of events. #1209
  • The type registry now detects and handles breaking changes in schemas, e.g., when a field type changes or a field is dropped from a record. #1195
  • VAST no longer starts if the specified config file does not exist. #1147