
GoFS GML Parsing

Soonil Nagarkar edited this page Jul 5, 2013 · 15 revisions

GML Specification

The original GML specification can be found at GML Technical Report. The rest of this document assumes familiarity with the specification and will treat everything in the specification as an implicit assumption.

GML offers the following advantages for representing graphs, including time series graphs:

  • Simple
  • Portable
  • Hierarchical Key-Value Pairs
  • Extensible
  • Flexible
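
To make the "hierarchical key-value pairs" point concrete, here is a minimal sketch of a reader for the core GML grammar. It is not the GoFS parser: the class name MiniGml and its Entry type are hypothetical, and the sketch assumes well-formed input with no comments and no character escapes inside strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal GML reader: an illustrative sketch, not the GoFS parser. */
final class MiniGml {
    /** One key whose value is a Long (GML Integer), Double (GML Real), String, or nested List of entries. */
    static final class Entry {
        final String key;
        final Object value;
        Entry(String key, Object value) { this.key = key; this.value = value; }
    }

    private final String text;
    private int pos = 0;

    private MiniGml(String text) { this.text = text; }

    static List<Entry> parse(String text) {
        MiniGml parser = new MiniGml(text);
        List<Entry> entries = parser.parseList();
        if (parser.pos < text.length()) {
            throw new IllegalArgumentException("unexpected ']' at " + parser.pos);
        }
        return entries;
    }

    // Reads key-value pairs until the enclosing ']' or the end of input.
    private List<Entry> parseList() {
        List<Entry> entries = new ArrayList<>();
        while (true) {
            skipWhitespace();
            if (pos >= text.length() || text.charAt(pos) == ']') return entries;
            String key = readToken();
            skipWhitespace();
            entries.add(new Entry(key, parseValue()));
        }
    }

    private Object parseValue() {
        if (pos >= text.length()) throw new IllegalArgumentException("missing value");
        char c = text.charAt(pos);
        if (c == '[') {
            pos++; // consume '['
            List<Entry> nested = parseList();
            if (pos >= text.length()) throw new IllegalArgumentException("missing ']'");
            pos++; // consume ']'
            return nested;
        }
        if (c == '"') {
            int end = text.indexOf('"', pos + 1);
            if (end < 0) throw new IllegalArgumentException("unterminated string");
            String value = text.substring(pos + 1, end);
            pos = end + 1;
            return value;
        }
        String token = readToken();
        // Treat the token as a GML Integer if it is a whole number, otherwise as a GML Real.
        try {
            return Long.parseLong(token);
        } catch (NumberFormatException e) {
            return Double.parseDouble(token);
        }
    }

    private String readToken() {
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))
                && text.charAt(pos) != '[' && text.charAt(pos) != ']') {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("expected a token at " + pos);
        return text.substring(start, pos);
    }

    private void skipWhitespace() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    }
}
```

Calling MiniGml.parse on a document such as `graph [ directed 1 ]` yields a single entry with the key graph, whose value is itself a list of entries. This nesting is what lets the GoFS extensions described below live inside an ordinary GML file.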

GoFS Extensions

GoFS uses a set of extensions to GML as specified above to support our own custom format for time series graphs. These may be loaded into GoFS through the GMLPartition class. A set of GML files with GoFS extensions that represents a time series graph is henceforth referred to as a GML Partition. A GML Partition consists of two parts. The first is a single template GML file, which contains the structure of the graph, a list of properties associated with vertices, a list of properties associated with edges, and default values for each property, which may be set for individual vertices and edges. The second is a set of instance GML files, each of which contains an instance id and timestamps and has a structure identical to that of the template file, but specifies values for vertex and edge properties that override the defaults for that instance.

In addition, the GoFS parser is somewhat more lenient than a strict GML parser, but we encourage you to conform to the GML specification, as there is no guarantee we will not make the parser stricter at a later date. For example, our parser currently enforces neither maximum line lengths and maximum key lengths nor the exact format of keys and strings. Keep in mind that all GML files must be encoded in ISO-8859-1. The GoFS implementation of GML supports all HTML 4 character escapes; however, we recommend avoiding character escape sequences where possible, as they can slow the processing of GML files substantially.

NOTE: If the GML parser encounters a key or value longer than the buffer size, it will attempt to double its buffer size and try again until the key or value fits. The default buffer size is 8192 characters, which should be sufficient for all but the longest values. The GML standard does specify a maximum line length, so no value should be longer than that in any case, although the GoFS parser does not enforce this.
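
The doubling strategy can be illustrated with a small sketch. This is not the GoFS implementation: the class GrowingTokenReader and its readToken method are hypothetical, and the sketch grows the buffer in place while reading rather than retrying the read, which has the same effect on the final buffer size.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Illustrative sketch of a token reader whose buffer doubles when a token does not fit. */
final class GrowingTokenReader {
    static final int DEFAULT_BUFFER_SIZE = 8192;

    /** Reads characters up to the next whitespace or the end of input. */
    static String readToken(Reader in, int initialSize) {
        if (initialSize <= 0) throw new IllegalArgumentException("initialSize must be positive");
        char[] buffer = new char[initialSize];
        int length = 0;
        try {
            int c;
            while ((c = in.read()) != -1 && !Character.isWhitespace(c)) {
                if (length == buffer.length) {
                    // The token is longer than the buffer: double the capacity and keep going.
                    buffer = Arrays.copyOf(buffer, buffer.length * 2);
                }
                buffer[length++] = (char) c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer, 0, length);
    }
}
```

Because the capacity doubles each time, a token of n characters causes only about log2(n / initialSize) reallocations, so even values far longer than 8192 characters stay cheap to read.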

NOTE: Most GML-related classes accept either an InputStream or a File representing the GML to be parsed. The InputStream should be a raw input stream, as the GMLParser will take care of buffering. If a File is passed in, the file will be memory mapped and read from as a buffer. This should generally perform better for larger files.
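
For readers unfamiliar with memory mapping, the sketch below shows one way to map a file and decode it as ISO-8859-1 with the standard java.nio API. The class MappedGmlReader is hypothetical and only illustrates the idea; how GoFS maps and consumes the file internally is not shown here. For brevity the sketch decodes the whole file at once, and it is limited to files smaller than 2 GB because a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch: memory-map a GML file and decode it as ISO-8859-1. */
final class MappedGmlReader {
    static CharBuffer mapAndDecode(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // ISO-8859-1 maps every byte to exactly one character.
            return StandardCharsets.ISO_8859_1.decode(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```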

Template File

The template file conforms to the graph keys and specifications as laid out in the technical report, with the following extensions. The template file should contain the following keys as children of the graph key.

  • directed [optional] - 1 or 0 to set the directedness of the graph. If absent, the graph is assumed to be undirected.

  • vertex_properties [optional] - A list of properties associated with vertices. Within vertex properties, each property name is a key, with a list value which must contain the following keys:

    • is_static [required] - 1 or 0 to set whether a property is static or not.

    • type [required] - A string defining the type of the property. It may be one of ["string", "integer", "long", "float", "double", "boolean"], which represent the corresponding Java types. GML values (Integer, Real, String) are converted into these types as follows, where each line lists the GML types accepted for a property type.

      • string : Integer, Real, or String
      • integer : Integer
      • long : Integer
      • float : Real
      • double : Real
      • boolean : Integer, or Real (a value equal to 1.0 is true, any other value is false)
  • edge_properties [optional] - A list of properties associated with edges, with the same format as vertex_properties.

  • node [optional][repeated] - A list of properties for a particular vertex.

    • id [required] - An integer representing the vertex id, which should be unique across vertices.
    • remote [optional] - An integer identifying the partition on which this remote vertex may be found. The presence of this key marks the vertex as remote. A remote vertex must not specify any property values, as it is not actually owned by this partition.
    • any other keys [optional] - Any other keys are checked to see if they share the name of a vertex property, in which case the value of the key is treated as the default value of that property for this vertex.
  • edge [optional][repeated] - A list of properties for a particular edge.

    • id [required] - An integer representing the edge id, which should be unique across edges.
    • source [required] - An integer representing the vertex id of the source of this edge.
    • target [required] - An integer representing the vertex id of the sink of this edge.
    • any other keys [optional] - Any other keys are checked to see if they share the name of an edge property, in which case the value of the key is treated as the default value of that property for this edge.
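
The type conversion rules listed under the type key can be sketched as follows. This is an illustration under stated assumptions, not GoFS code: the class GmlTypeConversion is hypothetical, parsed GML values are assumed to arrive as Long (Integer), Double (Real), or String, and rejecting mismatched types with an exception is a choice made for the sketch.

```java
/** Illustrative sketch of the GML-to-Java type conversion rules. */
final class GmlTypeConversion {
    /** Converts a parsed GML value to the Java type named by a property's "type" string. */
    static Object convert(String type, Object gmlValue) {
        switch (type) {
            case "string":
                // Integer, Real, or String are all accepted.
                return String.valueOf(gmlValue);
            case "integer":
                return Math.toIntExact(requireInteger(type, gmlValue));
            case "long":
                return requireInteger(type, gmlValue);
            case "float":
                return (float) requireReal(type, gmlValue);
            case "double":
                return requireReal(type, gmlValue);
            case "boolean":
                // Integer or Real: a value equal to 1.0 is true, any other value is false.
                if (gmlValue instanceof Long || gmlValue instanceof Double) {
                    return ((Number) gmlValue).doubleValue() == 1.0;
                }
                throw mismatch(type, gmlValue);
            default:
                throw new IllegalArgumentException("unknown property type: " + type);
        }
    }

    private static long requireInteger(String type, Object value) {
        if (value instanceof Long) return (Long) value;
        throw mismatch(type, value);
    }

    private static double requireReal(String type, Object value) {
        if (value instanceof Double) return (Double) value;
        throw mismatch(type, value);
    }

    private static IllegalArgumentException mismatch(String type, Object value) {
        return new IllegalArgumentException("cannot convert " + value + " to " + type);
    }
}
```

With these rules, an edge default such as `property_two 1` for a "boolean" property becomes true, while the same GML Integer for an "integer" property becomes the int 1.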

Instance Files

In addition to the template file, a GML Partition includes a set of instance files. Each file has a structure similar to the template file, but it specifies the values of properties at a particular instance rather than default values. Note that although an instance file appears to use the layout specified for graphs in the technical report, an instance file is not sufficient to recover the structure of a graph. Rather, an instance file is only sufficient to provide property values for vertices and edges. An instance file should contain the following keys as children of the graph key.

  • id [required] - An integer representing the instance id of this instance.

  • timestamp_start [required] - An integer representing the timestamp for the beginning of this instance.

  • timestamp_end [required] - An integer representing the timestamp for the end of this instance.

  • node [optional][repeated] - Uses the same specification as in the template file, but property values are treated as specific values for this instance which override the default values.

  • edge [optional][repeated] - Uses the same specification as in the template file, but property values are treated as specific values for this instance which override the default values.

Example Template File

Note that while vertices and edges may have properties with the same name, these are NOT the same property and should be treated entirely separately. To avoid confusion, we recommend avoiding having vertex and edge properties with the same name.

graph [
  directed 1
  vertex_properties [
    property_one [
      is_static 0
      type "string"
    ]
    property_two [
      is_static 0
      type "integer"
    ]
  ]
  edge_properties [
    property_one [
      is_static 0
      type "string"
    ]
    property_two [
      is_static 0
      type "boolean"
    ]
  ]
  node [
    id 1
    property_one "default value 1"
    property_two 1439
  ]
  node [
    id 2
    property_one "default value 2"
  ]
  node [
    id 3
    remote 100
  ]
  edge [
    id 1
    source 1
    target 2
    property_one "default value 3"
  ]
  edge [
    id 2
    source 2
    target 3
    property_two 1
  ]
]

Example Instance File

graph [
  id 1
  timestamp_start 1035
  timestamp_end 2036
  node [
    id 1
  ]
  node [
    id 2
    property_two 9845
  ]
  edge [
    id 1
    source 1
    target 2
    property_two 0
  ]
]
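
The way instance values override template defaults can be sketched with plain maps, using vertex 2 from the example files above. The class PropertyResolution is hypothetical, and returning null when a property has neither an instance value nor a default is an assumption of this sketch, not documented GoFS behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of resolving a property value for one vertex or edge at one instance. */
final class PropertyResolution {
    /** Returns the instance value if present, otherwise the template default, otherwise null. */
    static Object resolve(Map<String, Object> templateDefaults,
                          Map<String, Object> instanceValues,
                          String property) {
        if (instanceValues.containsKey(property)) {
            return instanceValues.get(property);
        }
        return templateDefaults.get(property);
    }

    public static void main(String[] args) {
        // Vertex 2: the template sets property_one, the instance sets property_two.
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("property_one", "default value 2");
        Map<String, Object> instance = new HashMap<>();
        instance.put("property_two", 9845);

        System.out.println(resolve(defaults, instance, "property_one")); // default value 2
        System.out.println(resolve(defaults, instance, "property_two")); // 9845
    }
}
```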

Using GMLPartition In Code

GMLPartition.parseGML accepts as arguments a stream for the template file, and an Iterable of streams for the instance files. For your convenience we have provided a helper class GMLFileIterable, which wraps an Iterable of Paths and represents it as an Iterable of InputStreams. This is accomplished by using each Path to create a FileInputStream on demand. An example usage pattern is shown below.

The first argument to GMLPartition.parseGML is the id of the partition that will be created.

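// Java does not expand "~", so resolve the home directory explicitly.
Path partitionDir = Paths.get(System.getProperty("user.home"), "Partition");
InputStream templateStream = new FileInputStream(partitionDir.resolve("Template.gml").toFile());

// listFiles() returns null if the directory does not exist or cannot be read.
File[] files = partitionDir.resolve("Instances").toFile().listFiles();
List<Path> paths = new ArrayList<>();
if (files != null) {
    for (File instance : files) {
        paths.add(instance.toPath());
    }
}

GMLPartition partition = GMLPartition.parseGML(1, templateStream, new GMLFileIterable(paths));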
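
For reference, a helper that opens streams on demand, in the way GMLFileIterable is described above, might look roughly like the sketch below. LazyStreamIterable is a hypothetical stand-in written for illustration and is not the GoFS class. In this sketch the consumer is responsible for closing each stream it receives.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Iterator;

/** Illustrative sketch: presents an Iterable of Paths as an Iterable of InputStreams. */
final class LazyStreamIterable implements Iterable<InputStream> {
    private final Iterable<Path> paths;

    LazyStreamIterable(Iterable<Path> paths) {
        this.paths = paths;
    }

    @Override
    public Iterator<InputStream> iterator() {
        Iterator<Path> pathIterator = paths.iterator();
        return new Iterator<InputStream>() {
            @Override
            public boolean hasNext() {
                return pathIterator.hasNext();
            }

            @Override
            public InputStream next() {
                try {
                    // The file is opened only when the consumer asks for the next stream.
                    return new FileInputStream(pathIterator.next().toFile());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
    }
}
```

Opening files lazily keeps at most one instance file open at a time when the consumer closes each stream before requesting the next, which matters for partitions with many instance files.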