-
Notifications
You must be signed in to change notification settings - Fork 11
GoFS GML Parsing
The original GML specification can be found at GML Technical Report. The rest of this document assumes familiarity with the specification and will treat everything in the specification as an implicit assumption.
GML offers the following advantages in representing graphs, and time series graphs:
- Simple
- Portable
- Hierarchical Key-Value Pairs
- Extensible
- Flexible
GoFS uses a set of extensions to GML as specified above to support our own custom format for time series graphs. These may be loaded into GFS through the use of the GMLPartition class. A set of GML files with GoFS extensions to represent a time series graph is hence forth referred to as a GML Partition. A GML Partition consists first of a single template GML file, which contains the structure of the graph, a list of properties associated with vertices, a list of properties associated with edges, and default values for each property which may be set for individual vertices and edges. A GML Partition consists second of a set of instance GML files, which contains the instance id, timestamps, and contains a structure identical to that found in the template file, but which specifies specific values for vertex properties and edge properties for each instance which will override the defaults.
In addition, the GoFS parser is a bit more flexible than an exact GML parser, but we encourage you to conform to the GML specifications, as there is no guarantee we will not restrict the GML parser at a later date. As an example, our parser currently does not enforce maximum line lengths and maximum key lengths, nor the exact format of keys and strings. The GoFS implementation of GML supports all HTML 4 character escapes, but as GoFS supports UTF8, this is usually unnecessary. Unescaping HTML character escapes is a relatively expensive operation, so to retain performance, all character escapes are performed lazily, and only when the string value in question is actually accessed. Reading in GML and then writing the same back out will not incur an unescape/escape penalty. However, writing out hand-constructed GML will incur an escape penalty if there are characters that need to be escaped.
There are a couple differences from the specification as detailed here:
- No limitation on key length, value length, or line length.
- We support 64 bit signed long values instead of 32 bit integers.
- GML is decoded/encoding from/to UTF-8, NOT ISO-8859-1 as the GML specification states. UTF-8 is NOT a complete superset of ISO-8859-1, but for the ASCII subset of UTF-8, it is a superset. Thus, GoFS implicitly supports ISO-88591-1 as long as the character range is limited to ASCII. Given that the vast majority of documents these days are encoded with UTF-8 however, this design decision means a lot less translating of characters outside ISO-8859-1. As a result, HTML character escapes are only necessary for the ampersand ('&' -> '&') and quote ('"' -> '"') characters, characters outside the range of ISO-8858-1 can be properly encoded in UTF8.
NOTE: If the GML parser encounters a key or value longer than the buffer size, it will attempt to double it's buffer size and try again until the key or value fits. The default buffer size is 8192 characters, which should be sufficient for all but the longest values. The GML standard does enforce a line length, so no value should be longer in than this line length in any case, although this is not enforced.
The template file conforms to the graph keys and specifications as laid out in the technical report, with the following extensions. The template file should contain the following keys as children of the graph key.
-
directed [optional] - 1 or 0 to set the directedness of the graph. If not found will be assumed undirected.
-
vertex_properties [optional] - A list of properties associated with vertices. Within vertex properties, each property name is a key, with a list value which must contain the following keys:
-
is_static [required] - 1 or 0 to set whether a property is static or not.
-
type [required] - A string defining the type of the property, may be one of ["string", "integer", "long", "float", "double", "boolean", "list"] which represent the corresponding Java types. GML types (Integer, Real, String) are converted into one of these types as follows.
- string : Integer, Real, or String
- integer : Integer
- long : Integer
- float : Real, Integer
- double : Real, Integer
- boolean : Real, Integer (a value equal to 1 is true, any other value is false)
- list : A sequence of key/value pairs, where the values are used as list elements, and keys are ignored
-
-
edge_properties [optional] - A list of properties associate with edges, with same format as vertex properties.
-
node [optional][repeated] - A list of properties for a particular vertex.
- id [required] - An integer representing the vertex id, which should be unique across vertices.
- remote [optional] - An integer representing the partition this remote vertex may be found on. This specifies that this vertex is remote, and may not have any property values specified, as it is not actually owned by this partition.
- any other keys [optional] - Any other keys are checked to see if they share the name of a vertex property, in which case the value of the key is treated as the default value of that property for this vertex.
-
edge [optional][repeated] - A list of properties for a particular edge.
- id [required] - An integer representing the edge id, which should be unique across edges.
- source [required] - An integer representing the vertex id of the source of this edge.
- target [required] - An integer representing the vertex id of the sink of this edge.
- any other keys [optional] - Any other keys are checked to see if they share the name of an edge property, in which case the value of the key is treated as the default value of that property for this edge.
In addition to the template file, a GML Partition is formed through a set of instance files. Each file has a similar structure to the template file, but it specifies the values of various properties at a specified instance, rather than default values. It should be noted that while the instance file appears to use a similar layout as specified for graphs in the technical report, an instance file is not sufficient to recover the structure of a graph. Rather , an instance file is only sufficient for providing values for properties to vertices and edges. An instance file should contain the following keys as children of the graph key.
-
id [required] - An integer representing the instance id of this instance.
-
timestamp_start [required] - An integer representing the timestamp for the beginning of this instance.
-
timestamp_end [required] - An integer representing the timestamp for the end of this instance.
-
node [optional][repeated] - Uses the same specification as in the template file, but property values are treated as specific values for this instance which override the default values.
-
edge [optional][repeated] - Uses the same specification as in the template file, but property values are treated as specific values for this instance which override the default values.
Note that while vertices and edges may have properties with the same name, these are NOT the same property and should be treated entirely separately. To avoid confusion, we recommend avoiding having vertex and edge properties with the same name. The example below contains a list value, which evaluates to the list [1, 2, 3].
graph [
directed 1
vertex_properties [
property_one [
is_static 0
type "string"
]
property_two [
is_static 0
type "integer"
]
]
edge_properties [
property_one [
is_static 0
type "list"
]
property_two [
is_static 0
type "boolean"
]
]
node [
id 1
property_one "default value 1"
property_two 1439
]
node [
id 2
property_one "default value 2"
]
node [
id 3
remote 100
]
edge [
id 1
source 1
target 2
property_one [ value 1 value 2 value 3 ]
]
edge [
id 2
source 2
target 3
property_two 1
]
]
graph [
id 1
timestamp_start 1035
timestamp_end 2036
node [
id 1
]
node [
id 2
property_two 9845
]
edge [
id 1
source 1
target 2
property_two 0
]
]
GMLPartition.parseGML accepts as arguments a stream for the template file, and an Iterable of streams for the instance files. For your convenience we have provided a helper class GMLFileIterable, which wraps an Iterable of Paths and represents it as an Iterable of InputStreams. This is accomplished by using each Path to create a FileInputStream on demand. An example usage pattern is shown below.
The first argument to GMLPartition.parseGML is the id of the partition that will be created.
InputStream templateStream = new FileInputStream(new File("~/Partition/Template.gml"));
File files[] = new File("~/Partition/Instances/").listFiles();
List<Path> paths = new ArrayList<>(files.length);
for (File instance : files) {
paths.add(Paths.get(instance.toURI()));
}
GMLPartition partition = GMLPartition.parseGML(1, templateStream, new GMLFileIterable(paths));