Automatic generation of graphs from log files
First, we describe some log files of interest, then describe the tool.
The graph engine Giraph ingests graphs in an adjacency-list format: [source_id,source_value,[[dest_id,edge_value],...]].
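For instance, a vertex with id 0 and value 0, with edges of weight 3 and 1 to vertices 1 and 2 (values here purely illustrative), would be written as:

[0,0,[[1,3],[2,1]]]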
Consider the use case of analyzing HDFS logs. Apache HDFS is a distributed file system designed to handle large data. A distributed system running algorithms on Hadoop with HDFS has an inherent graph structure produced by the block (data) transfers between nodes. For example, the first line in the figure above has source (src) and destination (dest) IP/port fields. A Python script scans each line for src and dest, and each such pair is treated as an edge from source to destination. Multiple occurrences of an edge in the log file result in a higher edge weight.
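A minimal sketch of this extraction, assuming HDFS DataNode lines containing fields such as "src: /10.250.19.102:54106 dest: /10.250.19.102:50010" (the actual hdfs-graph.py script may differ in its details):

import re
import sys
from collections import Counter

# Matches the src/dest IP:port fields in an HDFS DataNode log line.
PATTERN = re.compile(r"src:\s*/([\d.:]+)\s*,?\s*dest:\s*/([\d.:]+)")

edges = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            # Repeated (src, dest) pairs accumulate into higher edge weights.
            edges[(match.group(1), match.group(2))] += 1

for (src, dest), weight in edges.items():
    print(src, dest, weight)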
As another example, consider pcap logs. Pcap captures network traffic and saves it to a .pcap file. The IP layer of each captured packet can be used to create a meaningful graph of source IPs and destination IPs. As with HDFS logs, we assign higher edge weights to edges that occur multiple times.
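The same idea can be sketched with the scapy library (an assumption for illustration; any pcap reader would do), counting each source/destination IP pair:

from collections import Counter
from scapy.all import IP, rdpcap

packets = rdpcap("capture.pcap")  # hypothetical capture file

edges = Counter()
for packet in packets:
    if IP in packet:
        # Each captured packet adds weight 1 to its (src, dst) edge.
        edges[(packet[IP].src, packet[IP].dst)] += 1

for (src, dst), weight in edges.items():
    print(src, dst, weight)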
To generate a random log for testing, we can leverage Python's logging facility. For example, the figure above shows a snippet of a random log file, which documents the day, time, user, type of error, and the IPs in question. A random log file might use any delimiter, but certain ones are more common.
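A sketch of such a generator using the standard logging module; the field choices and comma delimiter here are hypothetical:

import logging
import random

# asctime supplies the day and time; the message carries user, error
# type, and the source/destination IPs, comma-delimited.
logging.basicConfig(filename="random.log", level=logging.INFO,
                    format="%(asctime)s,%(message)s")

users = ["alice", "bob", "carol"]
errors = ["TIMEOUT", "REFUSED", "RESET"]

for _ in range(1000):
    src = "10.0.0.%d" % random.randint(1, 20)
    dest = "10.0.0.%d" % random.randint(1, 20)
    logging.info("%s,%s,%s,%s",
                 random.choice(users), random.choice(errors), src, dest)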
We desire an approach that works on all of the above types of logs. We first developed a tool that lets the user specify which columns correspond to the nodes (the tool then infers the edges). While this yields good results, it is a far cry from our goal of automation.

We therefore also developed a rules-based approach. It first predicts which delimiter a log file uses by checking for ',', ' ', '\t', and ';', in that order. After the delimiter is predicted, we look for matching columns as follows, leveraging the fact that all nodes must be of the same "type." We compare the set of values in each column with the set of values in every other column, and check whether their set intersection, i.e., the unique elements common to the two columns, is larger than a pre-selected threshold. If it is, we select that pair of columns as nodes and generate a Giraph graph from them; we do this for every combination of columns whose intersection exceeds the threshold. In testing, this approach perfectly handled all of the log files described above, including random log files generated for IP addresses, call records, and more. The tool is packaged as a container; the commands for building and running it follow the sketch below.
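A condensed sketch of the rules-based detection, assuming a normalized intersection and a threshold of 0.5 (both illustrative; the actual implementation may differ):

import itertools
import sys

def predict_delimiter(lines):
    # Try candidate delimiters in order; pick the first one that
    # consistently splits every line into more than one column.
    for delim in [",", " ", "\t", ";"]:
        if all(len(line.split(delim)) > 1 for line in lines):
            return delim
    return None

def node_column_pairs(lines, delim, threshold=0.5):
    rows = [line.rstrip("\n").split(delim) for line in lines]
    columns = list(zip(*rows))
    pairs = []
    for i, j in itertools.combinations(range(len(columns)), 2):
        a, b = set(columns[i]), set(columns[j])
        # Node columns hold values of the same "type" (e.g. IPs), so a
        # large set intersection marks (i, j) as a candidate node pair.
        overlap = len(a & b) / min(len(a), len(b))
        if overlap > threshold:
            pairs.append((i, j))
    return pairs

lines = open(sys.argv[1]).readlines()
delim = predict_delimiter(lines)
print(delim, node_column_pairs(lines, delim))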
# Build the container and open a shell in it:
docker build -t logtogiraph .
docker run -it --entrypoint="/bin/bash" logtogiraph

# Convert an HDFS log into Giraph graphs:
python3 hdfs-graph.py <hdfs_log_file>

# Generate a random log, then convert it:
cd random-log
python3 logexample.py
python3 log-graph.py <random_log_filename>
The resulting graphs are stored in JSON files numbered from 0 to n, e.g. 0.json, 1.json, 2.json, etc.