
FreeEed architecture

Mark Kerzner edited this page Mar 16, 2016 · 3 revisions

FreeEed architecture (overview for developers)

Processing

  1. Staging. FreeEed packages all project files that are set up for eDiscovery into zip files. This serves a double purpose: processing works on copies of the files rather than the originals, and the same zip files are distributed to the individual Hadoop mappers.
  2. Each mapper gets one zip file from those created in staging. It mounts this zip file as a file system using the TrueZip library and processes one file at a time. Periodically it reports back to the job master to signal that it is alive (this is only needed for large files, which take more than 10 minutes to process). The mapper extracts text and metadata, and performs OCR and generates PDF images if requested. It then packages all of that into a map and emits the map. The key is the hash signature of the file, which results in automatic deduplication.
  3. The reducer loops through the keys and outputs all files for the same key (dupes) next to each other, marking the first one as the master and the others as duplicates. The reducer produces two outputs: the regular text output (this is the load file) and the output zip file with the native files, text, and PDFs. FreeEed sets the number of reducers to one. Normally this is an anti-pattern, but FreeEed takes care to do all heavy processing in the mapper, so the reducer only writes and is not a bottleneck. If you run directly on the cluster from the command line, you can set the number of reducers to more than one. You will then have to take care of the straight-through numbering yourself, because each reducer starts numbering from 1.
  4. While FreeEed is processing, it sends the search index results to a SOLR server, which you have to provide. It also creates a project in FreeEedUI (a web application described below).
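
The hash-keyed deduplication in steps 2 and 3 can be illustrated with a small, self-contained sketch. This is not FreeEed's code: it uses plain Python rather than Hadoop, and the function names (`map_file`, `reduce_all`), the record fields, and the choice of SHA-256 are all assumptions made for illustration, since the page does not say which hash algorithm is used.

```python
import hashlib
from collections import defaultdict


def map_file(name, data):
    """Mapper step (sketch): emit (content hash, record) for one file."""
    # Assumption: SHA-256 stands in for the unspecified hash signature.
    key = hashlib.sha256(data).hexdigest()
    record = {"name": name, "size": len(data)}
    return key, record


def reduce_all(emitted, start=1):
    """Reducer step (sketch): group records by key and mark master/duplicates."""
    groups = defaultdict(list)
    for key, record in emitted:
        groups[key].append(record)

    rows = []
    # Keys are visited in sorted order, so identical files end up adjacent.
    for key in sorted(groups):
        for position, record in enumerate(groups[key]):
            rows.append({
                "number": start + len(rows),  # straight-through numbering
                "hash": key,
                "name": record["name"],
                "role": "master" if position == 0 else "duplicate",
            })
    return rows


files = [
    ("a.txt", b"hello"),
    ("b.txt", b"world"),
    ("copy-of-a.txt", b"hello"),
]
rows = reduce_all(map_file(name, data) for name, data in files)
for row in rows:
    print(row["number"], row["role"], row["name"])
```

Because identical content produces the same key, `a.txt` and `copy-of-a.txt` land in the same group without any explicit comparison. The hypothetical `start` parameter hints at what running more than one reducer would require: each reducer would need its own starting offset to keep the numbering continuous.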

Review

  1. FreeEedUI runs in a servlet container (such as Tomcat) and communicates with the SOLR server. It accepts the results of FreeEed processing either through "Import" or by copying the FreeEed output to a location on the server where FreeEedUI is running and specifying that location. FreeEed does this automatically after processing finishes successfully.
  2. The review part of FreeEedUI allows eDiscovery searches, document exports (singly or in groups), and labeling of documents of interest. Currently there is no capability to add notes to documents. The search uses Lucene, so queries can be arbitrarily complex, which makes for a powerful search capability.
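
To give a feel for what a compound Lucene query looks like, here is a minimal sketch that builds a request URL for Solr's standard `/select` handler. The host, the core name `freeeed_project`, and the field names `text` and `custodian` are hypothetical, and whether FreeEedUI issues its queries in exactly this form is an assumption; the sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name, for illustration only.
SOLR_BASE = "http://localhost:8983/solr/freeeed_project"


def build_search_url(query, rows=10):
    """Build a Solr /select URL for a Lucene-syntax query string."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_BASE + "/select?" + urlencode(params)


# Boolean operators, grouping, and field-scoped terms can be combined freely.
# The field names here are assumed, not taken from FreeEed's schema.
query = 'text:(contract AND breach) AND NOT custodian:"smith"'
url = build_search_url(query)
print(url)
```

Passing the query through `urlencode` keeps the quotes, parentheses, and colons of the Lucene syntax intact when the URL is decoded on the server side.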