Yohann Jardin edited this page Mar 18, 2018 · 6 revisions

Rest HDFS reading

Rest HDFS reading brings together tools for reading data on HDFS synchronously. The goal is to provide installation guides and benchmark instructions for a variety of solutions that answer different needs (performance-oriented, cost-conscious, etc.).

Why work on such a project

Gathering data is one thing; accessing it in real time is another. For analytics in particular, many tools exist to query and process data, but few benchmarks are available to help decide which one to use.

It is also a way to learn about different tools and solutions.

Prerequisites before trying any proposed solution

To keep the benchmark environment clean and the results reproducible, all services run in Docker.

Some solutions need to compile code; Gradle is required for that.

Solutions / Technology used

Apache-Drill, a query engine to read HDFS data.

Apache-Ignite, used as an in-memory file system over HDFS.

A web service running on Apache-Spark as a long-running job, serving as an alternative to Apache-Drill for querying HDFS data.
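To give an idea of what a synchronous read over REST can look like, here is a minimal sketch that targets the REST query endpoint of Apache Drill. It is not taken from this project: it assumes a Drill instance reachable on localhost at Drill's default web port 8047, and the function names `build_drill_request` and `run_query` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Drill's web server is reachable on localhost at its default port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    # Drill expects a JSON body with the query type and the query text.
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL, timeout=30):
    """Send the query and block until the JSON response has been received."""
    request = build_drill_request(sql, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running Drill instance, a call such as ``run_query("SELECT * FROM hdfs.`/data/sample.parquet` LIMIT 10")`` would block until the result comes back. The storage plugin name `hdfs` and the file path are hypothetical and depend on how Drill is configured.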

Getting started

Datasets that can be used for benchmarks.