This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Description & Scenarios

njzhxf edited this page Sep 24, 2014 · 9 revisions

JStorm is a distributed real-time computation engine.

JStorm is a distributed, fault-tolerant real-time computation system. Inspired by Apache Storm, JStorm has been completely rewritten in Java and provides many enhanced features. JStorm is widely used in enterprise environments and has proven robust and stable.

JStorm provides a distributed programming framework similar to Hadoop MapReduce. Developers only need to compose their own pipelined computation logic by implementing the JStorm API (which is fully compatible with the Apache Storm API) and submit the composed "Topology" to a running JStorm instance.
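To make the idea of a composed pipeline concrete, here is a minimal plain-Java sketch. It deliberately does not use the JStorm or Storm API: `MiniTopology`, `addBolt` and `run` are hypothetical names chosen for illustration, and the model covers only a linear chain on a single thread rather than a distributed DAG. Unlike a real topology, it stops once its finite source is drained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of a linear topology; all names are hypothetical, not JStorm API. */
final class MiniTopology {
    private final Iterator<String> spout;                                    // source of messages
    private final List<Function<String, String>> bolts = new ArrayList<>();  // processing steps, in order

    MiniTopology(Iterator<String> spout) {
        this.spout = spout;
    }

    /** Appends a processing step; a step that returns null drops the message. */
    MiniTopology addBolt(Function<String, String> bolt) {
        bolts.add(bolt);
        return this;
    }

    /** Drains the spout, pushing each message through every bolt in order. */
    List<String> run() {
        List<String> out = new ArrayList<>();
        while (spout.hasNext()) {
            String msg = spout.next();
            for (Function<String, String> bolt : bolts) {
                if (msg == null) {
                    break; // an earlier bolt dropped this message
                }
                msg = bolt.apply(msg);
            }
            if (msg != null) {
                out.add(msg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = new MiniTopology(Arrays.asList(" a ", "", "b").iterator())
                .addBolt(String::trim)                 // normalize
                .addBolt(s -> s.isEmpty() ? null : s)  // filter out empty messages
                .addBolt(String::toUpperCase)          // transform
                .run();
        System.out.println(result); // prints [A, B]
    }
}
```

The developer writes only the source and the per-message steps; in this model the wiring between them is the framework's job, which is the division of labor the paragraph above describes.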

Like Hadoop MapReduce, JStorm computes on a DAG (directed acyclic graph). Unlike a Hadoop MapReduce job, which runs only once after it is submitted to the JobTracker, a JStorm topology runs 24/7 and is therefore better suited to streaming data and real-time "in-memory" computation.

JStorm guarantees fault tolerance. Whenever a worker process crashes, the scheduler embedded in the JStorm instance immediately spawns a new worker process to replace the failed one. The "Acking" framework provided by JStorm guarantees that every piece of data is processed at least once.
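The at-least-once guarantee can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not JStorm's actual acking implementation: `ReplayingSource` and its methods are hypothetical names, and the failure is simulated explicitly rather than detected by a framework. The source keeps each emitted message until it is acknowledged and re-queues it on failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of at-least-once delivery; names are hypothetical, not JStorm API. */
final class ReplayingSource {
    private final Deque<String> queue = new ArrayDeque<>();        // messages waiting to be emitted
    private final Map<Integer, String> pending = new HashMap<>();  // emitted but not yet acked
    private int nextId = 0;

    ReplayingSource(List<String> messages) {
        queue.addAll(messages);
    }

    boolean hasWork() {
        return !queue.isEmpty();
    }

    /** Emits the next message and remembers it under a fresh id until it is acked. */
    int emit() {
        int id = nextId++;
        pending.put(id, queue.poll());
        return id;
    }

    String payload(int id) {
        return pending.get(id);
    }

    /** Processing was confirmed: forget the message. */
    void ack(int id) {
        pending.remove(id);
    }

    /** Processing was not confirmed: re-queue the message so it is emitted again. */
    void fail(int id) {
        queue.add(pending.remove(id));
    }

    /** Simulates a worker that handles "b" but crashes before acknowledging it. */
    static List<String> demo() {
        ReplayingSource source = new ReplayingSource(Arrays.asList("a", "b", "c"));
        List<String> processed = new ArrayList<>();
        boolean crashed = false;
        while (source.hasWork()) {
            int id = source.emit();
            String msg = source.payload(id);
            processed.add(msg);              // the side effect happens first
            if (msg.equals("b") && !crashed) {
                crashed = true;
                source.fail(id);             // the ack never arrived, so replay
            } else {
                source.ack(id);
            }
        }
        return processed;                    // [a, b, c, b]
    }
}
```

Note that "b" appears twice in the result: nothing is lost, but a message can be processed more than once. This is why at-least-once processing pairs best with idempotent downstream operations.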

Advantages

Many real-time computation engines existed before Storm and JStorm, but Storm and JStorm have grown steadily more popular since they appeared. Their advantages are listed below:

  • Quick start: the programming interface is easy to learn. To develop a highly scalable application, developers only need to follow the programming conventions of Topology, Spout and Bolt, without thinking about the underlying RPC, redundancy between workers, or data distribution.
  • Excellent scalability: performance scales linearly as you increase the parallelism of a component, for example by doubling it.
  • Robustness: the scheduler automatically assigns a new worker to replace a failed one when a worker fails or a machine breaks down.
  • Data accuracy: users can use the Acker mechanism to prevent data loss. For stricter accuracy requirements, the transaction mechanism can be used to ensure data accuracy.

Scenarios

JStorm processes data through a message-processing pipeline, which makes it particularly suitable for stateless computation. That is, all the data a computation unit depends on should be available in the messages it receives, and preferably one data stream does not depend on another. Typical uses include:

  • Log analysis. JStorm can be used to extract specific data from logs and store the analysis results in an external storage system such as a database. Currently, most log analysis systems are based on JStorm or Storm.
  • Pipeline system. JStorm can be used to transfer data from one system to another, such as synchronizing data to Hadoop.
  • Message converter. JStorm can convert received messages into a given format and then store them in another system, such as messaging middleware.
  • Statistical analyzer. JStorm can be used to extract a certain field from logs or messages, compute a count or sum over it, and finally store the statistics in external storage. The intermediate processing may be more complex.
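
As an illustration of the statistical analyzer scenario, the following plain-Java sketch extracts one field from each log line and counts its occurrences. The `status=` log format and the `StatusCounter` class are assumptions made for this example, and the code does not use the JStorm API. In a real topology the extraction and counting would typically live in bolts, with the totals written to external storage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the "statistical analyzer" scenario; log format and names are assumptions. */
final class StatusCounter {
    private static final String FIELD = "status=";

    /** Extracts the value of the "status=" field from one log line, or null if it is absent. */
    static String extractStatus(String line) {
        for (String token : line.split(" ")) {
            if (token.startsWith(FIELD)) {
                return token.substring(FIELD.length());
            }
        }
        return null;
    }

    /** Counts lines per status; each line is handled using only its own content. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String status = extractStatus(line);
            if (status != null) {
                counts.merge(status, 1, Integer::sum); // increment the counter for this status
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "GET /index status=200",
                "GET /missing status=404",
                "POST /login status=200",
                "malformed line");
        System.out.println(count(lines)); // prints {200=2, 404=1}
    }
}
```

The extraction step is stateless in the sense described above: everything it needs is in the message itself. Only the final counter holds state, which is why the results are usually pushed to an external store.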