-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description & Scenarios
JStorm is a distributed and fault-tolerant realtime computation system. Inspired by Apache Storm, JStorm has been completedly rewritten in Java and provides many more enhanced features. JStorm has been widely used in many enterprise environments and proved robust and stable.
JStorm provides a distributed programming framework very similar to Hadoop MapReduce. The developer only needs to compose his/her own pipe-lined computation logic by implementing the JStorm API (which is fully compatible with Apache Storm API) and submit the composed "Topology" to a working JStorm instance.
Similar to Hadoop MapReduce, JStorm computes on a DAG (directed acyclic graph). Different from Hadoop MapReduce, a JStorm topology runs 24 * 7, the very nature of its continuity and 100% "in-memory" architecture has been proved a particularly suitable solution for streaming data and real-time computation.
JStorm guarantees fault-tolerance. Whenever a worker process crashes, the scheduler embedded in the JStorm instance immediately spawns a new worker process to substitute the failed one. The "Acking" framework provided by JStorm guarantees that every single piece of data will be processed at least once.
There are many real-time computation engines before Storm and JStorm, but the Storm and JStorm become more and more popular since they appeared. The advantages are listed below:
- Quick start, the programming interface is easy to learn. To develop a good scalability application, without thinking about the bottom rpc, redundancy between the worker, data distribution, developers just need to observe the programming specifications of Topology, Spout and Bolt.
- Excellent scalability, you can get a linear expansion performance by double the parallelism of the component directly.
- Robust, the scheduler is able to automatically assign a new worker to replace invalid worker when the worker fails or the machine breaks down.
- Accuracy of the data, user can use Acker mechanism to prevent data loss. If there are more steps on the accuracy requirements, using transaction mechanism to ensure data accuracy.
The way JStorm processing data is based on the message processing pipeline , which is particularly suitable for stateless calculation. That is, the data on which the calculation unit depends, shall all be able to be found in the received messages, and preferably a data stream is not dependent on another data stream. Therefore, it is often used like:
- Log analysis. JStorm can be used to analyze specific data from the log, and store the analysis results in an external storage system such as a database. Currently, most of log analysis system is basing on JStorm or Storm.
- Pipeline system. JStorm can be used to transfer data from one system to another, such as synchronizing data to Hadoop.
- Message converter. According to a certain format, JStorm can convert the messages received, and then store them into another system, such as a messaging middleware.
- Statistical analyzer. JStrom can be used to extract a certain field from the logs or messages, then makes calculation to count or sum the number, and finally stores the statistics in the external storage. The intermediate processing may be more complex.