Apache Storm a real-time data processing system

storm_logo

Apache Storm is a project that allows you to organize the processing guaranteed of various events in real time. For example, Storm can be used to analyze data streams in real time, perform machine learning tasks, organize continuous calculations, implement RPC, ETL, etc.

The system supports clustering, lTo build fault-tolerant configurations, guaranteed data processing mode, and has high enough throughput to process more than a million requests per second on a cluster node.

Apache Storm integration with various queue processing systems and database technologies.

The architecture of Storm involves receiving and processing unstructured data streams and constantly updated using arbitrary complex controllers with the possibility of dividing between different calculation stages.

About Apache Storm

The project was transferred to the Apache community after the acquisition of Twitter by BackType, the company that originally developed the framework.

In practice, Storm was used in BackType to analyze the reflection of events in microblogs, by comparing new tweets on the fly and the links that were used in them (for example, they were evaluated as external links or Twitter ads were broadcast by other participants).

Storm functionality compares to Hadoop platform, and the key difference is that the data is not put into the repository, but is received from the outside and processed in real time.

In Storm, there is no built-in storage layer and the analytic query begins to apply to the incoming data until it is canceled (if Hadoop uses the MapReduce job that takes up a finite time, then Storm uses the idea of ​​running "topologies" continuously.

The execution of the handlers can be distributed to several servers: the Storm automatically parallelizes the work with threads in different nodes of the cluster.

Main use cases that can be given to Apache Storm

Processing new data streams or database updates in real time
Continuous calculations: Storm can make continuous requests and process continuous flows, transferring the results of the processing to the client in real time.

Distributed remote procedure call (RPC): A storm can be used to provide concurrency in executing resource-intensive queries.

A task ("topology") in Storm is a distributed function between nodes that is waiting for incoming messages to be processed.

After receiving the message, the function processes it in a local context and returns the result. An example of using distributed RPC could be parallel processing of search queries or performing operations on a large set of sets.

Apache Storm 2.0 Main New Features

The Apache Foundation launched initiatives to transfer Storm to a new kernel written in Java, the results of which are proposed in the Apache Storm 2.0 version.

All the basic components of the platform are rewritten in Java. Support for writing handlers in Clojure is retained, but is now offered in the form of links. Java 8 is required for Storm 2.0.0 to work.

The multithreaded processing model has been completely redesigned, which has resulted in a notable performance increase (for some topologies, latencies have been reduced by 50-80%).

In the new version a new typified Streams API was proposed, which allows you to configure handlers using operations in the functional programming style.

The new API is implemented on the basis of the regular API and supports automatic merging of operations to optimize their processing. The Windowing API for window operations adds support for saving and restoring state in the backend.

On the other hand the controller to start additional resources into account when making decisions that are not limited to CPU and memory, such as network and GPU parameters, it has been added to the boot scheduler.

A host of enhancements related to ensuring integration with the Kafka platform.
The access control system has been expanded, in which the opportunity has arisen to create administrator groups and token delegation.

Added improvements related to support for SQL and metrics. The administrator interface has new commands for debugging the cluster state.


Be the first to comment

Leave a Comment

Your email address will not be published. Required fields are marked with *

*

*

  1. Responsible for the data: Miguel Ángel Gatón
  2. Purpose of the data: Control SPAM, comment management.
  3. Legitimation: Your consent
  4. Communication of the data: The data will not be communicated to third parties except by legal obligation.
  5. Data storage: Database hosted by Occentus Networks (EU)
  6. Rights: At any time you can limit, recover and delete your information.