Apache Pinot, an open source OLAP data warehouse


Apache Pinot is a distributed, real-time OLAP datastore designed to deliver scalable analytics with low latency.

It can ingest data from batch sources (such as HDFS, S3, Azure Data Lake and Google Cloud Storage) as well as from streaming sources (such as Kafka). Pinot is designed to scale horizontally, so you can grow to larger data sets and higher query rates as needed.

About Apache Pinot

The Pinot project was originally developed at LinkedIn, which open-sourced it in 2015; it was later handed over to the Apache Software Foundation for further joint development. The datastore is built for environments where new data arrives constantly, and it aims to provide minimal and predictable latency so that it can be used for real-time query processing.

Like most other data warehouses and OLAP storage solutions, Pinot supports a SQL-like query language that covers selection, aggregation, filtering, grouping, sorting and distinct queries.
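The query shapes listed above can be sketched with a small runnable example. This is only an illustration: SQLite is used as a stand-in so the code runs without a Pinot cluster, and the `page_views` table and its columns are hypothetical. The same kind of SELECT statement (filter, group, aggregate, sort) is what you would send to an OLAP store.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for a real OLAP store here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, device TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("US", "mobile", 120),
        ("US", "desktop", 80),
        ("DE", "mobile", 50),
        ("DE", "desktop", 70),
        ("FR", "mobile", 30),
    ],
)

# Filtering (WHERE), grouping (GROUP BY), aggregation (SUM), sorting (ORDER BY).
QUERY = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    WHERE device = 'mobile'
    GROUP BY country
    ORDER BY total_views DESC
    LIMIT 10
"""


def run_query(connection, sql):
    """Execute a query and return all rows as a list of tuples."""
    return connection.execute(sql).fetchall()


rows = run_query(conn, QUERY)
devices = run_query(conn, "SELECT DISTINCT device FROM page_views")
print(rows)
```

The second query shows the distinct case: it returns each device type once, regardless of how many rows contain it.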

Apache Pinot scales horizontally and offers the means to achieve fault tolerance and resilience against software and hardware failures. Replication and backup are built directly into the processing cycle for data added to the warehouse. This approach significantly simplifies the architecture, but it also introduces a delay between the moment data is added and the moment it becomes available for queries.

Data is stored in tables in a column-oriented format. In addition, several compression schemes are supported, as is the ability to place multiple values in a single field. Pinot provides a pluggable index system that can use various indexing technologies (sorted index, bitmap index, inverted index, StarTree index, Bloom filter, range index, text search index (Lucene/FST), JSON index, geospatial index).
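To make the storage ideas above more concrete, here is a minimal sketch of three generic techniques: dictionary encoding of a column with fixed-bit-width ids, run-length encoding of the ids, and an inverted index from value to row numbers. It illustrates the general concepts only and does not reflect Pinot's actual on-disk format; all function names and the sample column are hypothetical.

```python
from collections import defaultdict


def dictionary_encode(column):
    """Map each distinct value to a small integer id (sorted dictionary)."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[v] for v in column]


def bits_per_value(dictionary):
    """Fixed bit width needed to store any id of this dictionary."""
    return max(1, (len(dictionary) - 1).bit_length())


def run_length_encode(ids):
    """Collapse consecutive repeats into [id, run_length] pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs


def build_inverted_index(ids):
    """Map each dictionary id to the row numbers that contain it."""
    index = defaultdict(list)
    for row, i in enumerate(ids):
        index[i].append(row)
    return dict(index)


# A sorted column compresses well with run-length encoding.
country = ["DE", "DE", "DE", "FR", "US", "US"]
dictionary, ids = dictionary_encode(country)
width = bits_per_value(dictionary)
runs = run_length_encode(ids)
inverted = build_inverted_index(ids)

# Rows matching country = 'US' are found without scanning the column.
us_rows = inverted[dictionary.index("US")]
print(dictionary, ids, width, runs, us_rows)
```

With an inverted index, a filter such as `country = 'US'` becomes a dictionary lookup followed by a direct read of the matching row numbers, rather than a scan of every row.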

Among the features that stand out in Apache Pinot:

  • Column-oriented: a column-oriented database with various compression schemes such as run length and fixed bit length.
  • Pluggable indexing: pluggable indexing technologies such as sorted index, bitmap index and inverted index.
  • Query optimization: ability to optimize the query/execution plan based on query and segment metadata.
  • Stream and batch ingestion: near real-time ingestion from streams and batch ingestion from Hadoop.
  • Querying: SQL-based query execution engine.
  • Upsert during real-time ingestion: update data at scale with consistency.
  • Multi-value fields: support for multi-value fields, allowing you to query fields as comma-separated values.
  • Cloud native on Kubernetes: a Helm chart provides a horizontally scalable, fault-tolerant clustered deployment that is easy to manage with Kubernetes.
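The upsert feature in the list above can be pictured with a toy model: for each primary key, only the most recent record (according to an ordering column) remains visible to queries. The sketch below assumes last-write-wins semantics on a timestamp; the field names `order_id` and `ts` are hypothetical, and this is not how Pinot implements upserts internally.

```python
def upsert(latest, record, key="order_id", ordering="ts"):
    """Keep the record with the highest ordering value for each key."""
    current = latest.get(record[key])
    if current is None or record[ordering] >= current[ordering]:
        latest[record[key]] = record
    return latest


# Hypothetical event stream; the last event arrives late and is older.
events = [
    {"order_id": 1, "status": "created", "ts": 100},
    {"order_id": 2, "status": "created", "ts": 105},
    {"order_id": 1, "status": "shipped", "ts": 110},
    {"order_id": 1, "status": "paid", "ts": 108},
]

latest = {}
for event in events:
    upsert(latest, event)

statuses = {key: rec["status"] for key, rec in latest.items()}
print(statuses)
```

Comparing on the ordering column instead of arrival order means the late "paid" event does not overwrite the newer "shipped" state.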

New version of Apache Pinot

Apache Pinot version 1.0 was recently released. It consolidates a large amount of work to stabilize the code base and to address requests from the community (more than 300 comments were taken into account).

A highlight of this release is that the new multi-stage query processing engine (Multi-Stage Query Engine) has reached maturity, which makes it possible to support table joins (JOIN). The original engine did an excellent job with simple filtering and aggregation operations, but it did not support join operations in order to keep query execution times predictable.
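As a rough illustration of why joins call for more than one processing stage, the sketch below implements a generic partitioned hash join: a first stage redistributes rows from both tables by the hash of the join key, and a second stage joins each pair of partitions independently. This is a textbook technique shown under simplifying assumptions, not a description of Pinot's engine; the table contents and function names are hypothetical.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Stage 1: route each row to a partition by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def hash_join(left, right, key):
    """Stage 2: build a hash table on one side and probe it with the other."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def multi_stage_join(left, right, key, num_partitions=2):
    """Inner join: rows with equal keys always land in the same partition."""
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(hash_join(lp, rp, key))
    return joined


orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25},
    {"order_id": 2, "customer_id": 11, "amount": 40},
    {"order_id": 3, "customer_id": 10, "amount": 15},
    {"order_id": 4, "customer_id": 12, "amount": 5},  # no matching customer
]
customers = [
    {"customer_id": 10, "name": "Ana"},
    {"customer_id": 11, "name": "Bo"},
]

result = sorted(multi_stage_join(orders, customers, "customer_id"),
                key=lambda row: row["order_id"])
pairs = [(row["order_id"], row["name"]) for row in result]
print(pairs)
```

Because each partition can be joined on a different worker, the intermediate shuffle stage is what lets a join be spread across a cluster, at the cost of moving data between nodes.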

The new engine adds intermediate stages for processing complex queries, and its SQL semantics are close to ANSI SQL. Additionally, the new version offers native support for processing data in JSON format, supports the "NULL" value, integrates with Apache Spark 3.x and improves the implementation of tables in upsert mode (adding segment compaction and support for delete operations).

Finally, if you are interested in learning more, the project code is written in Java and distributed under the Apache license. You can check the details of the new version at the following link.

