BlazingSQL has released the source code of its engine, which uses GPUs to speed up data processing

A new open source project aims to take analytics to the next level: the team behind BlazingSQL recently announced that it has released the source code of its SQL engine, which uses GPUs to speed up data processing. BlazingSQL is not a complete DBMS; it is positioned as an engine for analyzing and processing large data sets, comparable in its tasks to Apache Spark.

For those unfamiliar with it, BlazingSQL is a GPU-accelerated SQL engine built on the RAPIDS ecosystem, a set of open source software libraries for running end-to-end analytics and data science pipelines on GPUs.

According to the team, BlazingSQL was created to address the expense, complexity and slow pace that users face when working with large data sets. It is suited to running individual analytical queries on large data sets (tens of gigabytes) stored in tabular formats (e.g. logs, NetFlow statistics, etc.).

To work with the GPU, BlazingSQL uses the RAPIDS set of libraries, developed with the involvement of NVIDIA, which lets you create data processing and analysis applications that run entirely on the GPU side (a Python interface is provided for using low-level CUDA primitives and parallel computation).

BlazingSQL makes it possible to use SQL instead of the cuDF data processing API (based on Apache Arrow) used by RAPIDS. BlazingSQL is an additional layer that runs on top of cuDF and uses the cuIO library to read data from disk.

SQL queries are translated into cuDF function calls, which load the data onto the GPU and perform merge, aggregate, and filter operations on it. Distributed configurations spanning thousands of GPUs are supported.
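To make the idea of translating a query into filter, merge, and aggregate steps more concrete, here is a minimal sketch. It does not use BlazingSQL or cuDF and says nothing about how BlazingSQL plans queries internally: the standard-library `sqlite3` module stands in as the SQL engine, plain Python lists stand in for GPU DataFrames, and the `flows` and `hosts` tables, along with the `run_sql` and `run_steps` helpers, are hypothetical names made up for the illustration.

```python
import sqlite3
from collections import defaultdict

# Toy tables (hypothetical data). In BlazingSQL these would live in GPU
# memory as DataFrames; here they are plain Python rows.
flows = [
    {"host_id": 1, "bytes": 500},
    {"host_id": 1, "bytes": 1500},
    {"host_id": 2, "bytes": 2500},
    {"host_id": 3, "bytes": 50},
]
hosts = [
    {"host_id": 1, "name": "web"},
    {"host_id": 2, "name": "db"},
    {"host_id": 3, "name": "cache"},
]

QUERY = """
    SELECT h.name, SUM(f.bytes) AS total_bytes
    FROM flows AS f
    JOIN hosts AS h ON f.host_id = h.host_id
    WHERE f.bytes >= 100
    GROUP BY h.name
    ORDER BY h.name
"""


def run_sql(query):
    """Run the query with sqlite3, used here only as a stand-in SQL engine."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE flows (host_id INTEGER, bytes INTEGER)")
    con.execute("CREATE TABLE hosts (host_id INTEGER, name TEXT)")
    con.executemany("INSERT INTO flows VALUES (:host_id, :bytes)", flows)
    con.executemany("INSERT INTO hosts VALUES (:host_id, :name)", hosts)
    rows = con.execute(query).fetchall()
    con.close()
    return rows


def run_steps():
    """The same result spelled out as explicit filter, merge, aggregate steps."""
    # Filter: corresponds to the WHERE clause.
    kept = [row for row in flows if row["bytes"] >= 100]
    # Merge: corresponds to the JOIN on host_id.
    names = {h["host_id"]: h["name"] for h in hosts}
    merged = [
        {"name": names[row["host_id"]], "bytes": row["bytes"]}
        for row in kept
        if row["host_id"] in names
    ]
    # Aggregate: corresponds to GROUP BY with SUM.
    totals = defaultdict(int)
    for row in merged:
        totals[row["name"]] += row["bytes"]
    # Sort: corresponds to ORDER BY.
    return sorted(totals.items())


sql_rows = run_sql(QUERY)
step_rows = run_steps()
print(sql_rows)
print(step_rows)
```

Both paths produce the same rows. The SQL version states what is wanted in one statement, while the step-by-step version has to spell out each operation in order, which is the kind of work a SQL layer over a DataFrame library takes off the user.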

Using SQL allows RAPIDS to be integrated with existing analytical systems without writing custom handlers and without resorting to intermediate loading of data into an additional DBMS, while maintaining full compatibility with all parts of RAPIDS, translating existing functionality into SQL, and ensuring performance at the cuDF level. This includes support for integration with the XGBoost and cuML libraries for analysis and machine learning tasks.

BlazingSQL can run queries against flat files in CSV and Apache Parquet formats located on network and cloud storage systems such as HDFS and AWS S3, transferring the result directly to GPU memory.

Thanks to the parallelization of operations on the GPU and the use of faster video memory, query execution in BlazingSQL is up to 20 times faster than in Apache Spark.

BlazingSQL greatly simplifies working with data: instead of hundreds of cuDF function calls, a single SQL query can do the job.

"BlazingSQL addresses these customer concerns not only with an incredibly fast, distributed SQL GPU engine, but also a zealous focus on simplicity," Rodrigo Aramburu, CEO of BlazingSQL, wrote in a subsequent blog post. "With a few lines of code, BlazingSQL can query your raw data, wherever it resides, and interoperate with your existing RAPIDS and analytics stack."

BlazingSQL enables users to query enterprise data lake data sets directly in GPU memory as a GPU DataFrame (GDF). GDF is a project that offers support for interoperability between GPU applications. It also defines a common GPU memory data layer.

"By leveraging Apache Arrow on GPUs and integrating with Dask, BlazingSQL will extend open source functionality and drive the next wave of interoperability in the fast-paced data science ecosystem."

For those interested, the code is written in C++ with a Python interface for users, and it is released as open source under the Apache 2.0 license.


