
From Data Lake to GPU

BlazingSQL provides a simple SQL interface to ETL massive datasets into GPU memory for AI and Deep Learning workloads.

BlazingSQL & RAPIDS software

BlazingSQL is built on open source projects and is free to use. Developed in collaboration with the RAPIDS team, BlazingSQL offers an intuitive, scalable SQL interface that brings large data sets from persistent storage into GPU memory, and therefore into the RAPIDS software ecosystem.


BlazingSQL is heavily integrated with distributed object stores, commonly referred to as 'Data Lakes', in order to most effectively extract value from existing large data sets.
Data Preparation

DataFrame - cuDF

This is a GPU-accelerated DataFrame-manipulation library built on Apache Arrow in GPU memory. It is designed to enable data wrangling for model training. Its Python bindings to the GPU-accelerated, low-level CUDA C++ kernels mirror the Pandas API, making onboarding and the transition from Pandas seamless.
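Because cuDF mirrors the Pandas API call-for-call, the kind of wrangling it accelerates can be sketched against Pandas itself. The snippet below is a minimal illustration (on CPU, using Pandas); with cuDF the same operations would run on the GPU with an `import cudf` in place of `import pandas`.

```python
import pandas as pd  # with cuDF the import changes, but the calls below stay the same

# Build a small customer table and do typical pre-training wrangling.
df = pd.DataFrame({
    "c_custkey": [1, 2, 3, 4],
    "c_nationkey": [10, 10, 20, 20],
    "c_acctbal": [100.0, -50.0, 300.0, 25.0],
})

# Filter out customers with negative balances, then aggregate per nation.
positive = df[df["c_acctbal"] > 0]
per_nation = positive.groupby("c_nationkey")["c_acctbal"].sum().reset_index()
print(per_nation)
```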

Apache Arrow in GPU Memory

This is a columnar, in-memory data structure that delivers efficient and fast data interchange with flexibility to support complex data models.
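The benefit of a columnar layout can be shown with a toy sketch in plain Python: the same records stored row-wise versus column-wise, where an aggregation over one column only has to touch that column's buffer. (Arrow's actual format uses contiguous typed buffers; plain Python lists stand in for them here purely for illustration.)

```python
# The same three records, stored row-wise...
rows = [
    {"custkey": 1, "acctbal": 100.0},
    {"custkey": 2, "acctbal": 300.0},
    {"custkey": 3, "acctbal": 25.0},
]

# ...and column-wise: one array per column, as Arrow lays data out.
columns = {
    "custkey": [1, 2, 3],
    "acctbal": [100.0, 300.0, 25.0],
}

# Summing one column scans only that column's buffer; there is no need to
# walk every field of every record, which is what makes columnar scans fast.
total = sum(columns["acctbal"])
assert total == sum(r["acctbal"] for r in rows)  # same data, same answer
print(total)
```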
Model Training

Machine Learning Libraries - cuML

This collection of GPU-accelerated machine learning libraries will eventually provide GPU versions of all machine learning algorithms available in Scikit-Learn.
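cuML follows Scikit-Learn's estimator convention: every algorithm is an object with `fit()` and `predict()` methods. A minimal sketch of that pattern, written here as a tiny one-dimensional linear regression in plain Python (cuML's estimators expose the same interface, backed by GPU kernels):

```python
class LinearRegression1D:
    """Toy estimator following the Scikit-Learn fit/predict convention."""

    def fit(self, x, y):
        # Ordinary least squares for one feature: slope and intercept.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        self.coef_ = (sum((a - mx) * (b - my) for a, b in zip(x, y))
                      / sum((a - mx) ** 2 for a in x))
        self.intercept_ = my - self.coef_ * mx
        return self

    def predict(self, x):
        return [self.coef_ * a + self.intercept_ for a in x]

model = LinearRegression1D().fit([1, 2, 3], [2, 4, 6])
print(model.predict([4]))  # -> [8.0]
```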
Model Training

Graph Analytics Libraries - cuGraph

This is a collection of graph analytics libraries that integrates seamlessly into the RAPIDS data science software suite.
Model Training

Deep Learning Libraries - cuDNN

RAPIDS provides native array_interface support. This means data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that accept array_interface such as PyTorch and Chainer.
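The zero-copy hand-off described above relies on the array interface protocol. A small CPU-side sketch using NumPy's `__array_interface__` (the GPU analogue for RAPIDS data is the CUDA array interface) shows the idea: a consumer reads the buffer's shape and dtype from the interface dict and wraps the same memory without copying.

```python
import numpy as np

# A NumPy array advertises its memory layout via __array_interface__:
# a dict carrying the buffer pointer, shape, and dtype string. Any
# framework that understands the protocol can wrap the same memory.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
iface = a.__array_interface__
print(iface["shape"], iface["typestr"])  # (2, 3) '<f4'

# np.asarray over an object exposing the protocol is a zero-copy view:
b = np.asarray(a)
b[0, 0] = 42.0
assert a[0, 0] == 42.0  # b shares a's underlying buffer
```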

Visualization Libraries - RTX

Coming soon. RAPIDS will include tightly integrated data visualization libraries based on Apache Arrow. Native GPU in-memory data format provides high-performance, high-FPS data visualization, even with very large datasets.
Data Preparation
Model Training


BlazingSQL offers a Python interface to directly query flat files (limited to Apache Parquet initially) and output the results as GPU in-memory Apache Arrow DataFrames. With these DataFrames, users can use PyGDF or Dask_GDF, which provide a simple interface similar to the Pandas DataFrame.
  from blazingSQL import Connection
  from blazingSQL import blazingSQLClient
  from blazingSQL import inputData
  from BlazingFileSystemManager import FileSystemManager
  import pygdf
  from pygdf.dataframe import DataFrame
  from libgdf_cffi import ffi, libgdf
  from collections import OrderedDict

  print ('*** Open Connection ***')
  connection = Connection('/tmp/orchestrator.socket', '/tmp/ral.socket')
  access_token = connection.accessToken  # token issued for this connection (attribute name assumed)
  client = blazingSQLClient(access_token)

  print ('*** Register a S3 File System ***')
  fs_mngr = FileSystemManager(access_token)
  fs_name = "company"
  bucket_name = "company_bucket"
  encryption_type = "None"
  kms_key_amazon_resource_name = ""
  access_key_id = "accessKeyIddsf3"
  secret_key = "secretKey234"
  session_token = ""
  root = "/organization/subgroup/project/subproject/subunit/"
  fs_status = fs_mngr.registerS3FileSystem(fs_name, bucket_name, encryption_type,  
  kms_key_amazon_resource_name, access_key_id, secret_key, session_token, root)
  print ('*** Create a table by manually defining the schema ***')
  db_name = "myDb"
  status = client.createDatabase(db_name)
  table_nameA = "customer"
  columns = OrderedDict()
  columns["c_custkey"] = libgdf.GDF_INT32
  columns["c_nationkey"] = libgdf.GDF_INT8
  columns["c_acctbal"] = libgdf.GDF_FLOAT32
  status = client.createTable(db_name, table_nameA, columns)

  print ('*** Create a table from an Apache Parquet file schema ***')
  parquet_filepaths_set = []
  table_nameB = "orders"
  status = client.createTableFromParquetFile(db_name, table_nameB, parquet_filepaths_set)

  print ('*** Define inputs for a SQL query ***')
  # an input can be any GDF created by pyGDF
  gdfDF1 = pygdf.readParquet("/home/demo/test_data/customer/customer.parquet")
  # add inputs to an input data set
  input_dataset = []
  input_dataset.append(inputData(table_nameA, gdfDF1))
  # an input can also be a list of Apache Parquet files
  input_dataset.append(inputData(table_nameB, parquet_filepaths_set))

  print ('*** Run a query ***')
  query = """select sum(o.o_totalprice) from orders as o
  inner join customer as c on o.o_custkey = c.c_custkey
  where c.c_custkey = 1;"""
  result_tokens = client.runQuery(query, input_dataset)
  resultResponse = client.getResult(result_tokens[0])
  print('     status: %s' % resultResponse.metadata.status)
  print('    message: %s' % resultResponse.metadata.message)
  print('       time: %s' % resultResponse.metadata.time)
  result_GDF = resultResponse.columns  # the query result as a GDF (attribute name assumed)


A Collaboratively Built Ecosystem
BlazingSQL is the fastest and easiest way to pull data from Data Lakes into AI workloads.

We are here to help and we would love to chat with you.

Contact us »

Get Help

If you have any questions, please reach out.
©2015 - 2018 | BlazingDB