A Python Compiler for Big Data
Blaze is the next generation of NumPy, Python’s extremely popular
array library. At Continuum Analytics we aim to tackle some of the
hardest problems in large data analytics with
our Python stack of Numba and Blaze, which together will form the basis
of a distributed computation and storage system which is simultaneously
able to generate optimized machine code specialized to the data being
operated on.
Blaze aims to extend the structural properties of NumPy arrays to a
wider variety of table and array-like structures that support commonly
requested features such as missing values, type heterogeneity, and labeled arrays.
Unlike NumPy, Blaze is designed to handle out-of-core computations on
large datasets that exceed the system memory capacity, as well as on
distributed and streaming data. Blaze is able to operate on datasets
transparently as if they behaved like in-memory NumPy arrays.
We aim to allow analysts and scientists to productively write robust
and efficient code, without getting bogged down in the details of how
to distribute computation, or worse, how to transport and convert data
between databases, formats, proprietary data warehouses, and other silos.
The core mode of operation for Blaze is the construction of lazy
expression graphs, much in the style of Theano. A graph is constructed
with each node corresponding to a source of data or a ByteProvider. The
behavior is similar to an ORM in that operations over the objects don’t
correspond to immediate computations but instead construct the
query or execution plan over the data.
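The lazy-graph idea can be illustrated with a minimal sketch in plain Python. The classes below (Node, Source, Op) are hypothetical names invented for this illustration and are not Blaze's API; the point is only that arithmetic on the objects records a plan, and nothing is computed until eval is called.

```python
import operator


class Node:
    """Base class for graph nodes (hypothetical, not Blaze's API)."""

    def __add__(self, other):
        # Record the addition instead of performing it.
        return Op(operator.add, self, other)

    def __mul__(self, other):
        # Record the multiplication instead of performing it.
        return Op(operator.mul, self, other)


class Source(Node):
    """Leaf node standing in for a source of data."""

    def __init__(self, values):
        self.values = list(values)

    def eval(self):
        return self.values


class Op(Node):
    """Interior node: an operation and its two operand nodes."""

    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def eval(self):
        # Only here is any element-wise work actually done.
        lhs, rhs = self.left.eval(), self.right.eval()
        return [self.fn(x, y) for x, y in zip(lhs, rhs)]


a = Source([1, 2, 3])
b = Source([10, 20, 30])
expr = (a + b) * b      # builds a graph, computes nothing
result = expr.eval()    # walks the graph and computes the values
```

Here expr is an Op object describing the computation, much as an ORM query object describes a query; a real system would optimize the graph before running it rather than evaluate it node by node as this sketch does.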
Most importantly, the data in Blaze can be imported from a wide variety
of sources, including on-disk arrays. Together with IOPro we aim to be
able to import data from CSV, Amazon S3, and SQL databases as seamlessly
as if they were local files.
Operations over these objects then construct a graph representation of
the expression, which can be evaluated with eval to produce a concrete
result.
Blaze introduces a richer grammar for describing the structural and
value type properties of data. We call this description the “datashape”
of the data points and it forms a superset of NumPy’s dtype.
Once a graph is evaluated, Blaze attempts to gather all the type
information and metadata available from the user input to inform better
computation selection and scheduling. The compiler converts expression
graph objects into an intermediate form called ATerm, drawn from the
StrategoXT project. This intermediate form is roughly a subset of Python
expressions but allows the explicit annotation of type and metadata
information directly on the AST. The ATerm IR forms the meeting point
where both Numba and Blaze can come together to perform code generation
and graph rewriting to produce more efficient kernels.
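To give a feel for an annotated term representation, here is a rough sketch that turns a Python expression into an ATerm-like string using the standard ast module. The ATerm format attaches annotations in curly braces after a term; the function name to_aterm, the type table, and the exact constructor names are assumptions made for this illustration and do not show what Blaze actually emits.

```python
import ast


def to_aterm(node, types):
    """Render a small Python expression as an ATerm-like string.

    `types` maps variable names to type names; known types are attached
    to the term as a {...} annotation. Illustrative only.
    """
    if isinstance(node, ast.Expression):
        return to_aterm(node.body, types)
    if isinstance(node, ast.BinOp):
        op = type(node.op).__name__  # e.g. "Add", "Mult"
        left = to_aterm(node.left, types)
        right = to_aterm(node.right, types)
        return f"{op}({left}, {right})"
    if isinstance(node, ast.Name):
        base = f'Name("{node.id}")'
        ann = types.get(node.id)
        return f"{base}{{{ann}}}" if ann else base
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise TypeError(f"unsupported node: {type(node).__name__}")


tree = ast.parse("a + b * 2", mode="eval")
term = to_aterm(tree, {"a": "float64", "b": "int32"})
print(term)
```

Because the type information travels with each term, a later rewriting or code generation pass can specialize on it without consulting a separate table.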
The types of expressions that are not explicitly annotated need to be
inferred from their usage across the entire graph or determined at runtime.
The core libraries of Blaze will be explicitly annotated with type
information so that, together with the type signatures of the
operators and functions in question, we can use Milner-style type
inference to allow the end user to omit the explicit declaration of type
information as much as possible.
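The following is a minimal sketch of inference by unification, the mechanism at the heart of Milner-style systems. It is far simpler than a full Hindley-Milner implementation (no polymorphic let, no function types, no occurs check), and the operator names and signatures are hypothetical, chosen only to show how one annotated variable lets the types of the others be recovered.

```python
class TypeVar:
    """A placeholder type that unification may bind to another type."""

    def __init__(self, name):
        self.name = name
        self.binding = None


def resolve(t):
    """Follow bindings until a concrete type or a free variable is reached."""
    while isinstance(t, TypeVar) and t.binding is not None:
        t = t.binding
    return t


def unify(a, b):
    """Force two types to be equal, binding free variables as needed."""
    a, b = resolve(a), resolve(b)
    if a is b:
        return
    if isinstance(a, TypeVar):
        a.binding = b
    elif isinstance(b, TypeVar):
        b.binding = a
    elif a != b:
        raise TypeError(f"cannot unify {a} with {b}")


def signature(op):
    """Hypothetical operator signatures with a fresh type variable per use."""
    t = TypeVar("t")
    if op in ("add", "mul"):
        return [t, t], t          # (t, t) -> t
    if op == "less":
        return [t, t], "bool"     # (t, t) -> bool
    raise KeyError(op)


def infer(expr, env):
    """Infer the type of expr: a variable name or a tuple (op, arg, arg)."""
    if isinstance(expr, str):
        # Unannotated variables get a fresh type variable on first use.
        return env.setdefault(expr, TypeVar(expr))
    op, *args = expr
    params, result = signature(op)
    for param, arg in zip(params, args):
        unify(param, infer(arg, env))
    return resolve(result)


# Only y is annotated; x and z are inferred from how they are used.
env = {"y": "float64"}
result_type = infer(("less", ("add", "x", "y"), "z"), env)
x_type = resolve(env["x"])
z_type = resolve(env["z"])
```

In this sketch the single annotation on y is enough to conclude that x and z are float64 and that the whole expression is bool. Mixing int32 and float64 operands raises a TypeError here, since the sketch has no notion of coercion.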
Once an efficient execution plan is generated, it is executed by the
Blaze runtime. Because our implementation does not explicitly depend on
Python, we are able to overcome many of the shortcomings of the Python
runtime: we can run without the GIL and use real threads
to dispatch custom Numba kernels running at near C speed, without the
performance limitations of Python.
One of the primary complaints about NumPy is the inability to mitigate
the effects of temporaries and the roundtrips between Python and NumPy.
With Blaze we are able to fuse the entire execution into a single
dispatch, which is more efficient than the equivalent sequence of ufunc
calls and allocation of temporaries in Python space.
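The cost of temporaries can be seen with plain NumPy. The sketch below is not Blaze code; fused_kernel is a hypothetical name, and it is written as a Python loop purely to show the shape of the single-pass kernel that a compiler would emit as machine code.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.full(5, 2.0)
c = np.ones(5)

# Plain NumPy: `a * b` allocates a temporary array, then `+ c`
# allocates a second array for the result. Two ufunc dispatches.
unfused = a * b + c

# Reusing one buffer through `out=` avoids the extra allocation,
# but there are still two ufunc dispatches and two passes over memory.
buf = np.empty_like(a)
np.multiply(a, b, out=buf)
np.add(buf, c, out=buf)


def fused_kernel(a, b, c, out):
    """Single pass over the data with no intermediate arrays.

    A pure Python loop like this is slow; it only illustrates the loop
    structure a fused, compiled kernel would have.
    """
    for i in range(out.shape[0]):
        out[i] = a[i] * b[i] + c[i]
    return out


fused = fused_kernel(a, b, c, np.empty_like(a))
```

All three variants compute the same values; they differ only in how many arrays are allocated and how many times the data is traversed, which is what matters once the arrays no longer fit in cache or in memory.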
In addition to faster serial execution, our proprietary products
such as NumbaPro will be capable of mapping computations onto a variety
of modern hardware such as GPUs, using more sophisticated
parallelization techniques to further increase performance.
One can think of Blaze and Numba as being two complementary parts of the
plan to bring Python into the large data analytics world. Together Blaze
and Numba form a compiler-like infrastructure with Blaze as the type
system and symbol table to complement Numba’s code generation.