Cubed is composed of five layers: from the storage layer at the bottom, to the Array API layer at the top:
Blue blocks are implemented in Cubed, green in Rechunker, and red in other projects like Zarr and Beam.
Let’s go through the layers from the bottom:
Every array in Cubed is backed by a Zarr array. This means that the array type inherits Zarr attributes including the underlying store (which may be on local disk, or a cloud store, for example), as well as the shape, dtype, and chunks. Chunks are the unit of storage and computation in this system.
Cubed uses external runtimes for computation. It follows the Rechunker model (and uses its algorithm) to delegate tasks to stateless executors, which include Python (in-process), Lithops, Modal, and Apache Beam.
There are two primitive operations on arrays:
- Applies a function to multiple blocks from multiple inputs, expressed using concise indexing rules.
- Changes the chunking of an array, without changing its shape or dtype.
These are built on top of the primitive operations, and provide functions that are needed to implement all array operations.
- Applies a function elementwise to its arguments, respecting broadcasting.
- Applies a function to corresponding blocks from multiple inputs.
- Applies a function across blocks of a new array, reading directly from side inputs (not necessarily in a blockwise fashion).
- Subsets an array, along one or more axes.
- Applies a function to reduce an array along one or more axes.
- A reduction that returns the array indexes, not the values.
The new Python Array API was chosen for the public API as it provides a useful, well-defined subset of the NumPy API. There are a few extensions, including Zarr IO, random number generation, and operations like
map_blocks which are heavily used in Dask applications.