NEP 22 — Duck typing for NumPy arrays – high level overview#
- Author: Stephan Hoyer <shoyer@google.com>, Nathaniel J. Smith <njs@pobox.com>
- Status: Final
- Type: Informational
- Created: 2018-03-22
- Resolution: https://mail.python.org/pipermail/numpy-discussion/2018-September/078752.html
Abstract#
We outline a high-level vision for how NumPy will approach handling “duck arrays”. This is an Informational-class NEP; it doesn’t prescribe full details for any particular implementation. In brief, we propose developing a number of new protocols for defining implementations of multi-dimensional arrays with high-level APIs matching NumPy.
Detailed description#
Traditionally, NumPy’s ndarray
objects have provided two things: a
high level API for expressing operations on homogeneously-typed,
arbitrary-dimensional, array-structured data, and a concrete
implementation of the API based on strided in-RAM storage. The API is
powerful, fairly general, and used ubiquitously across the scientific
Python stack. The concrete implementation, on the other hand, is
suitable for a wide range of uses, but has limitations: as data sets
grow and NumPy becomes used in a variety of new environments, there
are increasingly cases where the strided in-RAM storage strategy is
inappropriate, and users find they need sparse arrays, lazily
evaluated arrays (as in dask), compressed arrays (as in blosc), arrays
stored in GPU memory, arrays stored in alternative formats such as
Arrow, and so forth – yet users still want to work with these arrays
using the familiar NumPy APIs, and reuse existing code with minimal
(ideally zero) porting overhead. As a working shorthand, we call these
“duck arrays”, by analogy with Python’s “duck typing”: a “duck array”
is a Python object which “quacks like” a numpy array in the sense that
it has the same or similar Python API, but doesn’t share the C-level
implementation.
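To make the term concrete, here is a minimal hypothetical sketch (the class name and storage scheme are invented for this example) of an object that quacks like an array without sharing ndarray's C-level implementation:

```python
import numpy as np

class CompressedArray:
    """Hypothetical duck array: NumPy-like Python API, non-NumPy storage."""

    def __init__(self, blocks, shape, dtype):
        self._blocks = blocks          # e.g. compressed chunks, not a strided buffer
        self.shape = shape
        self.ndim = len(shape)
        self.dtype = np.dtype(dtype)   # a real numpy.dtype, as advised below

    def mean(self, axis=None):
        # Same signature and semantics as ndarray.mean, but computed
        # block-by-block without ever materializing an in-RAM buffer.
        ...
```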
This NEP doesn’t propose any specific changes to NumPy or other projects; instead, it gives an overview of how we hope to extend NumPy to support a robust ecosystem of projects implementing and relying upon its high level API.
Terminology#
“Duck array” works fine as a placeholder for now, but it’s pretty jargony and may confuse new users, so we may want to pick something else for the actual API functions. Unfortunately, “array-like” is already taken for the concept of “anything that can be coerced into an array” (including e.g. list objects), and “anyarray” is already taken for the concept of “something that shares ndarray’s implementation, but has different semantics”, which is the opposite of a duck array (e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is a classic bike-shed, so for now we’re just using “duck array”. Some possible alternatives, though, include: arrayish, pseudoarray, nominalarray, ersatzarray, arraymimic, …
General approach#
At a high level, duck array support requires working through each of the API functions provided by NumPy, and figuring out how it can be extended to work with duck array objects. In some cases this is easy (e.g., methods/attributes on ndarray itself); in other cases it’s more difficult. Here are some principles we’ve found useful so far:
Principle 1: focus on “full” duck arrays, but don’t rule out “partial” duck arrays#
We can distinguish between two classes:
- “full” duck arrays, which aspire to fully implement np.ndarray’s Python-level APIs and work essentially anywhere that np.ndarray works, and
- “partial” duck arrays, which intentionally implement only a subset of np.ndarray’s API.
Full duck arrays are, well, kind of boring. They have exactly the same semantics as ndarray, with differences being restricted to under-the-hood decisions about how the data is actually stored. The kind of people who are excited about making numpy more extensible are also, unsurprisingly, excited about changing or extending numpy’s semantics. So there’s been a lot of discussion of how to best support partial duck arrays. We’ve been guilty of this ourselves.
At this point though, we think the best general strategy is to focus our efforts primarily on supporting full duck arrays, and only worry about partial duck arrays as much as we need to make sure we don’t accidentally rule them out for no reason.
Why focus on full duck arrays? Several reasons:
First, there are lots of very clear use cases. Potential consumers of the full duck array interface include almost every package that uses numpy (scipy, sklearn, astropy, …), and in particular packages that provide array-wrapping-classes that handle multiple types of arrays, such as xarray and dask.array. Potential implementers of the full duck array interface include: distributed arrays, sparse arrays, masked arrays, arrays with units (unless they switch to using dtypes), labeled arrays, and so forth. Clear use cases lead to good and relevant APIs.
Second, the Anna Karenina principle applies here: full duck arrays are all alike, but every partial duck array is partial in its own way:
- xarray.DataArray is mostly a duck array, but has incompatible broadcasting semantics.
- xarray.Dataset wraps multiple arrays in one object; it still implements some array interfaces like __array_ufunc__, but certainly not all of them.
- pandas.Series has methods with similar behavior to numpy, but unique null-skipping behavior.
- scipy’s LinearOperators support matrix multiplication and nothing else.
- h5py and similar libraries for accessing array storage have objects that support numpy-like slicing and conversion into a full array, but not computation.
- Some classes may be similar to ndarray, but without supporting the full indexing semantics.
- And so forth.
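To make the “partial in its own way” point concrete, here is a minimal sketch in the spirit of scipy’s LinearOperator (the class name is invented for this example): it supports matrix multiplication and essentially nothing else.

```python
import numpy as np

class DiagonalOperator:
    """Sketch of a "partial" duck array: matmul only."""

    def __init__(self, diag):
        self.diag = np.asarray(diag)
        self.shape = (self.diag.size, self.diag.size)
        self.dtype = self.diag.dtype

    def __matmul__(self, other):
        other = np.asarray(other)
        if other.ndim == 1:
            return self.diag * other             # D @ x
        return self.diag[:, np.newaxis] * other  # D @ X scales the rows
```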
Despite our best attempts, we haven’t found any clear, unique way of slicing up the ndarray API into a hierarchy of related types that captures these distinctions; in fact, it’s unlikely that any single person even understands all the distinctions. And this is important, because we have a lot of APIs that we need to add duck array support to (both in numpy and in all the projects that depend on numpy!). By definition, these already work for ndarray, so hopefully getting them to work for full duck arrays shouldn’t be so hard, since by definition full duck arrays act like ndarray. It’d be very cumbersome to have to go through each function and identify the exact subset of the ndarray API that it needs, then figure out which partial array types can/should support it. Once we have things working for full duck arrays, we can go back later and refine the APIs further as needed. Focusing on full duck arrays allows us to start making progress immediately.
In the future, it might be useful to identify specific use cases for duck arrays and standardize narrower interfaces targeted just at those use cases. For example, it might make sense to have a standard “array loader” interface that file access libraries like h5py, netcdf, pydap, zarr, … all implement, to make it easy to switch between these libraries. But that’s something that we can do as we go, and it doesn’t necessarily have to involve the NumPy devs at all. For an example of what this might look like, see the documentation for dask.array.from_array.
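As a taste of what such a narrow interface boundary looks like in practice, here is a small example (the file and dataset names are invented): dask.array.from_array only needs .shape, .dtype, and numpy-style slicing from the object it wraps.

```python
import dask.array as da
import h5py

# "data.h5" and "/measurements" are placeholder names for this example.
f = h5py.File("data.h5", mode="r")
dset = f["/measurements"]  # supports .shape, .dtype, and slicing only
x = da.from_array(dset, chunks=(1000, 1000))  # lazy, chunked duck array
```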
Principle 2: take advantage of duck typing#
ndarray has a very large API surface area:
```python
In [1]: len(set(dir(np.ndarray)) - set(dir(object)))
Out[1]: 138
```
And this is a huge underestimate, because there are also many free-standing functions in NumPy and other libraries which currently use the NumPy C API and thus only work on ndarray objects. In type theory, a type is defined by the operations you can perform on an object; thus, the actual type of ndarray includes not just its methods and attributes, but all of these functions. For duck arrays to be successful, they’ll need to implement a large proportion of the ndarray API – but not all of it. (For example, dask.array.Array does not provide an equivalent to the ndarray.ptp method, presumably because no-one has ever noticed or cared about its absence. But this doesn’t seem to have stopped people from using dask.)
This means that realistically, we can’t hope to define the whole duck array API up front, or that anyone will be able to implement it all in one go; this will be an incremental process. It also means that even the so-called “full” duck array interface is somewhat fuzzily defined at the borders; there are parts of the np.ndarray API that duck arrays won’t have to implement, but we aren’t entirely sure what those are.
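In practice, consumers cope with this fuzziness the usual duck-typing way: try the attribute and fall back. A small illustrative sketch (the helper name is ours):

```python
def peak_to_peak(arr):
    # Consumer-side duck typing: feature-detect .ptp and fall back,
    # instead of insisting on the full ndarray API up front. Works for
    # ndarray, and degrades gracefully for duck arrays (like
    # dask.array.Array) that never implemented .ptp.
    try:
        return arr.ptp()
    except AttributeError:
        return arr.max() - arr.min()
```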
And ultimately, it isn’t really up to the NumPy developers to define what does or doesn’t qualify as a duck array. If we want scikit-learn functions to work on dask arrays (for example), then that’s going to require coordination between those two projects to discover incompatibilities, and when an incompatibility is discovered it will be up to them to negotiate who should change and how. The NumPy project can provide technical tools and general advice to help resolve these disagreements, but we can’t force one group or another to take responsibility for any given bug.
Therefore, even though we’re focusing on “full” duck arrays, we don’t attempt to define a normative “array ABC” – maybe this will be useful someday, but right now, it’s not. And as a convenient side-effect, the lack of a normative definition leaves partial duck arrays room to experiment.
But, we do provide some more detailed advice for duck array implementers and consumers below.
Principle 3: focus on protocols#
Historically, numpy has had lots of success at interoperating with third-party objects by defining protocols, like __array__ (asks an arbitrary object to convert itself into an array), __array_interface__ (a precursor to Python’s buffer protocol), and __array_ufunc__ (allows third-party objects to support ufuncs like np.exp).
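For instance, the __array__ protocol in action; the Interval class below is invented for illustration:

```python
import numpy as np

class Interval:
    """Hypothetical third-party object that knows how to become an array."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __array__(self, dtype=None):
        # Called by np.asarray / np.array to request an ndarray version.
        return np.array([self.lo, self.hi], dtype=dtype)

np.asarray(Interval(0.0, 1.0))  # -> array([0., 1.])
```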
NEP 16 took a different approach: we need a duck-array equivalent of asarray, and it proposed to do this by defining a version of asarray that would let through objects which implemented a new AbstractArray ABC. As noted above, we now think that trying to define an ABC is a bad idea for other reasons. But when this NEP was discussed on the mailing list, we realized that even on its own merits, this idea is not so great. A better approach is to define a method that can be called on an arbitrary object to ask it to convert itself into a duck array, and then define a version of asarray that calls this method.
This is strictly more powerful: if an object is already a duck array, it can simply return self. It allows more correct semantics: NEP 16 assumed that asarray(obj, dtype=X) is the same as asarray(obj).astype(X), but this isn’t true. And it supports more use cases: if h5py supported sparse arrays, it might want to provide an object which is not itself a sparse array, but which can be automatically converted into a sparse array. See NEP <XX, to be written> for full details.
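To make the shape of the idea concrete, here is a rough sketch; __duckarray__ and asduckarray are placeholder spellings, pending the future NEP referenced above:

```python
import numpy as np

def asduckarray(obj):
    """Provisional duck-array analogue of np.asarray: ask the object to
    convert itself, rather than gatekeeping on an ABC."""
    duckify = getattr(type(obj), "__duckarray__", None)
    if duckify is not None:
        # An object that already is a duck array can simply return self;
        # others (e.g. an h5py-backed proxy) can build and return one.
        return duckify(obj)
    return np.asarray(obj)
```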
The protocol approach is also more consistent with core Python conventions: for example, see the __iter__ method for coercing objects to iterators, or the __index__ protocol for safe integer coercion. And finally, focusing on protocols leaves the door open for partial duck arrays, which can pick and choose which subset of the protocols they want to participate in, each of which has well-defined semantics.
Conclusion: protocols are one honking great idea – let’s do more of those.
Principle 4: reuse existing methods when possible#
It’s tempting to try to define cleaned up versions of ndarray methods with a more minimal interface to allow for easier implementation. For example, __array_reshape__ could drop some of the strange arguments accepted by reshape, and __array_basic_getitem__ could drop all the strange edge cases of NumPy’s advanced indexing.
But as discussed above, we don’t really know what APIs we need for duck-typing ndarray. We would inevitably end up with a very long list of new special methods. In contrast, existing methods like reshape and __getitem__ have the advantage of already being widely used/exercised by libraries that use duck arrays, and in practice, any serious duck array type is going to have to implement them anyway.
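Concretely, consumer code that sticks to existing methods is duck-compatible for free; this hypothetical helper works unchanged on ndarray, dask arrays, and so on:

```python
def flatten_trailing(arr):
    # Relies only on .shape and .reshape, which any serious duck array
    # implements anyway -- no coercion via np.asarray is needed.
    return arr.reshape(arr.shape[0], -1)
```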
Principle 5: make it easy to do the right thing#
Making duck arrays work well is going to be a community effort. Documentation helps, but only goes so far. We want to make it easy to implement duck arrays that do the right thing.
One way NumPy can help is by providing mixin classes for implementing large groups of related functionality at once. NDArrayOperatorsMixin is a good example: it allows for implementing arithmetic operators implicitly via the __array_ufunc__ method. It’s not complete, and we’ll want more helpers like that (e.g. for reductions).
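A minimal sketch of the mixin in use (the Wrapped class is invented for illustration):

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class Wrapped(NDArrayOperatorsMixin):
    """The mixin supplies +, -, *, etc. by routing every operator
    through __array_ufunc__."""

    def __init__(self, value):
        self.value = np.asarray(value)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__" or kwargs.get("out") is not None:
            return NotImplemented  # keep the sketch to out-of-place calls
        # Unwrap any Wrapped inputs, apply the ufunc, and rewrap.
        args = [x.value if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*args, **kwargs))

# Wrapped([1, 2]) + Wrapped([3, 4]) dispatches through np.add.
```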
(We initially thought that the importance of these mixins might be an argument for providing an array ABC, since that’s the standard way to do mixins in modern Python. But in discussion around NEP 16 we realized that partial duck arrays also wanted to take advantage of these mixins in some cases, so even if we did have an array ABC then the mixins would still need some sort of separate existence. So never mind that argument.)
Tentative duck array guidelines#
As a general rule, libraries using duck arrays should insist upon the minimum possible requirements, and libraries implementing duck arrays should provide as complete an API as possible. This will ensure maximum compatibility. For example, users should prefer to rely on .transpose() rather than .swapaxes() (which can be implemented in terms of transpose), but duck array authors should ideally implement both.
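For instance, the swapaxes fallback mentioned above is a small shim over transpose; a hypothetical consumer-side version:

```python
def swapaxes_via_transpose(arr, axis1, axis2):
    # .swapaxes expressed with .transpose, for duck arrays
    # that only provide the latter.
    order = list(range(arr.ndim))
    order[axis1], order[axis2] = order[axis2], order[axis1]
    return arr.transpose(order)
```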
If you are trying to implement a duck array, then you should strive to implement everything. You certainly need .shape, .ndim and .dtype, but also your dtype attribute should actually be a numpy.dtype object, weird fancy indexing edge cases should ideally work, etc. Only details related to NumPy’s specific np.ndarray implementation (e.g., strides, data, view) are explicitly out of scope.
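A sketch of what the consumer-side minimum might look like (illustrative only, not a normative API):

```python
import numpy as np

def check_duck_basics(arr):
    # Core attributes must exist, and .dtype should be a real
    # numpy.dtype instance so dtype-based dispatch keeps working.
    assert hasattr(arr, "shape") and hasattr(arr, "ndim")
    assert isinstance(arr.dtype, np.dtype)
```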
A (very) rough sketch of future plans#
The proposals discussed so far – __array_ufunc__ and some kind of asarray protocol – are clearly necessary but not sufficient for full duck typing support. We expect the need for additional protocols to support (at least) these features:
- Concatenating duck arrays, which would be used internally by other array combining methods like stack/vstack/hstack. The implementation of concatenate will need to be negotiated among the list of array arguments. We expect to use an __array_concatenate__ protocol like __array_ufunc__ instead of multiple dispatch (see the sketch after this list).
- Ufunc-like functions that currently aren’t ufuncs. Many NumPy functions like median, percentile, sort, where and clip could be written as generalized ufuncs but currently aren’t. Either these functions should be written as ufuncs, or we should consider adding another generic wrapper mechanism that works similarly to ufuncs but makes fewer guarantees about how the implementation is done.
- Random number generation with duck arrays, e.g., np.random.randn(). For example, we might want to add new APIs like random_like() for generating new arrays with a matching shape and type – though we’ll need to look at some real examples of how these functions are used to figure out what would be helpful.
- Miscellaneous other functions such as np.einsum, np.zeros_like, and np.broadcast_to that don’t fall into any of the above categories.
- Checking mutability on duck arrays, which would imply that they support assignment with __setitem__ and the out argument to ufuncs. Many otherwise fine duck arrays are not easily mutable (for example, because they use some kinds of sparse or compressed storage, or are in read-only shared memory), and it turns out that frequently-used code like the default implementation of np.mean needs to check this (to decide whether it can reuse temporary arrays).
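As a purely illustrative sketch of how an __array_concatenate__ dispatch might look (the names and semantics are provisional, per the note below):

```python
import numpy as np

def duck_concatenate(arrays, axis=0):
    """Hypothetical dispatcher, negotiated among the array arguments
    in the style of __array_ufunc__."""
    for arr in arrays:
        handler = getattr(type(arr), "__array_concatenate__", None)
        if handler is not None:
            result = handler(arr, arrays, axis=axis)
            if result is not NotImplemented:
                return result  # first willing array type wins
    return np.concatenate([np.asarray(a) for a in arrays], axis=axis)
```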
We intentionally do not describe exactly how to add support for these types of duck arrays here. These will be the subject of future NEPs.
Copyright#
This document has been placed in the public domain.