Accelerating Python code with Numba and LLVM

Graham Markall

Compiler Engineer, Embecosm

graham.markall@embecosm.com

Twitter: @gmarkall

1

Overview

My background:

This talk is an overview of:

2

What is Numba? (1)

A tool that makes Python code go faster by specialising and compiling it.

3

What is Numba? (2)

4

Implementation overview

5

Who uses Numba?

Random selection of users:

6

Numba example

# Mandelbrot function in Python



def mandel(x, y, max_iters):
    c = complex(x,y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters

    return 255
7

Numba example

# Mandelbrot function in Python using Numba
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x,y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters

    return 255
8

Mandelbrot, 20 iterations

CPython 1x
Numpy array-wide operations 13x
Numba (CPU) 120x
Numba (NVidia Tesla K20c) 2100x
_images/mandel.png
9

Other examples

Times in msec:

Example CPython Numba Speedup
Black-Scholes 969 433 2.2x
Check Neighbours 550 28 19.9x
IS Distance 372 70 5.4x
Pairwise 62 12 5.1x
10

Dispatch process

Calling a @jit function:

  1. Lookup types of arguments
  2. Do any compiled versions match the types of these arguments?
  1. Yes: retrieve the compiled code from the cache
  2. No: compile a new specialisation
  1. Marshal arguments to native values
  2. Call the native code function
  3. Marshal the native return value to a Python value
11

Dispatch overhead

@jit
def add(a, b):
    return a + b

def add_python(a, b):
    return a + b
>>> %timeit add(1, 2)
10000000 loops, best of 3: 163 ns per loop

>>> %timeit add_python(1, 2)
10000000 loops, best of 3: 85.3 ns per loop
12

Compilation pipeline

_images/archi2.png
13

Type Inference

float32 + float32 -> float32
int32   + float32 -> float64
def f(a, b):   # a:= float32, b:= float32
    c = a + b  # c:= float32
    return c   # return := float32
14

Type Unification

Example typing 1:

def select(a, b, c):  # a := float32, b := float32, c := bool
    if c:
        ret = a       # ret := float32
    else:
        ret = b       # ret := float32
    return ret       # return := {float32, float32}
                      #           => float32
15

Type Unification

Example typing 2:

def select(a, b, c):  # a := tuple(int32, int32), b := float32,
                      # c := bool
    if c:
        ret = a       # ret := tuple(int32, int32)
    else:
        ret = b       # ret := float32
    return ret       # return := {tuple(int32, int32), float32}
                      #           => XXX
16

LLVM Interface

17

LLVM-PY

18

llvmlite

19

CUDA Backend

20

Auto-vectorization

21

Wrap-up

22

Towards Numba 1.0 release

23

llvmlite future directions

24

Further Reading / Information

25

Questions / discussion summary

26

Extra Slides

27

Supported Python Syntax

Inside functions decorated with @jit:

28

Unsupported Python Syntax

Also inside functions decorated with @jit:

Classes cannot be decorated with @jit.

29

Supported Python Features

30

Supported Python modules

Comprehensive list: http://numba.pydata.org/numba-doc/0.21.0/reference/pysupported.html

31

Supported Numpy features

32

Writing Ufuncs

@vectorize
def rel_diff(x, y):
    return 2 * (x - y) / (x + y)

Call:

a = np.arange(1000, dtype = float32)
b = a * 2 + 1
rel_diff(a, b)
33

Generalized Ufuncs

@guvectorize([(int64[:], int64[:], int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y[0]
34

Layout examples

Matrix-vector products:

@guvectorize([(float64[:, :], float64[:], float64[:])],
              '(m,n),(n)->(m)')
def batch_matmul(M, v, y):
    pass # ...

Fixed outputs (e.g. max and min):

@guvectorize([(float64[:], float64[:], float64[:])],
              '(n)->(),()')
def max_min(arr, largest, smallest):
    pass # ...
35

Modes of compilation

36

Loop lifting

@jit
def sum_strings(arr):
    intarr = np.empty(len(arr), dtype=np.int32)
    for i in range(len(arr)):
        intarr[i] = int(arr[i])
    sum = 0

    # Lifted loop
    for i in range(len(intarr)):
        sum += intarr[i]

     return sum
37

Tips 0 - Profiling

_images/gprof2dot.png
38

Tips 1 - General Approach

39

Tips 2 - Don't Specify Types

@jit(float64(float64, float64))
def add(a, b):
    return a + b
40

Tips 3 - Optimisations

for i in range(len(X)):
    Y[i] = sin(X[i])
for i in range(len(Y)):
    Z[i] = Y[i] * Y[i]
  1. Loop fusion:
for i in range(len(X)):
    Y[i] = sin(X[i])
    Z[i] = Y[i] * Y[i]
  1. Array contraction:
for i in range(len(X)):
    Y = sin(X[i])
    Z[i] = Y * Y
41

Tips 4 - Debugging

42

Tips 5 - Releasing the GIL

@numba.jit(nogil=True)
def my_function(x, y, z):
    ...
43

New Numba Features (0.18 - 0.25)

Including:

44

Parallel & CUDA ufuncs / gufuncs

@vectorize([float64(float64, float64)])
def rel_diff_serial(x, y):
     return 2 * (x - y) / (x + y)

@vectorize(([float64(float64, float64)]), target='parallel')
def rel_diff_parallel(x, y):
    return 2 * (x - y) / (x + y)

For 10^8 elements, on my laptop (i7-2620M, 2 cores + HT):

%timeit rel_diff_serial(x, y)
# 1 loop, best of 3: 556 ms per loop

%timeit rel_diff_parallel(x, y)
# 1 loop, best of 3: 272 ms per loop
45

Parallel / CUDA (g)ufunc guidelines

46

Generated functions

Dispatch based on argument:

47

Generated function example: (1/3)

1-norm for scalar, vector and matrix:

def scalar_1norm(x):
    '''Absolute value of x'''
    return math.fabs(x)

def vector_1norm(x):
    '''Sum of absolute values of x'''
    return np.sum(np.abs(x))

def matrix_1norm(x):
    '''Max sum of absolute values of columns of x'''
    colsums = np.zeros(x.shape[1])
    for i in range(len(colsums)):
        colsums[i] = np.sum(np.abs(x[:, i]))
    return np.max(colsums)
48

Generated function example (2/3)

JITting into a single function using @generated_jit:

def bad_1norm(x):
    raise TypeError("Unsupported type for 1-norm")

@generated_jit(nopython=True)
def l1_norm(x):
    if isinstance(x, types.Number):
        return scalar_1norm
    if isinstance(x, types.Array) and x.ndim == 1:
        return vector_1norm
    elif isinstance(x, types.Array) and x.ndim == 2:
        return matrix_1norm
    else:
        return bad_1norm
49

Generated function example (3)

Calling the generated function:

# Calling

x0 = np.random.rand()
x1 = np.random.rand(M)
x2 = np.random.rand(M * N).reshape(M, N)

l1_norm(x0)
l1_norm(x1)
l1_norm(x2)

# TypeError("Unsupported type for 1-norm")
l1_norm(np.zeros((10, 10, 10))
50

Generated functions guidelines

File "/home/pydata/anaconda3/envs/pydata/lib/python3.5/inspect.py", line 2156,
         in _signature_from_callable
    raise TypeError('{!r} is not a callable object'.format(obj))
TypeError: None is not a callable object
51

JIT Classes

_images/aos_to_soa.png
52

JIT Class AoS to SoA example (1/3)

Original AoS layout using a structured dtype:

dtype = [
    ('x', np.float64),
    ('y', np.float64),
    ('z', np.float64),
    ('w', np.int32)
]

aos = np.zeros(N, dtype)

@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

set_x_aos(aos)
53

JIT Class SoA to AoS example (2/3)

vector_spec = [
    ('N', int32),
    ('x', float64[:]),
    ('y', float64[:]),
    ('z', float64[:]),
    ('w', int32[:])
]

@jitclass(vector_spec)
class VectorSoA(object):
    def __init__(self, N):
        self.N = N
        self.x = np.zeros(N, dtype=np.float64)
        self.y = np.zeros(N, dtype=np.float64)
        self.z = np.zeros(N, dtype=np.float64)
        self.w = np.zeros(N, dtype=np.int32)

soa = VectorSoA(N)
54

JIT Class SoA to AoS example (3/3)

# Example iterating over x with the AoS layout:

@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

# Example iterating over x with the SoA layout:

@jit(nopython=True)
def set_x_soa(v):
    for i in range(v.N):
        v.x[i] = i
55

JIT Class guidelines

numba.errors.LoweringError: Failed at nopython
    (nopython mode backend)
Internal error:
TypeError: Can only insert i32* at [4] in
    {i8*, i8*, i64, i64, i32*, [1 x i64], [1 x i64]}:
    got float*
56

CFFI and Numba

Two modes:

57

CFFI / Numba demo

Note: this is an example of a general procedure to wrap a library and use it with Numba. The demo won't run without VML development files.

Accelerate from Continuum provides VML functions as ufuncs.

58

CFFI Guidelines

59

Other New Numba Features

60