What's New in High-Performance Python?

Graham Markall

@gmarkall

1

Hello! (About Me)

2

Overview

Two aspects of performance:

Some new tools / features for both of these:

3

My existing profiling toolbox

_images/gprof2dot.png
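
The slide text is not preserved, but gprof2dot renders call graphs from profiler output such as the standard library's cProfile. A minimal sketch of that workflow, assuming cProfile is the data source; `work` is a hypothetical stand-in for the code being profiled:

```python
import cProfile
import io
import pstats


def work(n):
    """Hypothetical workload: partial sum of 1 / i^2."""
    total = 0.0
    for i in range(1, n + 1):
        total += 1.0 / (i * i)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = work(100_000)
profiler.disable()

# Print the five most expensive entries by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)

# To draw the call graph, the same data can be written to a file:
#   profiler.dump_stats("work.pstats")
# and then rendered from a shell:
#   gprof2dot -f pstats work.pstats | dot -Tpng -o work.png
```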
4

Accelerate Data Profiling

5

Accelerate Data Profiling Demo

6

Data Profiling Guidelines

7

Intel VTune

8

VTune short example

from concurrent.futures import ProcessPoolExecutor

def processSlice(id, sliceSize, delta):
    # Midpoint-rule sum of 1 / (1 + x^2) over one slice of [0, 1]
    total = 0.0
    for i in range(1 + id * sliceSize, (id + 1) * sliceSize + 1):
        x = (i - 0.5) * delta
        total += 1.0 / (1.0 + x * x)
    return total

def execute(processCount):
    n = 10000000
    delta = 1.0 / n
    sliceSize = n // processCount
    with ProcessPoolExecutor(max_workers=processCount) as e:
        results = [e.submit(processSlice, i, sliceSize, delta)
                   for i in range(processCount)]
        pi = 4.0 * delta * sum(item.result() for item in results)
    return pi
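
The workload is a midpoint-rule estimate of pi, split into slices so that each process sums one part. A serial sketch of the same slicing, handy for checking that the result does not depend on the slice count; `estimate_pi` and the smaller `n` are assumptions made here so that it runs quickly without a process pool:

```python
import math


def process_slice(slice_id, slice_size, delta):
    """Midpoint-rule partial sum of 1 / (1 + x^2) for one slice of [0, 1]."""
    total = 0.0
    for i in range(1 + slice_id * slice_size, (slice_id + 1) * slice_size + 1):
        x = (i - 0.5) * delta
        total += 1.0 / (1.0 + x * x)
    return total


def estimate_pi(n, slice_count):
    """Serial stand-in for execute(): same slicing, no process pool."""
    delta = 1.0 / n
    slice_size = n // slice_count
    partial_sums = [process_slice(i, slice_size, delta)
                    for i in range(slice_count)]
    return 4.0 * delta * sum(partial_sums)


n = 320_000  # much smaller than the slide's 10,000,000; divisible by 32
estimates = {count: estimate_pi(n, count) for count in (1, 2, 8, 32)}
for count, value in estimates.items():
    print(count, value, abs(value - math.pi))
```

Each slice count should give the same estimate up to floating-point rounding, since only the grouping of the additions changes.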
9

Session setup

execute(1)
execute(2)
execute(8)
execute(32)
_images/sessionsetup.png
10

Basic hotspots analysis

Produces CPU usage summary:

_images/cpuusage.png
11

VTune CPU Timeline

_images/cputimeline.png
12

VTune function summary

_images/functionsummary.png
13

VTune Python functions in summary

_images/pythonfunctions.png
14

VTune Guidelines

When is VTune the tool to use?

15

New Numba Features (0.18 - 0.25)

Including:

16

Quick Numba intro

from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x,y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters

    return 255

Speedup relative to CPython:

CPython                         1x
NumPy array-wide operations    13x
Numba (CPU)                   120x
Numba (NVIDIA Tesla K20c)    2100x
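
The NumPy version behind the "array-wide operations" figure is not shown on the slide. One plausible form is sketched below, as an assumption rather than the code that was benchmarked: every grid point is iterated at once, with a mask tracking the points that have not yet escaped.

```python
import numpy as np


def mandel_numpy(xs, ys, max_iters):
    """Escape-time values for every (x, y) grid point, using array-wide ops."""
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]  # shape (len(ys), len(xs))
    z = np.zeros_like(c)
    out = np.full(c.shape, 255, dtype=np.int64)     # 255 means "never escaped"
    active = np.ones(c.shape, dtype=bool)           # points still iterating
    for i in range(max_iters):
        # Update only the points that have not escaped yet
        z[active] = z[active] * z[active] + c[active]
        escaped = active & (z.real * z.real + z.imag * z.imag >= 4)
        out[escaped] = 255 * i // max_iters
        active &= ~escaped
    return out


xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 2.0])
image = mandel_numpy(xs, ys, 20)
print(image)
```

The Python-level loop now runs max_iters times in total rather than once per pixel, at the cost of temporary arrays on every iteration.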
17

Parallel & CUDA ufuncs / gufuncs

from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def rel_diff_serial(x, y):
    return 2 * (x - y) / (x + y)

@vectorize([float64(float64, float64)], target='parallel')
def rel_diff_parallel(x, y):
    return 2 * (x - y) / (x + y)

For 10^8 elements, on my laptop (i7-2620M, 2 cores + HT):

%timeit rel_diff_serial(x, y)
# 1 loop, best of 3: 556 ms per loop

%timeit rel_diff_parallel(x, y)
# 1 loop, best of 3: 272 ms per loop
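
The slide does not show how `x` and `y` were created. A sketch of a comparable measurement outside IPython, assuming random inputs and a smaller array size; it adds a plain NumPy baseline and skips the Numba ufuncs if Numba is not installed:

```python
import timeit

import numpy as np


def rel_diff_numpy(x, y):
    # Plain NumPy expression: allocates temporaries for (x - y) and (x + y)
    return 2 * (x - y) / (x + y)


candidates = {"numpy": rel_diff_numpy}

try:
    from numba import vectorize, float64
except ImportError:
    pass  # Numba not installed: time the NumPy baseline only
else:
    @vectorize([float64(float64, float64)])
    def rel_diff_serial(x, y):
        return 2 * (x - y) / (x + y)

    @vectorize([float64(float64, float64)], target="parallel")
    def rel_diff_parallel(x, y):
        return 2 * (x - y) / (x + y)

    candidates["numba serial"] = rel_diff_serial
    candidates["numba parallel"] = rel_diff_parallel

size = 10**6  # assumed size for a quick run; the slide used 10^8 elements
rng = np.random.default_rng(0)
x = rng.random(size) + 1.0  # the offset keeps x + y away from zero
y = rng.random(size) + 1.0

results = {}
timings = {}
for name, func in candidates.items():
    results[name] = func(x, y)
    # Best of three single runs, in the spirit of %timeit's "best of 3"
    timings[name] = min(timeit.repeat(lambda: func(x, y), number=1, repeat=3))
    print(f"{name}: {timings[name] * 1e3:.1f} ms")
```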
18

Parallel / CUDA (g)ufunc guidelines

19

Generated functions

Dispatch based on argument:

20

Generated function example (1/3)

1-norm for scalar, vector and matrix:

import math
import numpy as np

def scalar_1norm(x):
    '''Absolute value of x'''
    return math.fabs(x)

def vector_1norm(x):
    '''Sum of absolute values of x'''
    return np.sum(np.abs(x))

def matrix_1norm(x):
    '''Max sum of absolute values of columns of x'''
    colsums = np.zeros(x.shape[1])
    for i in range(len(colsums)):
        colsums[i] = np.sum(np.abs(x[:, i]))
    return np.max(colsums)
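
As a sanity check outside Numba, the vector and matrix cases can be compared with `np.linalg.norm(x, 1)`. The sketch below restates the three functions so that it runs on its own, and writes the column loop as a single reduction; the sample values are chosen here for illustration:

```python
import math

import numpy as np


def scalar_1norm(x):
    return math.fabs(x)


def vector_1norm(x):
    return np.sum(np.abs(x))


def matrix_1norm(x):
    # Vectorised form of the column loop: sum |x| down each column, take the max
    return np.max(np.sum(np.abs(x), axis=0))


v = np.array([1.0, -2.0, 3.0])
m = np.array([[1.0, -2.0],
              [-3.0, 4.0]])

print(scalar_1norm(-2.5))                     # 2.5
print(vector_1norm(v), np.linalg.norm(v, 1))  # 6.0 6.0
print(matrix_1norm(m), np.linalg.norm(m, 1))  # 6.0 6.0
```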
21

Generated function example (2/3)

JITting into a single function using @generated_jit:

from numba import generated_jit, types

def bad_1norm(x):
    raise TypeError("Unsupported type for 1-norm")

@generated_jit(nopython=True)
def l1_norm(x):
    # x is a Numba type here, not a value; return the implementation to compile
    if isinstance(x, types.Number):
        return scalar_1norm
    elif isinstance(x, types.Array) and x.ndim == 1:
        return vector_1norm
    elif isinstance(x, types.Array) and x.ndim == 2:
        return matrix_1norm
    else:
        return bad_1norm
22

Generated function example (3/3)

Calling the generated function:

# M and N are array dimensions defined elsewhere
x0 = np.random.rand()
x1 = np.random.rand(M)
x2 = np.random.rand(M * N).reshape(M, N)

l1_norm(x0)
l1_norm(x1)
l1_norm(x2)

# TypeError("Unsupported type for 1-norm")
l1_norm(np.zeros((10, 10, 10)))
23

Generated functions guidelines

File "/home/pydata/anaconda3/envs/pydata/lib/python3.5/inspect.py", line 2156, in _signature_from_callable
    raise TypeError('{!r} is not a callable object'.format(obj))
TypeError: None is not a callable object
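
One plausible reading of this traceback (the slide's bullet text is not preserved) is that the selection function returned None for some argument type, for example by falling off the end without an `else` branch. The same message can be reproduced in plain Python; `choose_impl` and `describe` are hypothetical names used only for this sketch:

```python
import inspect


def scalar_impl(x):
    return abs(x)


def choose_impl(kind):
    """Hypothetical selector that only handles the 'scalar' case."""
    if kind == "scalar":
        return scalar_impl
    # Falling off the end returns None


def describe(impl):
    # inspect.signature is the call that fails in the traceback above
    return str(inspect.signature(impl))


print(describe(choose_impl("scalar")))  # (x)

message = None
try:
    describe(choose_impl("matrix"))
except TypeError as exc:
    message = str(exc)
    print(message)
```

Ending the selection function with an explicit fallback, as `bad_1norm` does in the example, avoids this failure.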
24

JIT Classes

_images/aos_to_soa.png
25

JIT Class AoS to SoA example (1/3)

Original AoS layout using a structured dtype:

import numpy as np
from numba import jit

dtype = [
    ('x', np.float64),
    ('y', np.float64),
    ('z', np.float64),
    ('w', np.int32)
]

aos = np.zeros(N, dtype)

@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

set_x_aos(aos)
26

JIT Class AoS to SoA example (2/3)

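import numpy as np
from numba import jitclass, int32, float64  # Numba >= 0.49: from numba.experimental import jitclass

vector_spec = [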
    ('N', int32),
    ('x', float64[:]),
    ('y', float64[:]),
    ('z', float64[:]),
    ('w', int32[:])
]

@jitclass(vector_spec)
class VectorSoA(object):
    def __init__(self, N):
        self.N = N
        self.x = np.zeros(N, dtype=np.float64)
        self.y = np.zeros(N, dtype=np.float64)
        self.z = np.zeros(N, dtype=np.float64)
        self.w = np.zeros(N, dtype=np.int32)

soa = VectorSoA(N)
27

JIT Class AoS to SoA example (3/3)

# Example iterating over x with the AoS layout:

@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

# Example iterating over x with the SoA layout:

@jit(nopython=True)
def set_x_soa(v):
    for i in range(v.N):
        v.x[i] = i
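
The difference between the two loops shows up in the memory layout. A NumPy-only sketch, assuming the same four fields and using a dict of arrays as a stand-in for the jitclass, compares the stride between consecutive `x` values in each layout:

```python
import numpy as np

N = 1000  # hypothetical element count

aos_dtype = np.dtype([
    ('x', np.float64),
    ('y', np.float64),
    ('z', np.float64),
    ('w', np.int32),
])
aos = np.zeros(N, dtype=aos_dtype)

# SoA stand-in: one contiguous array per field
soa = {
    'x': np.zeros(N, dtype=np.float64),
    'y': np.zeros(N, dtype=np.float64),
    'z': np.zeros(N, dtype=np.float64),
    'w': np.zeros(N, dtype=np.int32),
}

print(aos_dtype.itemsize)  # 28 bytes per record (3 * 8 + 4, packed)
print(aos['x'].strides)    # (28,): consecutive x values are 28 bytes apart
print(soa['x'].strides)    # (8,): consecutive x values are adjacent
```

Writing every `x` in the AoS layout strides over the unused `y`, `z` and `w` bytes, while the SoA layout touches only the `x` array.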
28

JIT Class guidelines

numba.errors.LoweringError: Failed at nopython
    (nopython mode backend)
Internal error:
TypeError: Can only insert i32* at [4] in
    {i8*, i8*, i64, i64, i32*, [1 x i64], [1 x i64]}:
    got float*
29

CFFI and Numba

Two modes:

30

CFFI / Numba demo

Note: this is an example of a general procedure to wrap a library and use it with Numba. The demo won't run without VML development files.

Accelerate from Continuum provides VML functions as ufuncs.

31

CFFI Guidelines

32

Other New Numba Features

33

Further Reading / Information

34