What's New in High-Performance Python?
Graham Markall
@gmarkall
Two aspects of performance:
Some new tools / features for both of these:
concurrent.futures
from concurrent.futures import ProcessPoolExecutor

def processSlice(id, sliceSize, delta):
    sum = 0.0
    for i in range(1 + id * sliceSize, (id + 1) * sliceSize + 1):
        x = (i - 0.5) * delta
        sum += 1.0 / (1.0 + x * x)
    return sum

def execute(processCount):
    n = 10000000
    delta = 1.0 / n
    sliceSize = n // processCount
    with ProcessPoolExecutor(max_workers=processCount) as e:
        results = [e.submit(processSlice, i, sliceSize, delta)
                   for i in range(processCount)]
        pi = 4.0 * delta * sum(item.result() for item in results)

execute(1)
execute(2)
execute(8)
execute(32)
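The sum in processSlice is a midpoint-rule approximation of the integral of 4 / (1 + x^2) over [0, 1], which equals pi. A quick serial sketch (smaller n, no process pool) shows the same arithmetic converging:

```python
import math

def approx_pi(n):
    # Midpoint rule for the integral of 4 / (1 + x^2) over [0, 1]
    delta = 1.0 / n
    total = 0.0
    for i in range(1, n + 1):
        x = (i - 0.5) * delta
        total += 1.0 / (1.0 + x * x)
    return 4.0 * delta * total

print(approx_pi(100000))  # very close to math.pi
```

The parallel version simply splits this loop into processCount contiguous slices and sums the partial results.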
Produces a CPU usage summary: look for PyCFunction_Call, or methods ending with _Eval.
When is VTune the tool to use?
Numba: @jit(nopython=True, nogil=True)
Cython: with nogil: / cdef ... nogil:
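A sketch of how a nogil-jitted function can be driven from multiple threads and still run concurrently. The try/except fallback stub is an assumption added here so the snippet also runs where Numba is not installed:

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from numba import jit
except ImportError:
    # Fallback stub so the sketch runs without Numba installed
    def jit(**kwargs):
        def wrap(func):
            return func
        return wrap

@jit(nopython=True, nogil=True)
def partial_sum(start, stop):
    # With nogil=True the compiled body releases the GIL,
    # so threads can execute it truly in parallel
    total = 0.0
    for i in range(start, stop):
        total += i
    return total

with ThreadPoolExecutor(max_workers=4) as ex:
    chunks = [(i * 250000, (i + 1) * 250000) for i in range(4)]
    futures = [ex.submit(partial_sum, a, b) for a, b in chunks]
    total = sum(f.result() for f in futures)

print(total)  # sum of 0..999999 = 499999500000.0
```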
groupby, and others. Including:
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
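A quick check of the escape-time logic (the function is repeated so the snippet is self-contained, and the try/except fallback is an assumption so it runs with or without Numba):

```python
try:
    from numba import jit
except ImportError:
    def jit(func):
        return func  # no-op fallback when Numba is absent

@jit
def mandel(x, y, max_iters):
    # Escape-time iteration: count how fast z = z^2 + c diverges
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255

print(mandel(0.0, 0.0, 20))  # origin never escapes: 255
print(mandel(2.0, 2.0, 20))  # escapes on the first iteration: 0
```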
Implementation | Speedup |
CPython | 1x |
NumPy array-wide operations | 13x |
Numba (CPU) | 120x |
Numba (NVIDIA Tesla K20c) | 2100x |
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def rel_diff_serial(x, y):
    return 2 * (x - y) / (x + y)

@vectorize([float64(float64, float64)], target='parallel')
def rel_diff_parallel(x, y):
    return 2 * (x - y) / (x + y)
For 10^8 elements, on my laptop (i7-2620M, 2 cores + HT):
%timeit rel_diff_serial(x, y)
# 1 loop, best of 3: 556 ms per loop
%timeit rel_diff_parallel(x, y)
# 1 loop, best of 3: 272 ms per loop
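For comparison, @vectorize turns the scalar function into a ufunc that behaves like an ordinary NumPy array expression. A sketch of the equivalent array-wide computation, needing no Numba at all (the +1.0 offset is an assumption here, just to keep the denominator away from zero):

```python
import numpy as np

def rel_diff_numpy(x, y):
    # Array-wide equivalent of the scalar body compiled by @vectorize
    return 2 * (x - y) / (x + y)

x = np.random.rand(1000) + 1.0
y = np.random.rand(1000) + 1.0
out = rel_diff_numpy(x, y)
print(out.shape)  # (1000,)
```

The ufunc version wins over this by avoiding the temporaries NumPy allocates for each intermediate result, and target='parallel' additionally splits the work across cores.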
Add target='parallel' or target='cuda' to the @vectorize decorator:
@vectorize([args], target='parallel')
Dispatch based on argument:
1-norm for scalar, vector and matrix:
import math
import numpy as np

def scalar_1norm(x):
    '''Absolute value of x'''
    return math.fabs(x)

def vector_1norm(x):
    '''Sum of absolute values of x'''
    return np.sum(np.abs(x))

def matrix_1norm(x):
    '''Max sum of absolute values of columns of x'''
    colsums = np.zeros(x.shape[1])
    for i in range(len(colsums)):
        colsums[i] = np.sum(np.abs(x[:, i]))
    return np.max(colsums)
JITting into a single function using @generated_jit:
from numba import generated_jit, types

def bad_1norm(x):
    raise TypeError("Unsupported type for 1-norm")

@generated_jit(nopython=True)
def l1_norm(x):
    if isinstance(x, types.Number):
        return scalar_1norm
    elif isinstance(x, types.Array) and x.ndim == 1:
        return vector_1norm
    elif isinstance(x, types.Array) and x.ndim == 2:
        return matrix_1norm
    else:
        return bad_1norm
Calling the generated function:
# Calling
x0 = np.random.rand()
x1 = np.random.rand(M)
x2 = np.random.rand(M * N).reshape(M, N)
l1_norm(x0)
l1_norm(x1)
l1_norm(x2)
# TypeError("Unsupported type for 1-norm")
l1_norm(np.zeros((10, 10, 10)))
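The scalar, vector, and matrix definitions agree with NumPy's built-in 1-norms, which makes for a handy sanity check (pure NumPy, no Numba required; the small test values are assumptions for illustration):

```python
import numpy as np

def vector_1norm(x):
    # Sum of absolute values
    return np.sum(np.abs(x))

def matrix_1norm(x):
    # Max column sum of absolute values
    return np.max(np.abs(x).sum(axis=0))

v = np.array([1.0, -2.0, 3.0])
m = np.array([[1.0, -4.0],
              [2.0,  5.0]])

print(vector_1norm(v))           # 6.0
print(np.linalg.norm(v, ord=1))  # 6.0
print(matrix_1norm(m))           # 9.0 (second column: 4 + 5)
print(np.linalg.norm(m, ord=1))  # 9.0
```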
Look in numba.types to see types and attributes:
- Types: Array, Number, Integer, Float, List
- Attributes: ndim, array dtype, tuple dtype or types
- Buffer is the base for a lot of things, including Array
Returning None from the generated function:
File "/home/pydata/anaconda3/envs/pydata/lib/python3.5/inspect.py", line 2156, in _signature_from_callable
    raise TypeError('{!r} is not a callable object'.format(obj))
TypeError: None is not a callable object
Original AoS layout using a structured dtype:
dtype = [
    ('x', np.float64),
    ('y', np.float64),
    ('z', np.float64),
    ('w', np.int32)
]
aos = np.zeros(N, dtype)

@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

set_x_aos(aos)
from numba import jitclass, int32, float64

vector_spec = [
    ('N', int32),
    ('x', float64[:]),
    ('y', float64[:]),
    ('z', float64[:]),
    ('w', int32[:])
]

@jitclass(vector_spec)
class VectorSoA(object):
    def __init__(self, N):
        self.N = N
        self.x = np.zeros(N, dtype=np.float64)
        self.y = np.zeros(N, dtype=np.float64)
        self.z = np.zeros(N, dtype=np.float64)
        self.w = np.zeros(N, dtype=np.int32)

soa = VectorSoA(N)
# Example iterating over x with the AoS layout:
@jit(nopython=True)
def set_x_aos(v):
    for i in range(len(v)):
        v[i]['x'] = i

# Example iterating over x with the SoA layout:
@jit(nopython=True)
def set_x_soa(v):
    for i in range(v.N):
        v.x[i] = i
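The two layouts differ in memory stride: in the AoS array the x values are interleaved with y, z, and w, while each SoA array is contiguous, which matters for vectorization and cache use. A pure-NumPy sketch of the difference:

```python
import numpy as np

N = 100
dtype = [('x', np.float64), ('y', np.float64),
         ('z', np.float64), ('w', np.int32)]
aos = np.zeros(N, dtype)

# SoA: one contiguous array per field
soa_x = np.zeros(N, dtype=np.float64)

# AoS: stepping through 'x' skips over y, z, and w each time
print(aos['x'].strides)  # (28,) - one 8+8+8+4 byte record per step
print(soa_x.strides)     # (8,)  - densely packed doubles
```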
@jit methods with names beginning with _ or __ are not supported yet - see PR #1851
A common error: a field declared as np.int32, assigning np.float64:
numba.errors.LoweringError: Failed at nopython (nopython mode backend)
Internal error:
TypeError: Can only insert i32* at [4] in
{i8*, i8*, i64, i64, i32*, [1 x i64], [1 x i64]}:
got float*
Two modes:
Note: this is an example of a general procedure to wrap a library and use it with Numba. The demo won't run without the VML development files.
Accelerate from Continuum provides VML functions as ufuncs.
register_module and register_type tell Numba how to map the type.
ffi.from_buffer does not type check.
Interval class: