First, let us look at Step 2. It is a nice exercise to write a script that splits a vector into m pieces with a balanced number of elements. Here is one suggestion for such a script, among many others:
import numpy

def split_array(vector, n_processors):
    # splits an array into a number of subarrays
    # vector: one-dimensional ndarray or a list
    # n_processors: integer, the number of subarrays to be formed
    n = len(vector)
    n_portions, rest = divmod(n, n_processors)  # division with remainder
    # get the amount of data per processor and distribute the rest on
    # the first processors so that the load is more or less equally
    # distributed

    # construction of the indexes needed for the splitting
    counts = [0] + [n_portions + 1
                    if p < rest else n_portions for p in range(n_processors)]
    counts = numpy.cumsum(counts)
    start_end = zip(counts[:-1], counts[1:])       # an iterator of index pairs
    slice_list = (slice(*sl) for sl in start_end)  # a generator expression
    return [vector[sl] for sl in slice_list]       # the list of subarrays
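As a quick check of the balancing, one might call the function on a small array; the array length and the number of processors below are chosen just for illustration:

v = numpy.arange(10)             # ten elements to distribute
pieces = split_array(v, 3)       # split among three workers
print([len(p) for p in pieces])  # [4, 3, 3]: the remainder goes to the first piece

Here divmod(10, 3) gives three elements per piece with a remainder of one, and that extra element is assigned to the first piece, so no piece differs in length by more than one element.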