Dask parallelize a short task that uses a large np.ndarray

Question

I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
Executing f takes very little time (about 5 ms).

A for loop over a matrix M with many rows

for x in M:
    f(x)

takes about 5 times longer than parallelizing using multiprocessing

import multiprocessing

with multiprocessing.Pool() as pool:
    pool.map(f, M)

I have tried to parallelize with Dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, such as partitioning the data as the best practices recommend and using dask.bag. I'm running Dask on a local machine with 4 physical cores.

So the question is how to use dask with short tasks that take large data as input?

you say you tried Dask, but don't show your attempt! additionally, what operating system are you using? (Windows systems often cannot effectively use `multiprocessing`) — ti7, Mar 07 '22 at 19:17
The answer will depend on the process creation https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods . I assume your multiprocessing is using fork and dask is using forkserver. — mdurant, Mar 08 '22 at 14:12
based on the description of "short task" I would say don't use Multiprocessing. Use `numba` instead. — Aaron, Mar 08 '22 at 21:20

score 2 · Answer 1 · answered Mar 11 '22 at 20:18

Firstly, the dask documentation makes clear the following contraindications:

it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running processes such as numpy that release the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a python loop over an array, you should vectorize
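To illustrate the last three points, here is a minimal sketch. It assumes a hypothetical f (sum of squares of a row), since the real f is not shown, and uses the standard library's thread pool rather than Dask itself. The names f_block, M, and the block count of 4 are illustrative choices only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def f(x):
    """Hypothetical stand-in for the real f: sum of squares of one row."""
    return float(np.dot(x, x))


def f_block(block):
    """Vectorized version of f applied to a whole block of rows at once."""
    return np.einsum("ij,ij->i", block, block)


rng = np.random.default_rng(0)
M = rng.standard_normal((100, 20000))  # small illustrative matrix

# Row-by-row Python loop, as in the question.
loop_result = np.array([f(x) for x in M])

# Vectorized: one call covers all rows, with no Python-level loop.
vectorized_result = f_block(M)

# Threads share M in memory, so nothing is copied or serialised.
# Each task handles a block of rows to keep per-task overhead small.
blocks = np.array_split(M, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_result = np.concatenate(list(pool.map(f_block, blocks)))
```

Whether threads actually help depends on f spending most of its time inside numpy calls that release the GIL; if f is mostly pure Python, threads will not give a speedup.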

Since we don't know much about what you are doing or about your system, I can only guess why dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably created the processes via fork and copied (or copy-on-write duplicated) the array into each process, so they can all access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
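To get a feel for the serialisation cost, you could time a pickle round trip of a single row. This is only a rough proxy, under the assumption that x is a float64 array of length 20000: Dask has its own serialisation layer, and the actual transfer over IPC adds further overhead that this sketch does not measure.

```python
import pickle
import time

import numpy as np

# One row of the assumed shape and dtype (float64, length 20000).
x = np.random.default_rng(0).standard_normal(20000)

start = time.perf_counter()
payload = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)  # serialise
restored = pickle.loads(payload)                             # deserialise
elapsed = time.perf_counter() - start

print(f"array bytes:  {x.nbytes}")
print(f"pickle bytes: {len(payload)}")
print(f"round trip:   {elapsed * 1e3:.3f} ms")
```

If this per-row cost, multiplied by the number of rows, is comparable to the total time spent in f, then sending rows to workers one at a time cannot pay off, and letting the workers load or create the data themselves is the better option.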
