Multiprocessing in standalone mode, poor speed-up

Description of problem

I am trying to simulate multiple networks in parallel in standalone mode, but the speedup I get is quite poor. Is there anything that can be done about it?

Minimal code to reproduce problem

Here is a toy example, with a setup similar to a previous issue:
https://brian.discourse.group/t/multiprocessing-in-standalone-mode/142
I use Brian 2.4.1 on Ubuntu 18.04.5.

import joblib
import brian2 as br
import numpy as np
import time
import shutil

def worker(params):
    core_id = params[0]
    # each process builds its standalone project in its own directory
    directory = "standalone" + str(core_id)
    br.set_device('cpp_standalone', directory=directory)

    tau = params[1] * br.ms
    G = br.NeuronGroup(1, 'dv/dt = -v/tau : 1', method='euler')
    G.v = 1
    mon = br.StateMonitor(G, 'v', record=0)
    net = br.Network()
    net.add(G, mon)
    net.run(1000 * br.second)
    res = (mon.t / br.ms, mon.v[0])

    br.device.reinit()
    return res

n_jobs = 2  # number of networks to simulate
tau_values_0 = np.arange(n_jobs) + 5  # create parameters for these networks

if __name__ == "__main__":
    with joblib.Parallel(n_jobs=n_jobs) as parallel:  # set n_jobs=1 for the sequential plot
        start = time.time()
        res = parallel(joblib.delayed(worker)([i, tau_values_0[i]]) for i in range(n_jobs))
        print(str(round(time.time() - start, 2)) + "s")

    # delete the directories created
    for i in range(n_jobs):
        path = "standalone" + str(i)
        shutil.rmtree(path)

Actual output

Below is a plot of the execution time as a function of the number of networks, either simulated in parallel (code above for various n_jobs), or sequentially.
The simulation time in sequential mode is linear in the number of networks, which is ok. The simulation time in parallel mode is sublinear, which is nice, but I still find it scales poorly. The computer used to generate the plots has 4 cores / 8 threads. I’m especially surprised that it takes substantially longer to run e.g. 3 vs 2 networks, or 6 vs 5 for which case there should be a free core for joblib to use without any increase in execution time.

[Plot: plot_brian2_parallelVSsequential]

Is there anything that I am doing wrong?
Thank you in advance for any help.

Hi. Parallelization is always a tricky issue. To be honest, I think the speed-up you are seeing already looks pretty good, but I get that you were expecting a basically flat curve up to 4 networks. You would almost get something like that if all the simulation did was use the CPU (even then, there is always a small overhead from running multiple processes). This is not quite the case here, though: at the beginning the code is generated and compiled, and at the end all results are written to disk (this is how Python accesses the results from standalone mode). Both of these operations make quite heavy use of your hard disk and will therefore be slower the more processes run in parallel. In a toy example like this, they probably take up a significant part of the total run time.
If you want to check this, you can have a look at the device._last_run_time attribute, which stores the simulation time as measured during the simulation itself, i.e. without code generation/compilation. You could also run in profiling mode with net.run(..., profile=True) and then have a look at the profiling_summary: the entries shown are all the operations that actually run during the simulation, i.e. they do not include the write-to-disk part at the end. Note, though, that things run slower than normal in this mode. Finally, all of this should matter less and less as your model becomes more complex and does more computation per time step, since the compilation and write-to-disk time is almost independent of model complexity.
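For reference, here is a minimal sketch of how this could look inside the worker from your toy example (the extra return value and the profiling call are only illustrations; as mentioned, profiling slows the simulation down):

import brian2 as br

def worker(params):
    core_id, tau_value = params
    br.set_device('cpp_standalone', directory="standalone" + str(core_id))

    tau = tau_value * br.ms
    G = br.NeuronGroup(1, 'dv/dt = -v/tau : 1', method='euler')
    G.v = 1
    mon = br.StateMonitor(G, 'v', record=0)
    net = br.Network(G, mon)
    net.run(1000 * br.second, profile=True)  # profile=True records per-object timings

    sim_time = br.device._last_run_time      # time spent in the compiled simulation itself,
                                             # i.e. without code generation/compilation
    print(br.profiling_summary(net))         # per-object breakdown (no write-to-disk part)

    res = (mon.t / br.ms, mon.v[0], sim_time)
    br.device.reinit()
    return res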

Hi, thanks for the useful reply.

Below is the updated graph from my previous post, now also plotting the average of device._last_run_time. As you explained above, the time for the simulation itself is almost constant.

[Plot: plot_brian2_parallelVSsequential_withDeviceTime]

However, when using a more complex model (the Vogels 2011 network from the examples, with no monitors and nothing returned by the worker), the scaling does not improve; if anything, it gets worse (see next post, I apparently can't include several plots in a single post).

So I guess my question is: do you have any tricks to improve this part of the execution?
In other words, is there any chance I can bring the blue line closer to the dotted blue line?
(Ultimately I will need to use a hundred networks/cores.)

Thank you in advance for your help.

[Plot: plot_brian2_parallelVSsequential_withDeviceTime_bigNet]

Unfortunately there is not much general advice I can give; it all depends on the details of the model. The compilation has a more or less fixed cost (it basically depends on the number of objects) and does not depend on the size of the network, the complexity of the model, or the length of the simulation. The time it takes therefore becomes negligible when you scale up the network or simulate it for a longer time. If you have a small/simple/short-running network where compilation takes a large part of the time, you might even be better off not using code generation at all (i.e. without set_device and with prefs.codegen.target = 'numpy')!
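To make that last point concrete, here is roughly what the toy worker from the first post could look like in runtime mode with the numpy target, i.e. without any code generation or compilation (just a sketch; the worker now takes the tau value directly since no per-process directory is needed):

import brian2 as br

br.prefs.codegen.target = 'numpy'  # pure-numpy runtime mode: nothing gets compiled

def worker(tau_value):
    tau = tau_value * br.ms
    G = br.NeuronGroup(1, 'dv/dt = -v/tau : 1', method='euler')
    G.v = 1
    mon = br.StateMonitor(G, 'v', record=0)
    net = br.Network(G, mon)
    net.run(1000 * br.second)  # each time step is slower than compiled code, but there is
                               # no fixed compilation cost per process
    return mon.t / br.ms, mon.v[0]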
We actually discussed this a bit in our Brian2GeNN paper where this is even more of an issue since the code generation for GeNN + CUDA compilation takes a very long time.

Having said all that, the most important way to reduce the compilation time in standalone mode is to reduce the number of objects/simulations. In the first toy example, you could obviously simulate all the neurons with their different time constants in a single network, which would only need a single compilation. In the Vogels 2011 network there is a more subtle issue: if you use it as it is, with its two run statements, these effectively double the number of objects that have to be compiled. For compilation time it would be more efficient to change the eta variable with a run_regularly operation, for example (see the sketches below).
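For the toy example, a single-compilation version could look roughly like this (a sketch: tau becomes a per-neuron parameter, so one standalone build covers all parameter values):

import numpy as np
import brian2 as br

br.set_device('cpp_standalone', directory='standalone_all')

n_values = 8
G = br.NeuronGroup(n_values, '''dv/dt = -v/tau : 1
                                tau : second (constant)''', method='euler')
G.tau = (np.arange(n_values) + 5) * br.ms  # one tau value per neuron
G.v = 1
mon = br.StateMonitor(G, 'v', record=True)
net = br.Network(G, mon)
net.run(1000 * br.second)
# mon.v[i] is now the trace simulated with the i-th tau value

For the Vogels 2011 network, the idea would be something along these lines (a fragment meant to slot into that example script; it assumes that eta has been turned into a shared synaptic parameter, e.g. by adding eta : 1 (shared) to the synapse model, instead of being a hard-coded Python constant, and that con_ie and simtime are the names used in the example):

# switch the learning rate on after 5 seconds instead of using two run() calls
con_ie.run_regularly('eta = int(t >= 5*second) * 1e-2', dt=100*ms)
run(simtime)  # a single run, so roughly half the number of compiled code objects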

Finally, some Brian-independent approaches to decreasing the compilation time might help. Using less aggressive compiler optimization (set as part of the preferences) should make compilation a bit faster (but potentially the simulation a bit slower). If you have a lot of memory, you could also point the code directory to a ramdisk, which should be considerably faster. Another option is to compile once and then copy that directory over to the directory you create for each new process: some parts will have to be recompiled depending on how the simulations differ, but most files will be unchanged and will therefore be skipped. You can also automate this with a tool like ccache. If you try any of this, please let us know about your experience!
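As a rough illustration of the first two suggestions (the compiler flags and the /dev/shm path below are assumptions about a typical gcc/Linux setup, so adapt them to your machine):

import brian2 as br

# less aggressive C++ optimization: compiles faster, but may simulate a bit slower
br.prefs.codegen.cpp.extra_compile_args_gcc = ['-w', '-O1']

def worker(params):
    core_id = params[0]
    # build the standalone project on a ramdisk (tmpfs) instead of the hard disk
    br.set_device('cpp_standalone',
                  directory='/dev/shm/standalone' + str(core_id))
    # ... rest of the worker as before ...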

Thanks a lot for the useful advice! I will try these things out and report back if anything works.
