Description of problem
In the following example, the number of OpenMP threads is set to 5. When the run time is long, e.g. `brian2.run(1000*brian2.second)`, all 5 threads run at the same time. When the run time is short, e.g. `brian2.run(0.5*brian2.second)`, only one thread is active. I observed this effect by running the `top` command and monitoring the “%CPU” value. Is this normal behavior?
Minimal code to reproduce problem
```python
import brian2
import numpy as np

brian2.prefs.devices.cpp_standalone.openmp_threads = 5
brian2.set_device('cpp_standalone', build_on_run=False)

group = brian2.NeuronGroup(10, 'dv/dt = -v / (10*ms) : 1')
mon = brian2.StateMonitor(group, 'v', record=True)

brian2.run(0.5*brian2.second)
#brian2.run(1000*brian2.second)

brian2.device.build(run=False)

results = []
for idx in range(1000):
    brian2.device.run(run_args={group.v: np.arange(10)*0.1})
    results.append(mon.v[0])
```
Expected output (if relevant)
I was expecting the number of threads running at the same time to be independent of the run time.
Also, for this specific example with a long run time, enabling OpenMP with e.g. 10 threads is actually slower than not enabling it. What could be the reason?
Hi @DavidKing2020. I don’t think the first observation is correct; the problem is just that the run is too fast for `top` to catch. E.g. if I run `time ./main` (with 5 threads and `run(0.5*brian2.second)`) in my shell, I get:
```
Executed in   12.76 millis    fish         external
   usr time   53.70 millis    0.00 micros  53.70 millis
   sys time    3.31 millis  274.00 micros   3.04 millis
```
This shows that the user time (the time spent executing on the CPU, summed across all threads) is roughly 5 times the total execution time. This type of measurement is not very exact for short runs, though.

In general, multi-threading is only useful if you have large networks, so it is not surprising that it does not help in your example with 10 neurons. Roughly speaking, if your simulation takes time $x$ for simulating the loop body (i.e. the update of `v` for all neurons) during a time step, then it takes $c\cdot t + \frac{x}{t}$ for $t$ threads, where $c$ is a constant that denotes the overhead for creating a thread. For short/simple simulations, the overhead for creating threads is bigger than what you save by using threads.
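To make that concrete, here is a back-of-the-envelope sketch of the cost model above. The constants `c` and `x` below are made-up placeholders, not measured values; the real numbers depend on your machine, compiler, and OpenMP runtime:

```python
def step_time(x, t, c=1e-6):
    """Estimated time for one simulation time step with t threads,
    following the c*t + x/t model (c and x in seconds, illustrative only)."""
    return c * t + x / t

for x in (1e-7, 1e-5, 1e-3):  # single-threaded work per time step, in seconds
    times = {t: step_time(x, t) for t in (1, 2, 5, 10)}
    best = min(times, key=times.get)
    details = ", ".join(f"t={t}: {v:.1e} s" for t, v in times.items())
    print(f"x={x:g} s -> best thread count: {best} ({details})")
```

For tiny per-step work a single thread wins, and the thread overhead only amortizes once the loop body becomes expensive enough.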
Your example is a bit extreme, since it does almost no work within the loop that gets parallelized. The generated code is (slightly simplified for clarity):

```cpp
const double _lio_1 = exp(1.0f*(0.1 * (- dt))/ms);
#pragma omp parallel for schedule(static)
for(int _idx=0; _idx<_N; _idx++)
{
    double v = _array_neurongroup_v[_idx];
    const double new_v = _lio_1 * v;
    _array_neurongroup_v[_idx] = new_v;
}
```
The relatively expensive `exp` calculation is only done once for all neurons, and the loop body/each thread only does a single multiplication.
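As a rough check in the other direction, the same build/run pattern with a much larger network is where one would expect threads to start paying off. This is only a sketch: the neuron count and thread count are arbitrary choices, and actual timings depend heavily on your machine:

```python
import time
import brian2

brian2.prefs.devices.cpp_standalone.openmp_threads = 5  # set to 0 to compare without OpenMP
brian2.set_device('cpp_standalone', build_on_run=False)

# A large group, so each time step does a meaningful amount of work
group = brian2.NeuronGroup(1_000_000, 'dv/dt = -v / (10*ms) : 1')
brian2.run(1*brian2.second)
brian2.device.build(run=False)  # compile once, without running

start = time.time()
brian2.device.run()  # time only the simulation, not the compilation
print(f"Simulation took {time.time() - start:.2f} s")
```

Comparing the printed time for different `openmp_threads` settings (rebuilding each time) gives a more reliable picture than watching `top`.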
Hope that clears things up a bit!
Hi @mstimberg. This is indeed the case. Thanks for the detailed explanation.