Number of running OpenMP threads depends on the run time

Description of problem

In the following example, the number of threads is set to 5. When the run time is long, e.g. `brian2.run(1000*brian2.second)`, all 5 threads run at the same time. When the run time is short, e.g. `brian2.run(0.5*brian2.second)`, only one thread is active. I observed this effect by running the `top` command and monitoring the `%CPU` number. Is this normal behavior?

Minimal code to reproduce problem

 import brian2
 import numpy as np

 brian2.prefs.devices.cpp_standalone.openmp_threads = 5
 brian2.set_device('cpp_standalone', build_on_run=False)
 group = brian2.NeuronGroup(10, 'dv/dt = -v / (10*ms) : 1')
 mon = brian2.StateMonitor(group, 'v', record=True)
 brian2.run(0.5*brian2.second)
 #brian2.run(1000*brian2.second)
 brian2.device.build(run=False)
 results = []
 for idx in range(1000):
     brian2.device.run(run_args={group.v: np.arange(10)*0.1})
     results.append(mon.v[0])

Expected output (if relevant)

I was expecting the number of threads running at the same time to be independent of the run time.

For this specific example, with a long run time, enabling OpenMP with e.g. 10 threads is actually slower than running without it. What could be the reason?

Hi @DavidKing2020, I don’t think the first observation is correct; the problem is just that the run is too fast for top to catch. E.g. if I run `time ./main` (with 5 threads and `run(0.5*brian2.second)`) in my shell, I get

Executed in   12.76 millis    fish           external
   usr time   53.70 millis    0.00 micros   53.70 millis
   sys time    3.31 millis  274.00 micros    3.04 millis

This shows that the user time (the time spent executing on the CPU, summed across all threads) is roughly 5 times the total execution time. This type of measurement is not very exact for short runs, though. But in general, multi-threading is only useful for large networks, so it is not surprising that it does not help in your example with 10 neurons. Roughly speaking, if simulating the loop body of one time step (i.e. the update of `v` for all neurons) takes time $x$, then with $t$ threads it takes $c \cdot t + \frac{x}{t}$, where $c$ is a constant denoting the overhead of creating a thread. For short/simple simulations, this thread-creation overhead is bigger than what you save by using threads. Your example is a bit extreme, since it does almost no work within the loop that gets parallelized. The generated code is (slightly simplified for clarity):

    const double _lio_1 = exp(1.0f*(0.1 * (- dt))/ms);
    #pragma omp parallel for schedule(static)
    for(int _idx=0; _idx<_N; _idx++)
    {
        double v = _array_neurongroup_v[_idx];
        const double new_v = _lio_1 * v;
        _array_neurongroup_v[_idx] = new_v;
    }

The relatively expensive exp calculation is only done once for all neurons, and the loop body/each thread only does a single multiplication.
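To make the cost model above concrete, here is a toy numeric sketch of $c \cdot t + \frac{x}{t}$. Note that the values of `c` and `x` below are invented for illustration, not measured from Brian2:

```python
# Toy model of per-step wall time with t OpenMP threads:
#   total(t) = c*t + x/t
# x: serial work of one time step's loop body (made-up value)
# c: per-thread creation overhead (made-up value)

def step_time(x, t, c=1e-5):
    """Modelled wall time of one time step with t threads."""
    return c * t + x / t

threads = (1, 5, 10)
# A large network does lots of work per step: threads pay off.
heavy = [step_time(1e-2, t) for t in threads]
# The 10-neuron example does almost none: threads only add overhead.
light = [step_time(1e-7, t) for t in threads]
```

With these (invented) numbers, `heavy` decreases as threads are added while `light` increases, which matches the observation that 10 threads were slower than none for the small example.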

Hope that clears things up a bit!
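As a postscript, the user-time-vs-wall-time check done with `time ./main` above can also be sketched from Python. This is only a sketch: it assumes a Unix-like `os.times()`, and the `python -c` child process is a stand-in for the generated `./main` binary:

```python
import os
import subprocess
import sys
import time

t0 = os.times()
wall_start = time.perf_counter()
# A CPU-bound child process, standing in for the generated ./main binary
subprocess.run(
    [sys.executable, "-c", "sum(i * i for i in range(5_000_000))"],
    check=True,
)
wall = time.perf_counter() - wall_start
t1 = os.times()
# CPU time (user + system) accumulated by child processes during the run
child_cpu = (t1.children_user - t0.children_user) + (
    t1.children_system - t0.children_system
)
busy_threads = child_cpu / wall  # ≈ average number of active threads
```

If the child were a multi-threaded simulation keeping 5 threads busy, `busy_threads` would come out close to 5, just like the `usr time` / wall-time ratio in the `time` output above.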


Hi @mstimberg. This is indeed the case. Thanks for the detailed explanation.
