Description of problem
In the following example, the number of OpenMP threads is set to 5. When the run time is long, e.g. `brian2.run(1000*brian2.second)`, all 5 threads run at the same time. When the run time is short, e.g. `brian2.run(0.5*brian2.second)`, only one thread is active. I observed this effect by running the `top` command and monitoring the “%CPU” value. Is this normal behavior?
Minimal code to reproduce problem
```python
import brian2
import numpy as np

brian2.prefs.devices.cpp_standalone.openmp_threads = 5
brian2.set_device('cpp_standalone', build_on_run=False)

group = brian2.NeuronGroup(10, 'dv/dt = -v / (10*ms) : 1')
mon = brian2.StateMonitor(group, 'v', record=True)

brian2.run(0.5*brian2.second)
#brian2.run(1000*brian2.second)

brian2.device.build(run=False)

results = []
for idx in range(1000):
    brian2.device.run(run_args={group.v: np.arange(10)*0.1})
    results.append(mon.v[0])
```
Expected output (if relevant)
I was expecting the number of threads running at the same time to be independent of the run time.
Also, for this specific example with a long run time, enabling OpenMP with e.g. 10 threads is actually slower than not enabling it. What could be the reason?
Hi @DavidKing2020. I don’t think the first observation is correct; the problem is just that the run is too fast for `top` to catch. E.g. if I run `time ./main` (with 5 threads and `run(0.5*brian2.second)`) in my shell, I get:
```
Executed in   12.76 millis    fish         external
   usr time   53.70 millis    0.00 micros  53.70 millis
   sys time    3.31 millis  274.00 micros   3.04 millis
```
This shows that the user time (the time spent executing on the CPU, summed across all threads) is roughly 5 times the total execution time. This type of measurement is not very exact for short runs, though.

In general, multi-threading is only useful if you have large networks, so it is not surprising that it does not help in your example with 10 neurons. Roughly speaking, if your simulation takes time $x$ for simulating the loop body (i.e. the update of `v` for all neurons) during a time step, then it takes $c\cdot t + \frac{x}{t}$ for $t$ threads, where $c$ is a constant that denotes the overhead for creating a thread. For short/simple simulations, the overhead for creating threads is bigger than what you save by using threads.
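To make that concrete, here is a back-of-the-envelope sketch of the cost model above. The constants `c` and `x` below are made-up placeholders, not measured values; the real numbers depend on your machine, compiler, and OpenMP runtime:

```python
def step_time(x, t, c=1e-6):
    """Estimated time for one simulation time step with t threads,
    following the c*t + x/t model (c and x in seconds, illustrative only)."""
    return c * t + x / t

for x in (1e-7, 1e-5, 1e-3):  # single-threaded work per time step, in seconds
    times = {t: step_time(x, t) for t in (1, 2, 5, 10)}
    best = min(times, key=times.get)
    details = ", ".join(f"t={t}: {v:.1e} s" for t, v in times.items())
    print(f"x={x:g} s -> best thread count: {best} ({details})")
```

For tiny per-step work a single thread wins, and the thread overhead only amortizes once the loop body becomes expensive enough.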
Your example is a bit extreme, since it does almost no work within the loop that gets parallelized. The generated code is (slightly simplified for clarity):

```cpp
const double _lio_1 = exp(1.0f*(0.1 * (- dt))/ms);
#pragma omp parallel for schedule(static)
for(int _idx=0; _idx<_N; _idx++)
{
    double v = _array_neurongroup_v[_idx];
    const double new_v = _lio_1 * v;
    _array_neurongroup_v[_idx] = new_v;
}
```
The relatively expensive `exp` calculation is only done once for all neurons, and the loop body/each thread only does a single multiplication.
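As a rough check in the other direction, the same build/run pattern with a much larger network is where one would expect threads to start paying off. This is only a sketch: the neuron count and thread count are arbitrary choices, and actual timings depend heavily on your machine:

```python
import time
import brian2

brian2.prefs.devices.cpp_standalone.openmp_threads = 5  # set to 0 to compare without OpenMP
brian2.set_device('cpp_standalone', build_on_run=False)

# A large group, so each time step does a meaningful amount of work
group = brian2.NeuronGroup(1_000_000, 'dv/dt = -v / (10*ms) : 1')
brian2.run(1*brian2.second)
brian2.device.build(run=False)  # compile once, without running

start = time.time()
brian2.device.run()  # time only the simulation, not the compilation
print(f"Simulation took {time.time() - start:.2f} s")
```

Comparing the printed time for different `openmp_threads` settings (rebuilding each time) gives a more reliable picture than watching `top`.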
Hope that clears things up a bit!
Hi @mstimberg. This is indeed the case. Thanks for the detailed explanation.