Multiple run in standalone mode

Ziaeemehr · 6 September 2020 16:26

For multiple run using standalone mode I used tutorial part 3 and multiple run and an example from issue #1189:

In each run the time constant tau is changed and simulation run for 1 second.
‍‍‍‍‍‍

def simulate(tau):
    b2.start_scope()

    if standalone_mode:
        b2.get_device().reinit()
        b2.get_device().activate(build_on_run=False,
                                 directory=directory_name)

    net = b2.Network()
    P = b2.PoissonGroup(num_inputs, rates=input_rate)
    G = b2.NeuronGroup(1, eqs, threshold='v>1', reset='v=0', method='euler')
    S = b2.Synapses(P, G, on_pre='v += weight')
    S.connect()
    M = b2.SpikeMonitor(G)
    net.add(P)
    net.add(G)
    net.add(S)
    net.add(M)

    net.run(1000 * b2.ms)
    
    if standalone_mode:
        b2.get_device().build(directory=directory_name,
                              compile=True,
                              run=True,
                              debug=False)
    return M

The question is that Why I am not getting sensible speedup compared to not using standalone mode?
Am I doing something wrong?

To run the simulation :

    start_time = time()
    num_inputs = 100
    input_rate = 10 * b2.Hz
    weight = 0.1
    npoints = 15
    tau_range = np.linspace(1, 15, npoints) * b2.ms
    eqs = '''
    dv/dt = -v/tau : 1
    '''

    output_rates = np.zeros(npoints)
    for ii in range(npoints):

        tau_i = tau_range[ii]
        M = simulate(tau_i)
        output_rates[ii] = M.num_spikes / b2.second

    print("Done in {}".format(time()-start_time))

mstimberg · 7 September 2020 13:55

The total simulation time for a simulation that uses code generation consists of the time for code generation + compilation, and the time of the actual simulation run. The standalone mode accelerates the simulation run, but the code generation + compilation time is the same or even bigger than for runtime mode, because runtime mode does some more advanced caching (and you can avoid the compilation overhead completely by using the store/restore mechanism).
In your example, the simulation is very simple and the simulation run takes very little time, much less than the time to generate/compile the code. Here, the fastest solution (ignoring the store/restore approach) would actually be to switch off code compilation completely by using b2.prefs.codegen.target = 'numpy' – the gain from not having to compile any code will outweigh the slower simulation time. This would change if the simulation itself were more complicated, or simply longer.

There is a way to do this kind of simulation where only a parameter changes between runs, without rebuilding the network and recompiling the code in standalone. Unfortunately, it is not really exposed by Brian’s API at the moment and needs some knowledge about internals. If you want to dive deeper into this, you can have a look at an example in a gist where I demonstrate that technique.

In general, if you want maximal performance you should avoid using multiple runs. One solution is to make a single long run, where you change a parameter (and e.g. reset the membrane potential) periodically with mechanisms such as run_regularly and/or TimedArray. In this case, it would be even more efficient to “vectorize” your example: instead of using a single neuron in the NeuronGroup, you’d have 15 ones, each with a different tau parameter. You can then run a single simulation and get the exact same results.

rth · 7 September 2020 15:47

In this case, it would be even more efficient to “vectorize” your example: instead of using a single neuron in the NeuronGroup , you’d have 15 ones, each with a different tau parameter. You can then run a single simulation and get the exact same results.

I just want to add to Marcel’s answer, this is a pretty standard approach to a parameter search and it works great! In your case the code should look like this

num_inputs = 100
input_rate = 10 * b2.Hz
weight = 0.1
npoints = 15
tau_range = np.linspace(1, 15, npoints) * b2.ms
eqs = '''
    dv/dt = -v/tau : 1
    tau : second
'''
P = b2.PoissonGroup(num_inputs, rates=input_rate)
G = b2.NeuronGroup(npoints, eqs, threshold='v>1', reset='v=0', method='euler')
G.tau = tau_range
S = b2.Synapses(P, G, on_pre='v += weight')
S.connect()
M = b2.SpikeMonitor(G)
b2.run(1000 * b2.ms)
for ii in range(npoints):
   print( ii, float(len(np.where(M.i == ii)[0]) )/b2.second )

That is all.

Ziaeemehr · 7 September 2020 18:48

Thank you @mstimberg and @rth. The answer is quite clear.
I am considering this as a toy example to use the same method for real problems, and in real problem probably the vectorization does not work.

What I got from the example at the provided link is that it change a parameter in the C++ code without need to compile again. Some thing like giving a parameter to the code from terminal without compiling for every run.

$ ./executable tau

int main(char **argv)
double tau = atof(argv[1]);
...

Does it also work for more that one parameter?
I know a bit C++ but not familiar with internal code of the Brian.
Is it straightforward to track some parameters in the C++ code and feed their values to the code from the terminal and just use an script to run the code many times?
I suppose that the network and everything remain the same as before and only a few parameters change in each run.

rth · 7 September 2020 23:42

I don’t see any problems here. I understand your confusion, I code in C/C++ for many years, and it was a bit hard to adjust the way of thinking.

You are completely right. In C/C++ we usually assume one process - one parameter set. We use bash (or any other sh) to run many instances of the same program and supply parameters as command-line arguments. C-code is compiled once and then you use it many times, but python is an interpreter, pretty clever, but still an interpreter. So it builds the code on the way and this adds time to overall performance. As Marcel said, if your calculations are heavy enough to keep CPU busy longer than the compilation, you can ignore compilation time. In this case, you can build an application which runs only one one parameter set:

import sys
if len(sys.argv) < 2 :
  print("USAGE: program tau")
  exit(1)
try:
  tau = float(sys.argv[1])
except:
  raise ValueError(" Cannot convert {} into float".format(sys.argv[1]) )

and launch this code in the same way as your c-code python single-run.py tau

If a single parameter evaluation is lite (like in your example), the compilation will take longer than computations themselves. So if you want to minimize overall time - you need to pack more parameter evaluations in one run. This is quite easy to implement

import sys
if len(sys.argv) < 4 :
  print("USAGE: program tau-min tua-max number")
  exit(1)
taumin,taumax,npoints = sys.argv[1:4]
try:
  taumin = float(taumin)
except:
  raise ValueError(" Cannot convert {} into float".format(taumin) )
try:
  taumax = float(taumax)
except:
  raise ValueError(" Cannot convert {} into float".format(taumax) )
try:
  npoints = int(npoints)
except:
  raise ValueError(" Cannot convert {} into int".format(npoints) )
tau_range = np.linspace(taumin, taumax, npoints) * b2.ms

Now you can run this program as python example.py 1. 15. 15

mstimberg · 8 September 2020 14:18

Adding to @rth’s comments. If you have many short simulations that you cannot vectorize then it would indeed be useful to run the same standalone simulation repeatedly and only change one or several parameters. With the code I linked to you can do this from a single script (as shown in the code), but the same approach would in principle also work if you run every new parameter in its own script. The details are a bit more complicated unfortunately and I am not sure it is worth it.

We plan to make these kind of things easier in the future (see e.g. this github project), but progress is quite slow unfortunately.

Ziaeemehr · 8 September 2020 15:57

Awesome!, Thank you, That’s what I wanted.

Jul94 · 12 February 2021 13:00

My case is a bit different from this one but also involves multiple runs in the standalone mode. The code is a slightly modified version of the Brian2 and Python3 port of the paper of Diehl&Cook2015, “Unsupervised learning of digit recognition using spike-timing-dependent plasticity”, which has been mentioned a few times in this forum (see this and this). However, I think a general outline of the code will suffice:

set_device('cpp standalone', build_on_run=False)

# Define auxiliary functions

# Load dataset

# Set parameters and equations

# Create network and connections

# Run the simulation
net = Network()
for obj_list in [neuron_groups, input_groups, connections, rate_monitors, 
spike_monitors, spike_counters]:
    for key in obj_list:
        net.add(obj_list[key])
...
# loop until the whole dataset has been covered
while j < (num_examples):
  # several run() calls for resting and presentation times

device.build(compile=True, run=True, debug=True, clean=False)

# Save results

# Plot results

device.delete(code=False)

So the key idea here is that we have a (large) network and run it in a loop, but not because we are trying different parameters/network configurations, but because we are presenting the different input patterns to the network.

When trying to run the simulation in standalone mode, I noticed the execution speed gradually slowed down as the simulation progressed. I understand that is due to what @mstimberg explained in a different thread:

Strange redefinition error upon Brian upgrade to version 2.4+git: handling of multiple runs in standalone not consistent with previous Brian versions

The problem is the compilation stage. What standalone basically does to support multiple runs, is that it generates the whole network code over and over again for each run (one reason is that external constants are hardcoded into the code, so to take this into account we have to write the code again). For complex networks this means that there is a potentially huge number of code files that need to be compiled. This not only takes a long time, but depending on the compiler and OS it can also simply run into a “command line too long issue”. Finally, on linux we run make -j , which means that it uses independent parallel processes to compile all the source files. If there are many of them, this can take a lot of memory and make things crash. I would therefore consider changing the devices.cpp_standalone.extra_make_args_unix preference to something like ['-j', '8'] (for 8 processes, adapt this to the number of processors on your machine). I thought we had already changed this, but apparently not…

Finally, there’s a much faster technique where you only compile a single code and then change parameters “under the hood”, but this approach is not yet exposed by Brian. If you are interested in that approach you should have a look at this gist .

I have yet to try the approach from that gist (at first I thought it’s not exactly the same situation, but it might work nonetheless), but on a more general note, I am wondering what is the best workflow in that scenario. That is, what is the best approach when you want to speed up a simulation where you feed an entire dataset to a large and fairly complex network (without any other change of parameters)? Any ideas?

Note: This is my first attempt at implementing a complex simulation in Brian, so I have still quite a few things to try and learn. For instance, one of my next steps will be to use a proper event-based dataset (e.g. N-MNIST) instead of standard MNIST, I think that might simplify some things about the original structure of the code. But of course any other ideas and advice is really appreciated!

mstimberg · 16 February 2021 14:07

If the speed goes down while the simulation is already running, then this is not the compilation issue, since it is no longer compiling files at that point. There are two main reasons why a simulation can slow down at some point: 1) your simulation is filling up the memory of your machine (in particular things like StateMonitor recording a variable for every neuron at every time step) and it therefore starts using swap space on the harddisk; 2) your simulation is generating many more spikes and therefore becomes more computationally demanding.

Regarding the thread that you quoted and the general question of running loops in standalone mode (regardless of whether the loops are over parameters or inputs): if you need to do this for many loop iterations and/or in a “programmatic” way (e.g. “if this input does not evoke any spikes, increase the firing rate and present it again”), then you are probably better off to use runtime instead of standalone mode. It is a little slower in general, but you are most likely saving a lot of time by compiling less. Also, you can make use of store/restore to reset the simulation to an initial state instead of running the simulation without input to make its variables go back to their resting state.
If you want to stay with standalone mode, then the general recommendation is to try to use the smallest number of run calls possible. You can often reduce the number quite significantly by using a combination of TimedArray and expressions referring to time. To give a rough example, instead of doing this:

stimulus_rate = [100*Hz, 200*Hz, 50*Hz, 100*Hz]
n_stimuli = len(stimulus_rate)
input_spikes = PoissonGroup(100, rates=100*Hz)
for idx in range(n_stimuli):
    input_spikes.rates = stimulus_rate[idx]
    run(200*ms)
    input_spikes.rates = 0*Hz
    run(100*ms)

you could do something like this:

stimulus = TimedArray(stimulus_rate, dt=300*ms)
input_spikes = PoissonGroup(100,
                            rates='stimulus(t)*int((t % 300*ms) <= 200*ms)')
run(n_stimuli*300*ms)

mstimberg · 22 February 2021 10:57

Hi. I think you might have replied to the wrong thread? This was about Brian-related questions how to run multiple simulations, etc. Maybe you could repost your questions to the topic you opened earlier: Unsupervised learning of digit recognition using spike-timing-dependent , there might be someone who either knows the answers or would be willing to work with you to figure them out. For very specific questions, you could also try to contact the authors of the article and/or the code.

alikhaledi · 18 December 2022 21:41

@mstimberg Is it possible to use both multiprocessing (such as here brian2/examples/multiprocessing at master · brian-team/brian2 · GitHub) as well as do repeated execution of a standalone simulation with changed parameters without recompilation (in here Brian 2: repeated execution of a standalone simulation with changed parameters without recompilation · GitHub)

Basically, I am trying to change two parameters, one through multiprocessing and one through repeated execution of a standalone simulation. Do you have an example of that?

mstimberg · 19 December 2022 13:04

Hi @alikhaledi,
there is a new mechanism for doing repeated/parallel standalone simulations, see see the discussion here: Running multiple standalone simulations [Request for Feedback] - #21 by mstimberg It is not released yet, but you can install and test it directly from the branch. Does that do what you are interested in? If not, please reply over in that discussion thread.

Topic		Replies	Views
Multiple epoch of running in standalone mode Support	7	574	2 October 2020
Running multiple standalone simulations [Request for Feedback] Development cpp_standalone , rff	38	1185	23 June 2025
Multiprocessing in standalone mode, poor speed-up Support performance	5	753	8 October 2020
Add stand alone mode with multiple network.run() Support	6	238	15 October 2022
Run in standalone mode and multithreading Support	5	742	24 November 2021

Multiple run in standalone mode

Related topics