Long compiling time for a complex network

First, cordial thanks for updating brian2cuda to work with the latest Brian 2!

Description of problem

We are working on plasticity in a multi-object network and have apparently been hit by Brian2CUDA's slow compile times. This issue is likely partially answered here: Multiprocessing in standalone mode, poor speed-up - #5 by mstimberg . We have multiple Linux environments (laptop, workstation, cluster, container). I would love to hear your opinion on how to make this faster. I am specifically considering trying a RAM disk, but any other hints, such as compiler flags or brian2cuda options, would be really helpful.

I am asking because the example below has 5 neuron groups and 29 synapse objects, while we are aiming for 56 neuron groups and 153 synapse objects.

Minimal code to reproduce problem

This example runs the same network three times with one parameter changing. The actual run times were 30, 30, and 14 seconds for 0.2 s of biological time, while the whole process took 8.4 hours (sic!).

Here is the full standard output in a text file. It contains a lot of information from brian2cuda, including a description of the network at the beginning of each simulation. The first two simulations were run in parallel.

brian2cuda_compiling.txt (348.3 KB)

If you search for the text “Cortical Module initialization Done”, you will find approximately the point where the CUDA process starts.

This particular example uses Python 3.14, Brian 2.10.1, brian2cuda 1.0b1, and CUDA toolkit 2.6 on WSL2/Ubuntu with 64 GB RAM on a multi-core laptop.

Full traceback of error (if relevant)

No errors, no warnings, runs like a charming snail :slight_smile:

Hi @sivanni ,

I think you are running out of RAM during compilation. I can’t say for sure, it’s been a while since I touched brian2cuda, but I do remember some cases where I had that problem. There are a few related issues on GitHub; the most relevant is probably this one (not very informative other than confirming that the issue is known :smiley: ): Investigate excessive RAM usage · Issue #119 · brian-team/brian2cuda · GitHub

So from a user perspective, I would:

  1. Looking at your log file, I see very small neuron and synapse objects (on the order of 100 elements per object). Not sure if this is just for testing; how large will they be in the simulations you want to run? And how is the simulation time if you use C++ standalone instead? Unfortunately, Brian2CUDA doesn’t parallelize over different objects (it could be done, though, see Making use of concurrent kernel execution · Issue #65 · brian-team/brian2cuda · GitHub ), so the state updates of your objects are run sequentially anyway. For many objects, C++ standalone might therefore be just as fast. It depends on your model and the sizes of the individual objects, of course.
  2. Check RAM usage during compilation and see whether that’s the issue. If you can move to a machine with more RAM for compilation, that could be an option. You could even compile on a machine without GPU by specifying the GPU architecture via prefs if I remember correctly.
  3. The problem is that Brian2CUDA generates one source file per neuron/synapse object, and compilation gets much slower with many source files. I think the source files are compiled in parallel, but if you are running out of RAM, compiling sequentially might actually be better (not great, but probably less than 8 hours …). To turn off parallel compilation, I think you can set `devices.cpp_standalone.extra_make_args_unix = ['-j1']` (I think that also propagates to the nvcc compilation; worth trying).
  4. A workaround could be to reduce the number of objects in your model. That of course depends on the model itself. But e.g. instead of two neuron groups of the same type with, let’s say, two synapse objects (one with inhibitory and one with excitatory synapses), you could define both neuron groups in one object and use subgroups (indexing) to define the pre and post subsets. The same goes for the synapses object: it could implement both the excitatory and the inhibitory effect with scalar multipliers, and you set the scalar to 0 for the subset of synapses where the effect should not apply. It’s a bit hacky, but an option.

So these are just a few ideas off the top of my head. If you want to follow up on any of those and need some pointers, I can have another look and try to help out.
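As a concrete sketch, the preferences from points 2 and 3 could be set together like this (preference names as discussed in this thread; double-check them against the docs of your Brian2CUDA version):

```python
from brian2 import prefs
import brian2cuda  # registers the cuda_standalone preferences

# Point 3: compile sequentially to limit peak RAM usage (this make flag
# should also propagate to the nvcc invocations; worth verifying)
prefs.devices.cpp_standalone.extra_make_args_unix = ['-j1']

# Point 2: fix the GPU architecture so compilation can run on a machine
# without a GPU (compute capability 8.6 is just an example)
prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
prefs.devices.cuda_standalone.cuda_backend.compute_capability = 8.6
```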

From a developer perspective, there are a few things that could be done for these cases, e.g. merging the source files. We had a student who did that manually at some point, and it reduced compile time to 1/5 in our Mushroom Body benchmark: Optimize compile times · Issue #179 · brian-team/brian2cuda · GitHub . That issue also has a few other hints. One could also try to optimize the Makefile, or automatically fall back to sequential compilation when running out of RAM (though that would need some estimate of how much RAM is needed).

Dear @denisalevi ,

Thanks for your prompt response! I think I have already made good progress:

  1. The true compile time was not 8 hours but 40 minutes (about 12 seconds per object). My laptop went to sleep (I know… )

  2. It was not a RAM issue

  3. When I moved compiling to the CPU with these flags:

```python
b2.prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
b2.prefs.devices.cuda_standalone.cuda_backend.compute_capability = 8.6
b2.prefs.devices.cuda_standalone.cuda_backend.detect_cuda = True
b2.prefs.devices.cuda_standalone.cuda_backend.gpu_id = 0
```

the compile time dropped to some 2.5 seconds per object, so the whole compilation now takes about 8 minutes. I think we can well live with this :slight_smile:

If the I/O turns out to be slow, I might try a RAM disk. What do you think?
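For reference, a RAM disk is easy to try on Linux with tmpfs. The mount point and size below are illustrative:

```shell
# Sketch: create a RAM-backed tmpfs and build the standalone project there
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
```

You could then point the generated project at it from Python, e.g. `b2.set_device('cuda_standalone', directory='/mnt/ramdisk/brian_build')` (the directory path is an example). Whether it helps depends on whether compilation is actually I/O-bound rather than CPU- or RAM-bound.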

Note that the brian2cuda documentation at Configuring the CUDA backend — Brian2CUDA 1.0.0 documentation mentions changing prefs.devices.cuda_standalone.cuda_backend.runtime_version, but this preference was not recognized and gave an error.

Note also that, unlike what is mentioned in Brian2CUDA specific preferences — Brian2CUDA 1.0.0 documentation, it did not matter whether .detect_cuda was True or False; both seemed to compile equally fast. I did not actually try to run the simulations with detect_cuda = False, though. With detect_cuda = True it ran smoothly.