Python crash or random computer reboot

Hello! I have been working with brian2 for almost 3 years now and I love it. Unfortunately, I have had some recurring problems since I started running many parallel simulations. At first I was unsure whether this is a brian2 problem, but after trying so many other things and never running into problems with other programs, I think there definitely is a possibility:

I run brian2 simulations of a neuronal network in parallel on the different cores of my computers (through some bash code distributing simulations with the correct parameters from a list). After a while (this could be 30 minutes, a few hours or a few days, the timing is very random) either python terminates itself or the computer reboots itself without any traceable problem in the logs. This happened many times on two different machines. The simulations then do not continue and I have to restart the processes. Sometimes there is a “Segmentation fault (core dumped)” message in the python logs, but not always. Here are some specs and things I have tried:

I work on multiple linux machines, the two which had the problems so far are:

  • one with x86_64 architecture with 20 CPUs (12th Gen Intel Core i7-12700KF), 64 GB RAM and an Nvidia GeForce RTX 4090 (AD102). I run 18 parallel simulations on this machine.
  • one with x86_64 architecture with 32 CPUs (Intel Core i9-14900K), 192 GB RAM and Intel UHD Graphics 770 (Raptor Lake-S GT1). I run 30 parallel simulations on this machine.

Simulations are run with Cython, and every simulation of 3 minutes of biological time takes about 4 minutes of wall-clock time. On the second computer, when 30 cores are running simulations, those cores are at 100% utilization but only 30 GB of the 192 GB of RAM is used.
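For concreteness, the launch pattern described above is roughly equivalent to this Python sketch (the real setup uses bash; the function names here are hypothetical stand-ins):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def simulate(params):
    # Stand-in for the real Brian 2 script; in the actual setup each run
    # returns its output measures and its process exits, freeing all memory.
    a, b = params
    return a + b

def run_batch(param_sets, n_workers):
    # One worker process per core; on the first machine n_workers would be 18.
    # The "fork" context keeps this sketch self-contained on Linux.
    ctx = mp.get_context("fork")
    with ProcessPoolExecutor(max_workers=n_workers, mp_context=ctx) as pool:
        return list(pool.map(simulate, param_sets))
```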

I tried:

  • Memtest: both computers passed, so there does not seem to be a hardware memory problem
  • a stress test: both computers passed
  • monitoring heat: both computers maintain normal temperatures, even when running for weeks
  • a clean Rocky Linux installation, clean Python installation, clean brian2 installation, clean gcc installation, etc.

Besides this, I have the problem that the Cython cache grows a lot with this many simulations, but building clear_cache('cython') into the code causes errors because of the parallel simulations. Could it be that other things in brian2 also slowly build up or cause memory problems after a random amount of time? Or do you have any other ideas about how and whether my way of working with brian2 could cause these problems? The type of models I run can be found here: Doorn, Nina (UT-TNW) · GitLab
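A sketch of the kind of per-process cache isolation that might avoid the clear_cache conflict, since each of my simulations runs in its own Python process anyway (the Brian preference line is my assumption and shown commented out; it should be checked against the docs):

```python
import shutil
import tempfile

def run_with_private_cache(simulation):
    # Give this process its own cache directory, so deleting it afterwards
    # cannot race with a compilation happening in a parallel simulation.
    cache_dir = tempfile.mkdtemp(prefix="brian-cython-cache-")
    try:
        # Hypothetical: point Brian at the private cache before building
        # the network (preference name unverified):
        # from brian2 import prefs
        # prefs.codegen.runtime.cython.cache_dir = cache_dir
        return simulation()
    finally:
        # Safe to delete: no other process compiles into this directory.
        shutil.rmtree(cache_dir, ignore_errors=True)
```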

Let me know whether I can provide any additional information!

Thanks in advance :slight_smile:

Hi @NinaD. From what you state I would have guessed that it is a memory issue, but from your observations that does not quite seem to be the case. Also, if I understand correctly, you run each new simulation in a new process, i.e. even if Brian had some memory leak, this shouldn’t matter much since the memory would be freed every few minutes. A segmentation fault could be due to some error in Brian (or in Python/Cython, but less likely…), but I don’t quite see what would make the computer restart… And you’ve already checked the most important things that I’d have suggested. Maybe it is an issue in the bash script, e.g. it somehow spawns too many processes? Could you give some more information on how you launch the processes and what is the difference between the simulations that you run (i.e., what kind of parameters are you setting individually for each run?). And I assume all this was with the latest Brian version, i.e. 2.5.4?

Obviously it would be important to fix the issue you are seeing, but your use case also seems to be perfect for the new feature discussed here: Running multiple standalone simulations [Request for Feedback]
I will do a new Brian release later this week, and this feature will be included.

Hi @mstimberg
Thank you for the quick response.

I should be more specific, there are two cases:

  • the first (which I do not use anymore) runs parallel simulations from Python with the simulate_for_sbi function from Mackelab's sbi Python package (sbi), where I set num_workers to the number of cores I use. The simulator then uses brian. In this case, the computers don't reboot, but Python terminates itself after some random amount of time.
  • in the second case, I use a way to run parallel simulations on multiple separate Linux machines. The computers all ssh to a separate location, the main computer, where there is a list of simulations (with different parameters) that need to be done. They then take the top simulation and delete it from the list (of course there is some time management to make sure they do not interfere with each other). Every computer has a different number of terminals (depending on the number of CPUs it has) from which this ssh action takes place. The list item is passed to a bash script in every terminal, which starts Python with a pre-defined script and the correct parameters. In this way, I can efficiently use all the available Linux machines to perform the simulations I need. In this case, two of the computers sometimes reboot themselves after some time. This rebooting doesn't happen when doing the same thing with non-brian2-related Python scripts that do some arbitrary heavy computation, like filling matrices.
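The "take the top simulation and delete it from the list" step is essentially the following (a Python sketch with an explicit file lock standing in for my actual time management; the path handling is hypothetical, and flock() is only reliable on a local filesystem, not over NFS):

```python
import fcntl

def pop_job(list_path):
    """Atomically remove and return the first line of the job list
    (or None if the list is empty)."""
    with open(list_path, "r+") as f:
        # Exclusive lock: concurrent workers block here instead of racing.
        fcntl.flock(f, fcntl.LOCK_EX)
        lines = f.read().splitlines()
        if not lines:
            return None
        # Write back everything except the first line.
        f.seek(0)
        f.truncate()
        if len(lines) > 1:
            f.write("\n".join(lines[1:]) + "\n")
        return lines[0]
```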

The parameters are neuron, synapse and network characteristics such as conductances or connection probabilities. I use the simulations for parameter optimization (with sbi). The only things temporarily saved to the computer are 9 output measures per simulation (so no time series or anything like that). These results are then sent back to the main computer and deleted locally.

I have looked at standalone mode before for speed-up. However, I am unsure whether it will work, because I do additional Python computations to get the output measures. And I am unsure whether all the parts of my model will work in standalone mode (stochastic heterogeneous synapses with different synaptic delays, etc.).

Yes I have the latest version of brian2 :slight_smile:

The reason I recently started thinking it might have something to do with brian2 is that I got the “segmentation fault (core dumped)” error again and found this: python - Error: Segmentation fault (core dumped) - Stack Overflow which suggests a third-party extension module working with C might be involved in such an error.

I see… This is a bit frustrating, since I don’t really know how to debug this. Regarding the reboots, it might be helpful to make logs persist over reboot (https://www.redhat.com/sysadmin/store-linux-system-journals) and then run something like journalctl -b -1 -n that should have the reason for the reboot somewhere in the log (if it is not something completely external like a power failure – but that shouldn’t be Brian-specific :wink: ).

I am still a bit intrigued by the fact that the Cython cache dir keeps growing. Normally this shouldn't be the case, since the code should be reused. Could you (again, sorry) give more detail on how you set the parameters that differ between the simulations in your model? E.g., do you “hardcode” parameters into equations, or do you set them as external constants (as in g_max = ...; group = NeuronGroup(..., "dg/dt = g_max*....")), or as parameters of your group (e.g. group = NeuronGroup(..., """g_max : siemens..."""); group.g_max = some_value)?
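The reason I ask: as far as I remember, the Cython cache is keyed on a hash of the generated source code, so every distinct hardcoded value produces new code and therefore a new cache entry, whereas an external constant leaves the generated code identical between runs. A plain-Python illustration of the difference (no Brian needed; the equation strings are simplified):

```python
def equations(hardcoded_tau_ms=None):
    # With a hardcoded value, the value becomes part of the code itself,
    # so each distinct value would trigger a fresh compilation:
    if hardcoded_tau_ms is not None:
        return f"dv/dt = -v / ({hardcoded_tau_ms}*ms) : 1"
    # With an external constant tau, the code is the same for every run
    # and the compiled module can be reused from the cache:
    return "dv/dt = -v / tau : 1"

# Different hardcoded values -> different source -> a new cache entry per run:
assert equations(10) != equations(20)
# External parameter -> identical source across runs:
assert equations() == equations()
```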

Do you mean computations during the run with something like network_operation? If you only do Python calculations after the run, then this would be compatible with standalone mode.

All this should work just fine with standalone mode. Standalone’s limitations are mostly related to whether you can express all initial values without basing them on things that haven’t been executed yet. I.e.

syn = Synapses(...)
syn.connect(..., p=0.1)
# do some calculation with len(syn)

wouldn’t work in standalone mode, since the synapses will only be created when you run the simulation. You can work around these limitations most of the time, but of course the devil is in the details. Let me know if you want to try it out and need help with anything. For your use case, the new mode in the fresh-off-the-press 2.6.0 release to run identical simulations with different parameter values would be very relevant, I think: Computational methods and efficiency — Brian 2 2.6.0 documentation