Delay in continuous connections

Hi,

I’m looking to simulate a circuit with delays and compare a spiking model against a rate model.
There’s a simple interface for adding delay to spiking synapses, but I can’t see a straightforward way to do this for continuous connections.

It looks like there have been a few discussions related to this idea:

I’d like to add a +1 to the interest in seeing this feature implemented in Brian, and also to ask what the current best “workaround” is.

-Adam


Hi @adam,

After discussing this with @mstimberg, I ended up with some pretty bulky code. This implementation of the delay function as a NetworkOperation slows down simulations dramatically. I’m not sure it is useful to share because I’m really unhappy with the performance. If you come up with something fast and simple, please share.

-RTH

P.S. but if you insist, I can put my code here, of course :slight_smile:


Just adding to what @rth said: there’s also the option to implement this in the target language, e.g. in C++ for C++ standalone mode, as a user-defined function. This is certainly the fastest solution, but also the most difficult to write.
The NetworkOperation solution will always be relatively slow, but there might be some non-obvious ways to speed things up (using simple floating-point values instead of Brian quantities, avoiding array copies, …). If @rth is willing to share a simple example here, I could have a look to see whether there are ways to speed it up.
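To make those two speed-ups concrete, here is a small sketch (the group, variable, and function names are made up for illustration; appending an underscore to a variable name, e.g. G.v_, is Brian’s way of reading a state variable as plain floats without units):

import numpy as np
from brian2 import NeuronGroup, network_operation, run, ms

# Illustrative sketch: read state as plain floats (no unit bookkeeping)
# into a preallocated array, so no new array is allocated each step.
G = NeuronGroup(100, 'dv/dt = -v/(10*ms) : 1')
scratch = np.empty(len(G))      # allocated once, reused every step

@network_operation
def fast_op():
    scratch[:] = G.v_[:]        # unitless read into preallocated storage

run(10*ms)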

Hi Marcel, could you point me to info on integrating C++ functions in Brian?
If getting data into and out of C++ is relatively straightforward, and it amounts to writing something like a network operation but in C++, I could probably put that together in a reasonable amount of time.

Hi Adam, there’s some basic documentation here: Functions — Brian 2 2.4.2 documentation
Where it gets complicated is when you want this function to change model variables instead of simply returning a value. If this is in C++ standalone mode, you can make use of the fact that all variables are stored in globally available arrays and that you can get their names with something like device.get_array_name(neurongroup.variables['V']). There is a discussion thread that should give some more hints: User-defined functions
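To make the basic mechanism concrete, here is a minimal sketch following the pattern in the linked documentation (the function name and body are invented for illustration; this only covers the simple case of returning a value, not modifying model variables):

import numpy as np
from brian2 import implementation, check_units

# Illustrative user-defined function with a C++ implementation for
# standalone mode and a Python/numpy fallback for runtime mode.
@implementation('cpp', '''
double clip_value(double x, double x_min, double x_max) {
    if (x < x_min) return x_min;
    if (x > x_max) return x_max;
    return x;
}
''')
@check_units(x=1, x_min=1, x_max=1, result=1)
def clip_value(x, x_min, x_max):
    return np.clip(x, x_min, x_max)

Such a function can then be called from model equations like any built-in function, e.g. dv/dt = (clip_value(I, 0, 1) - v)/tau : 1.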
Let me know if you run into issues, this part of Brian still needs quite a bit of polishing.


@mstimberg just a sketch of my construction. Note that different connections in the system have different delays…

def step_delay():
    # write the current values of all delayed variables into one column of the buffer
    #                                                0         1         2         3           4           5
    step_delay.buffer[:,step_delay.windex] = hstack((Cx.CxE[:],Cx.CxK[:],Cx.CxP[:],Cx.CxI[:],ThRe.ThE[:],ThRe.ReI[:]))
    N = JRTRG.params['N']
    # each connection reads its source population's slice at its own delayed position
    Cx2Cx.CxE2CxP_delval = step_delay.buffer[0*N:1*N,(Cx2Cx.CxE2CxP_delint+step_delay.windex)%step_delay.maxdel]
    Cx2Cx.CxP2CxE_delval = step_delay.buffer[2*N:3*N,(Cx2Cx.CxP2CxE_delint+step_delay.windex)%step_delay.maxdel]
    Cx2Cx.CxP2CxI_delval = step_delay.buffer[2*N:3*N,(Cx2Cx.CxP2CxI_delint+step_delay.windex)%step_delay.maxdel]
    Cx2Cx.CxI2CxP_delval = step_delay.buffer[3*N:4*N,(Cx2Cx.CxI2CxP_delint+step_delay.windex)%step_delay.maxdel]
    Cx2Cx.CxE2CxK_delval = step_delay.buffer[0*N:1*N,(Cx2Cx.CxE2CxK_delint+step_delay.windex)%step_delay.maxdel]
    Cx2Cx.CxK2CxE_delval = step_delay.buffer[1*N:2*N,(Cx2Cx.CxK2CxE_delint+step_delay.windex)%step_delay.maxdel]
    TR2TR.ThE2ReI_delval = step_delay.buffer[4*N:5*N,(TR2TR.ThE2ReI_delint+step_delay.windex)%step_delay.maxdel]
    TR2TR.ReI2ThE_delval = step_delay.buffer[5*N:   ,(TR2TR.ReI2ThE_delint+step_delay.windex)%step_delay.maxdel]
    TR2Cx.ThE2CxP_delval = step_delay.buffer[4*N:5*N,(TR2Cx.ThE2CxP_delint+step_delay.windex)%step_delay.maxdel]
    TR2Cx.ThE2CxE_delval = step_delay.buffer[4*N:5*N,(TR2Cx.ThE2CxE_delint+step_delay.windex)%step_delay.maxdel]
    TR2Cx.ThE2CxK_delval = step_delay.buffer[4*N:5*N,(TR2Cx.ThE2CxK_delint+step_delay.windex)%step_delay.maxdel]
    Cx2TR.CxP2ThE_delval = step_delay.buffer[2*N:3*N,(Cx2TR.CxP2ThE_delint+step_delay.windex)%step_delay.maxdel]
    Cx2TR.CxP2ReI_delval = step_delay.buffer[2*N:3*N,(Cx2TR.CxP2ReI_delint+step_delay.windex)%step_delay.maxdel]
    # move the write index backwards, wrapping around the buffer
    step_delay.windex = (step_delay.maxdel-1) if step_delay.windex == 0 else (step_delay.windex - 1)

step_delay.maxdel is computed first: the maximum number of steps needed for the longest delay in the system. Then step_delay.buffer is initialized.

# maxdel initially holds the longest delay in ms; convert it to a
# number of time steps, with a margin of 10 extra steps
step_delay.maxdel = int( round(step_delay.maxdel*ms/defaultclock.dt)+10 )
step_delay.buffer = zeros((NVariables,step_delay.maxdel))
step_delay.windex = 0

In the synaptic model we use the _delval variables. For example:

inCxP2CxE_post = CxP2CxE * CxP2CxE_delval    : 1 (summed)
CxP2CxE_delval                               : 1
CxP2CxE_delint                               : integer (constant)

And finally I declare the NetworkOperation:

netstdel = NetworkOperation(step_delay)

Something like this. I also tried using multiple buffers in the step_delay function, but that was even slower.
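For reference, here is a stripped-down, self-contained sketch of the same buffer-plus-NetworkOperation idea for a single fixed delay (all names and values are illustrative, not the actual model above):

import numpy as np
from brian2 import *

# Illustrative sketch: delay source.x by 5 ms before it drives target,
# using a preallocated circular buffer updated by a network_operation.
defaultclock.dt = 0.1*ms
delay_steps = int(round(5*ms / defaultclock.dt))

tau = 10*ms
source = NeuronGroup(1, 'dx/dt = (sin(2*pi*10*Hz*t) - x)/tau : 1')
target = NeuronGroup(1, '''dv/dt = (x_delayed - v)/tau : 1
                           x_delayed : 1''')

buffer = np.zeros((delay_steps + 1, 1))   # one row per stored time step
write_idx = 0

@network_operation(when='start')
def update_delay():
    global write_idx
    buffer[write_idx, :] = source.x_[:]            # store the newest values
    read_idx = (write_idx + 1) % buffer.shape[0]   # oldest slot = delayed values
    target.x_delayed_[:] = buffer[read_idx, :]
    write_idx = read_idx                           # advance the write position

run(100*ms)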

I’m going to take a shot at a very simple demo notebook with a naive network_operation implementation for this today.

I haven’t looked into the C++ standalone option yet, but I want to make sure I’ve ruled out the simplest solution first (for my application something slow might be good enough).

I see @rth has a network_operation solution as well. I’m going to start from scratch, and then afterwards we can look for commonalities/differences; hopefully the notebook will be minimal enough to be easy to modify and give feedback on.

Haven’t got to the network_operation part yet, but here are some implementations of a 2D buffer which “rolls” along the 2nd dimension:

Google Colab notebook

scripts and helper functions under version control at GitHub - awillats/brian_delayed_gaussian

code reposted here: class definitions

from collections import deque 
import numpy as np

class DQRingBuffer(deque):
    '''
    NOTE: because columns are stored as tuples, indexing is less flexible
    than in the numpy version.

    Assumes the last value in the buffer is current, i.e. 0-delay
    - this is compatible with calling
        buff.append(current_val)
        current_val_also = buff.get_delayed(0) 
    - see more discussion at https://stackoverflow.com/questions/4151320/efficient-circular-buffer
    '''
    def __init__(self, buffer_len=100, n_channels=1):
        super().__init__([], maxlen=buffer_len)
        self.n_channels = n_channels

    def append(self, new_val):
        # validate the channel count before storing
        if self.n_channels > 1 and len(new_val) != self.n_channels:
            raise IndexError(f'ERROR, wrong size to append\n expected size {self.n_channels} got {len(new_val)}')
        if isinstance(new_val, np.ndarray):
            new_val = tuple(new_val)  # store each column as a tuple
        return super().append(new_val)

    def fill(self, fill_val):
        for i in range(self.maxlen):
            self.append(fill_val)

    def to_np(self):
        return np.asarray(self)

    def get_delayed(self, delay=1):
        if not 0 <= delay < self.maxlen:
            raise IndexError(f'delay {delay} out of bounds for buffer length {self.maxlen}')
        return self[-delay-1]

class RingBuffer:
    '''
    loose implementation of a 1D ring-buffer with numpy arrays
    - single channel, rolls along axis=0
    - consider something like collections.deque for a stricter implementation
    - see also Implementing a Ring Buffer - Python Cookbook: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch05s19.html
    - speed test for roll implementations: https://gist.github.com/cchwala/dea03fb55d9a50660bd52e00f5691db5
    '''
    def __init__(self, buffer_len=100, initial_val=None):
        self.buffer_len = buffer_len
        self.values = np.full(self.buffer_len, initial_val)

    def append(self, new_val):
        # shift everything one slot towards the start, then write at the end
        self.values = np.roll(self.values, -1, axis=0)
        self.values[-1] = new_val

    def last(self):
        return self.values[-1]

    def get_delayed(self, delay=1):
        if not 0 <= delay < self.buffer_len:
            raise IndexError(f'delay {delay} out of bounds for buffer length {self.buffer_len}')
        # an alternate implementation could cap the delay, so that
        # delay = -Inf returns the oldest value in the array:
        # delay = max(min(delay, self.buffer_len - 1), 0)
        return self.values[-delay-1]

    def fill(self, fill_val):
        self.values.fill(fill_val)

    def __getitem__(self, idx):
        return self.values[idx]

    def __str__(self):
        return f'buffer with len: {self.buffer_len}, \nvals: {str(self.values)}\n'
        
class RingBuffer_2D:
    '''
    loose implementation of a 2D ring-buffer with numpy arrays
    - only buffers along axis=1
    - could be made simpler by extending numpy.ndarray, which would
      eliminate the need to duplicate fill(), __getitem__(), and __str__()
    - consider something like collections.deque for a stricter implementation
    - see also Implementing a Ring Buffer - Python Cookbook: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch05s19.html
    '''
    def __init__(self, n_channels, buffer_len = 100, initial_val=None):
        self.n_channels = n_channels 
        self.buffer_len = buffer_len
        self.values = np.full((self.n_channels, self.buffer_len), initial_val)
    
    def append(self, new_column):
        #expands column to 2D 
        self.values = np.append(self.values, new_column[:,None], axis=1)
        self.values = self.values[:, 1:]
        
        # np.roll is about 2x slower than method above for medium-sized arrays
        # self.values = np.roll(self.values, -1, axis=1)
        # self.values[:,-1] = new_column

    def last(self):
        return self.values[:,-1]

    def get_delayed(self, delay=1):
        if not 0 <= delay < self.buffer_len:
            raise IndexError(f'delay {delay} out of bounds for buffer length {self.buffer_len}')
        # alternate implementation would
        # cap the delay, so that delay = -Inf returns the oldest value in the array.
        # delay = max(min(delay, self.size),1)
        return self.values[:,-delay-1]
        
    def fill(self, fill_val):
        self.values.fill(fill_val)
    def __getitem__(self, idx):
        return self.values[idx]
    def __str__(self):
        return f'buffer with {self.n_channels} columns, len: {self.buffer_len}, \nvals: {str(self.values)}\n'
code reposted here: testing the classes
# if __name__ == "__main__":
import time 

n_channels = 10
buff_len = 10000

np_buff = RingBuffer_2D(n_channels = n_channels, buffer_len=buff_len)
dq_buff = DQRingBuffer(buff_len, n_channels)
dq_buff2 = DQRingBuffer(buff_len, n_channels)
'''
The deque ring buffer is generally faster than the numpy 2D version
'''

ntest = 1000 #10000
t0 = time.time()

vec = np.arange(n_channels)

dq_buff.fill(vec*0)
dq_buff2.fill(vec*0)

# Test numpy 2D array as buffer
for i in range(ntest):
    np_buff.append( vec+i )
    _ = np_buff.get_delayed(1)

t1 = time.time()
num_t = t1-t0
print('numpy 2D test')
print(f'{num_t:.3f} seconds for {ntest} iterations\n')

# Test deque of tuples as buffer
t0 = time.time()
for i in range(ntest):
    dq_buff.append( vec+i )
    _ = dq_buff.get_delayed(1)
t1 = time.time()
dq_t = t1-t0
print('deque test')
print(f'{dq_t:.3f} seconds for {ntest} iterations\n')


# Test deque of tuples as buffer + time to convert to numpy
t0 = time.time()
for i in range(ntest):
    dq_buff2.append( vec+i )
    _ = np.array(dq_buff2.get_delayed(1))
t1 = time.time()
dq_conv_t = t1-t0
print('deque test w/ conversion to numpy')
print(f'{dq_conv_t:.3f} seconds for {ntest} iterations\n')



print( f'deque {num_t/dq_t:.2f} times faster')
print( f'deque {num_t/dq_conv_t:.2f} times faster with conversion back to numpy array') 

The upshot is that collections.deque [docs] is much faster than using a numpy 2D array, especially for long buffers (up to 150x faster in some tests). It’s also worth noting that C++ has a deque implementation in its standard library (std::deque), which could be relevant for an eventual C++ version.

If you want to use a numpy 2D array (because the indexing is much more flexible this way), you should “manually” implement the roll piece:

    def append(self, new_column):
        #expands column to 2D 
        self.values = np.append(self.values, new_column[:,None], axis=1)
        self.values = self.values[:, 1:]

rather than using the built-in np.roll() function.

Update:
while deque is much faster for getting single values from the buffer, this speed difference is quite small relative to the total runtime of a Brian2 simulation (at least for the very small networks and ~100-sample buffer lengths I tested).

Moreover, if you have to convert back to numpy arrays to use advanced indexing, the deque approach may end up being slower.


In other news, I got these plots by combining Marcel’s network_operation solution with this custom numpy ring buffer.

[figure: buffered results match the output of the voltage monitors]

[figure: signal B is delayed relative to signal A]

The script gaussian_delay_example_netop_multi_population.py now supports heterogeneous delays.

I also added a naive “development roadmap” / sub-feature request list to the README.md for the repo

@adam, a few notes from the back of my mind (I’m traveling now, so only very briefly).

  1. all FIFO solutions (deque, for example) require memory allocation at each computational step, even at the level of C/C++, mostly because a FIFO doesn’t know the required buffer depth (it may even be variable for a FIFO). Memory allocation is always slow. Therefore, if we know the maximum delay, it is better to allocate all the required memory up front and use a cycling pointer/index. It should be much faster than any FIFO solution (see the sketch after this list).
  2. np.concatenate([ G.v[:] for G in all_groups ]) looks pretty elegant, but you need to be sure you know where each group is located in the vector; otherwise, it’s easy to mess up.
  3. let’s think in terms of a memory controller and a fairly big model with long delays. The whole 2D “delay” table may not fit into cache memory. In this case, it may be faster to have independent buffers for each delay line, so that reading is sequential - the most effective scenario for a memory controller.
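Here is the sketch referenced in point 1: preallocate once and move a cycling index, so nothing is allocated during the simulation loop (class and method names are illustrative):

import numpy as np

# Illustrative sketch: preallocate for the maximum delay, then only move
# a cycling write index; no per-step memory allocation.
class PreallocatedDelayLine:
    def __init__(self, n_channels, max_delay_steps):
        self.size = max_delay_steps + 1          # +1 slot for the current value
        self.values = np.zeros((n_channels, self.size))
        self.idx = 0                             # cycling write position

    def push(self, column):
        self.values[:, self.idx] = column        # overwrite the oldest slot

    def get_delayed(self, delay_steps):
        # values written delay_steps steps ago sit delay_steps slots "ahead"
        return self.values[:, (self.idx + delay_steps) % self.size]

    def advance(self):
        self.idx = (self.idx - 1) % self.size    # write pointer moves backwards

Each simulation step would then call push(), any number of get_delayed() reads, and finally advance().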

Sorry for the squeezed reply; a more detailed analysis of your very interesting and thoughtful posts, as well as of @mstimberg’s excellent comment on parameter handling, will follow a bit later.


I agree on all of these points! I’m hoping the repository I’m working on will be a good way to explore and communicate those tradeoffs.

In general, I’m marching (slowly) from solutions that are simple to construct and index into (i.e. a dense numpy array) towards more tailored, memory- and speed-efficient solutions. My hope is that this will eventually result in an efficient C++ implementation, but we can stop at any point along the way once a threshold of usability/efficiency is reached.

I’m also trying to gather up as many of these ideas as possible in the README.md for the repo (including ideas like a “rolling index” and various sparse storage options like “lists of lists” and “COO” coordinate lists).

As to point 2 about keeping track of groups: currently I have the luxury of working with small, relatively abstract networks, where knowing the number of populations and neurons per population is enough to index which signals come from where. However, I might try something like a dictionary (with population names as keys) to balance elegant access to recorded variables with more robust-to-human-error bookkeeping of which population is where (sketched below).
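As a minimal sketch of that dictionary idea (the population names and sizes below are made up):

import numpy as np

# Hypothetical bookkeeping: map each population name to its slice of the
# stacked state vector, so slices can't silently drift out of sync.
group_sizes = {'CxE': 100, 'CxP': 100, 'ThE': 50}   # name -> neuron count

slices, start = {}, 0
for name, n in group_sizes.items():
    slices[name] = slice(start, start + n)
    start += n

state = np.zeros(start)            # stacked vector of all populations
cxp_values = state[slices['CxP']]  # read one population back out by name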

EDIT: @rth, the first time I read your comment I misinterpreted it as being about sparse storage, but now I think point 3 refers to orienting the storage matrix such that the network state at a single point in time can be sliced out (with that vector stored contiguously in memory?).

If so, that’s not something I’d considered, and I’ll add it as a consideration under the “indexing conventions / storage” section.


The scripts in the repo are quite messy at the moment, as I’m working on visualizing the cross-correlation between signals. Part of my selfish interest in understanding how to simulate delayed interactions in Brian is to study how network properties affect when we can recover connectivity. The plotting is also overly complicated because I’m experimenting with bundling Brian data into dataframes and then using plotly for interactivity - I’ll hopefully post an interactive demo here by the end of the day.


here are some quick whiteboard sketches working through ideas for sparse storage

here’s an interactive plot I generated from a Gaussian network with delays:
https://htmlpreview.github.io/?https://github.com/awillats/brian_delayed_gaussian/blob/main/figs/gaussian_combo.html
you can subselect traces by clicking on the legend, and zooming on the x- and y-axes is linked across the relevant subplots

[figure: non-interactive preview of the interactive plot]

@adam, I’m sorry, I promised you some explanation of an optimal construction for the delay procedure. Here is what I thought might be a good idea to try.

Basic cyclic buffer/array

Suppose we have five variables we want to delay, and let all delays be the same. We can create a buffer with shape (5, size), where size = delay/dt. We track the position where new data is written with an index idx (white arrow), and compute the position where delayed data is read as (idx+delay)%size (black arrow). Note that the maximum representable delay is size-1. Something like this (see the worked example below):
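A tiny worked example of this index arithmetic, with made-up numbers:

size = 5      # buffer depth, so the maximum representable delay is size - 1
delay = 3     # delay in steps
idx = 0       # write position (white arrow)

for step in range(6):
    read = (idx + delay) % size   # read position (black arrow)
    print(f'step {step}: write column {idx}, read column {read}')
    idx = (idx - 1) % size        # write pointer moves backwards each step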

Now imagine that we have different delays: we need to read from different positions in the buffer.

That is how my implementation works, and it would have been fine in 1990, just before desktop CPUs with cache memory were introduced!

Problem with a single table for all variables

It’s simple to see what happens when the table is bigger than the data cache: the cache controller is really confused by a huge array with scattered accesses.

Of course, nowadays cache controllers are clever and can actually predict the next read position in this situation. However, we can make the cache controller’s life much easier.

We can use a separate table for each variable.

I hope it is now obvious where I’m going, right?

We can force the cache controller to be very efficient by creating independent tables and indices. This keeps only valid data in the cache - the places to read and write - while memory pages are slowly transferred to main memory as the CPU computes the next step of the simulation.
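A minimal sketch of this per-variable layout (the names and delay values are illustrative):

import numpy as np

n_neurons = 100                                # illustrative population size
delay_steps = {'CxP2CxE': 12, 'ThE2CxP': 40}   # illustrative delays, in steps

# one small contiguous buffer and one cycling index per delay line
buffers = {name: np.zeros((n_neurons, d + 1)) for name, d in delay_steps.items()}
indices = dict.fromkeys(delay_steps, 0)

def step(name, new_column):
    '''Write the newest values and return the delayed ones for one line.'''
    buf, idx = buffers[name], indices[name]
    buf[:, idx] = new_column
    delayed = buf[:, (idx + delay_steps[name]) % buf.shape[1]]
    indices[name] = (idx - 1) % buf.shape[1]   # write pointer moves backwards
    return delayed

Each delay line then touches only its own small, contiguous block of memory on every step.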

What do you think? I haven’t tried to implement that, so I really don’t know whether it will be faster or not.

One more thought on that.

My implementation of delays as a single huge table can work if the delays are different, but not too different. If the chunk of the table that needs to be read fits into the cache, this should give optimal performance.

In other words, there could be a few tables for short, medium, and long delays.
It seems the same approach as in the lookup-tables implementation (code is here) can be used.

Hi @rth and @adam. These are some very relevant thoughts about performance! The tricky thing is always that it is easy to optimize for special cases (homogeneous delays, a few different delays, …), but putting together a system that works optimally (or close to it) for all cases is hard…
But to avoid any duplicated effort, just a few comments implementation-wise. I think the best strategy for implementing the feature would be to attach it to the current SpikeQueue implementation, which already handles basic propagation with delays (and at least has an optimized implementation for the homogeneous case). The idea would be that a SpikeQueue could not only store/transmit information about the origin of a spike, but additionally a “payload” value. If I am not mistaken, this is the way connections with continuous delays are handled in NEST, for example. Then, there is quite a bit of optimization that can be done for the SpikeQueue in the direction that you point out, and also with regard to parallelisation with OpenMP. Any work on this would then benefit not only connections with continuous delays, but also the much more common use case of spike-based connectivity. I think this work could also benefit from similar considerations that have been implemented in Brian2CUDA, which we’ll detail in a preprint in a few weeks.
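Purely as a conceptual toy - this is not Brian’s actual SpikeQueue API - the “payload” idea could look something like this:

from collections import defaultdict

# Conceptual toy: a queue keyed by arrival step that stores
# (synapse_index, payload) pairs, so delayed continuous values could ride
# the same mechanism as delayed spikes.
class PayloadQueue:
    def __init__(self):
        self.slots = defaultdict(list)
        self.step = 0

    def push(self, synapse_idx, payload, delay_steps):
        self.slots[self.step + delay_steps].append((synapse_idx, payload))

    def pop_current(self):
        events = self.slots.pop(self.step, [])
        self.step += 1
        return events

push() schedules a (synapse, value) pair to arrive delay_steps later; pop_current() drains everything due at the current step.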

@mstimberg, are you suggesting that we operate on Brian’s open heart (i.e., its original core code)?!
I thought we were playing outside the main body, in the playground of external modules… :nerd_face:

Well, it may be a fun side project for me, without a big commitment, at this moment. However, please do not expect too much. I’ll be happy to work with anyone who takes on all the heavy lifting of coding.