Best practices for storing time series + metrics across parameter sweeps

(not sure whether Support is the right category since this is less about the features of Brian itself)

I’ve been using Pandas DataFrames lately, especially to store summary statistics from Brian runs.
As I vary multiple parameters of the simulation I record those across columns of a data frame, and then also record output metrics as columns. This “flattens” high-dimensional parameter sweeps so each row is a single simulation run, and then I can map these back out to say a 2D slice of parameter space afterwards with panda’s functions like groupby(). The result looks something like this:

Challenge

I’d like to tie the summary statistics to the primary time-series data that’s associated with it to more easily cross-reference results and verify the summary measures are quantifying what we’d expect. But this introduces time as a new index, which seems “clunky” to make work with dataframes.

Options I’m considering

  1. storing the time series as it’s own DataFrame, with time as the “instance index”/row and channels as the “feature index”/column. Then embed this time series as a new column of the dataframe
    a. either including the dataframe itself by nesting it
    b. or referencing an index / filename for where this data is stored so the higher-level dataframe doesn’t get cluttered

  2. using SKTime which is a machine learning library which seems to focus on time-series analysis and is built around extending pandas dataframes:

    Sktime is designed to work with timeseries data stored as nested pandas DataFrame objects. Similar to working with pandas DataFrames with tabular data, this allows instances to be represented by rows and the feature data for each dimension of a problem (e.g. variables or features) to be stored in the DataFrame columns.

  3. Creating a (more complicated) nested hierarchical data structure and saving it as hdf5

    • incidentally, this is the storage format neurodata without borders uses
    • in some ways this seems like the “right solution” but a potentially complicated one
  4. What I used to do in MATLAB was to construct N-dimensional tensors and keep track of which dimension corresponded to which parameters / time index etc.

    • this is closest to how I think of the data abstractly
    • but seems potentially quite error prone as you have to keep track of indices. And lengths of dimensions need to be carefully compared across instances
    • it’s also quite difficult to visually inspect the data file (in a text editor, excel etc.)
    • but, 2D slices of parameter grid-sweeps can be indexed in straightforward ways

Feedback

  • Are there any commonly used ways to do this within the Brian community?
  • Which of the above approaches seems like the best path forward?
  • Are there other options worth considering?
2 Likes

@adam, a while ago, I stumbled on a similar problem with a bit of complication, that my simulations are pretty long (up to a few days in model time :open_mouth: ). Of course, no one memory can hold even spike times for a whole run, say nothing about time series of voltages or whatever one needs from simulations. The simulator must be stopped every while, intermediate results should be saved, and all buffers must be cleared. So for this kind of data, we need a suitable file to hold chunks of data and keep track of each fragment. I’ve tried Pandas and raw hd5 for this, but both weren’t up to the job. Specifically, hd5 is pretty fragile and doesn’t have enough redundancy to restore data structure when the simulation crashes with bugs and a hd-file doesn’t close properly. I lost 3 days run on this unfortunate ‘feature’ of hd5. I finally figured out that it isn’t a highly complex task and wrote my one python module to hold this kind of data.

But this isn’t the end. Sorry for the long post. My models have lots of parameters, so I thought it may be handy to save parameters for each simulation along with very general statistics of the result, some graphs (in the form of pictures), and my comment on the run. This may be an excellent job for Pandas frame or hd5, but after some tests, I landed to sqlight and general SQL queries. Again, I wrote a wrapper to have a tree-like structure to hold parameters. It got complicated here because I wanted to provide some help-line for each parameter and change parameters through CLI arguments (you see, I’m lazy)! So what I’m using now looks like that:

1. Default configuration.

stkdefault = """
/{
Some general words about the model
The model was coded by me
}

/network/{
    /geom/{
        /shape = 'H'  ; H-hexagonal or S-square
        /x     = 7    ; number of columns
        /y     = 16   ; number of raws
    }
    /gj/{
        /geom/{
            /p     = 0.3    ; gap junction probability
            /unit  = 'n'    ; if n - probability of connection per neuron. it 'c' - probability of connection per pear of neurons within maxd distance
            /maxd  = 1.5    ; gap junction max distance
            /dir   = None   ; gap junction direction (random if None)
        }
        /r     = 700.       ; gap junction resistance
    }
    /syn/{
        /gsyn      = None  \; synaptic conductance: 
                           \; if positive - set for everyone, 
                           \; if negative - set the portion of minimal synaptic conductance for each neuron,
                            ; if None set to 1/3 of the  minimal synaptic conductance for each neuron.
        /rand      = False  ; If ture - gsyn randomly init between 0 and /syn/gsyn value
        /prenorm   = True  \; If true the total synaptic conductance will be set to /syn/gsyn value for each neuron in the **beginning**! 
                            ; It is active only in model building. For dynamical normalization see /network/btdp
        /geom/{
            /pmax  = None  \; peak of synaptic probability for exp or exp2 distribution.
                            ; If None closest LGN cells are connected to each rGC
            /sig   = None   ; sigma for exponential distribution. Works only if /network/syn/geom/pmax is not None
            /sig2  = None   ; sigma for gaussian distribution. Works only if /network/syn/geom/pmax is not None and /sig is None
            /o2o   = False  ; one-to-one creates only one connection for each rGC to a closest LGN cell.
            /a2a   = False  ; all-to-all connections: each rGC connects all dLGN cells. It is the best configuration for btdp
        }
        /NMDA2AMPA  = 2.25  ; ratio between peak of NMDA current to peak of AMPA current
        /AMPA/{
            /tau1   = 1.    ; AMPA rising time constant
            /tau2   = 2.2   ; AMPA falling time contant From Chen & Regehr 2000 Table 1 P10-P14
        }
        /NMDA/{
            /tau1   = 1.    ; NMDA rising time constant
            /tau2   = 150.  ; NMDA falling time contant From Chen & Regehr 2000 Table 1 P10-P14
        }
        /ppr_u0     = 0.3  \; sets presynaptic single spike depression
                            ; it set to obtain paired-pulse ratio 0.73 From Chen & Regehr 2000 Table 1 P10-P14
       /toneExc     = None  ; Conductance for tonic excitation (None to block)
       /toneInh     = None  ; Conductance for GABA_B receptors (None to block)
    }
......
}
"""
#building model parameter tree and process a CLI arguments
mth = methods(stkdefault, 'mth', locals(), argvs = args )
#and using them 
syn_equ="""
da1/dt = -a1/{/AMPA/tua1}/ms :1
da2/dt = -a2/{/AMPA/tua2}/ms :1
dm1/dt = -m1/{/NMDA/tua1}/ms :1
dm2/dt = -m2/{/NMDA/tua2}/ms :1
""".format(**mth["/network/syn"])

2. Help generation.

 python mymodel.py --help

Some general words about the model
The model was coded by me

Usage: mymodel.py [options] [parameters]

Any model parameter can be altered by /parameter_name=parameter_value in command line

Parameters:
/network/geom/shape             = 'H'
                                : H-hexagonal or S-square

/network/geom/x                 = 7
                                : number of columns

/network/geom/y                 = 16
                                : number of raws

/network/gj/geom/p              = 0.3
                                : gap junction probability

/network/gj/geom/unit           = 'n'
                                : if n - probability of connection per neuron. it 'c' - probability of connection per
                                : pear of neurons within maxd distance

/network/gj/geom/maxd           = 1.5
                                : gap junction max distance

/network/gj/geom/dir            = None
                                : gap junction direction (random if None)

/network/gj/r                   = 700.
                                : gap junction resistance

/network/syn/gsyn               = None
                                : synaptic conductance:   if positive - set for everyone,   if negative - set the
                                : portion of minimal synaptic conductance for each neuron,  if None set to 1/3 of the
                                : minimal synaptic conductance for each neuron.

/network/syn/rand               = False
                                : If ture - gsyn randomly init between 0 and /syn/gsyn value

/network/syn/prenorm            = True
                                : If true the total synaptic conductance will be set to /syn/gsyn value for each neuron
                                : in the **beginning**!   It is active only in model building. For dynamical
                                : normalization see /network/btdp

/network/syn/geom/pmax          = None
                                : peak of synaptic probability for exp or exp2 distribution.  If None closest LGN cells
                                : are connected to each rGC

/network/syn/geom/sig           = None
                                : sigma for exponential distribution. Works only if /network/syn/geom/pmax is not None

/network/syn/geom/sig2          = None
                                : sigma for gaussian distribution. Works only if /network/syn/geom/pmax is not None and
                                : /sig is None

/network/syn/geom/o2o           = False
                                : one-to-one creates only one connection for each rGC to a closest LGN cell.

/network/syn/geom/a2a           = False
                                : all-to-all connections: each rGC connects all dLGN cells. It is the best configuration
                                : for btdp

/network/syn/NMDA2AMPA          = 2.25
                                : ratio between peak of NMDA current to peak of AMPA current

/network/syn/AMPA/tau1          = 1.
                                : AMPA rising time constant

/network/syn/AMPA/tau2          = 2.2
                                : AMPA falling time contant From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/NMDA/tau1          = 1.
                                : NMDA rising time constant

/network/syn/NMDA/tau2          = 150.
                                : NMDA falling time contant From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/ppr_u0             = 0.3
                                : sets presynaptic single spike depression  it set to obtain paired-pulse ratio 0.73
                                : From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/toneExc            = None
                                : Conductance for tonic excitation (None to block)

/network/syn/toneInh            = None
                                : Conductance for GABA_B receptors (None to block)

3. Changes in parameters.

Try model with different AMPA time constant as simple as

 python mymodel.py /network/syn/AMPA/tau2=30.5

3. All parameters are saved in a single database

even with some preview

and diff between model parameters

Of course, one can put lightweight statistics into a “parameter file” and analyze it later. I usually compute mean firing rates or R2 synchronization index at the end of the run, so I can also see the diff in results with the diff in parameters. Moreover, I can collect parameters/data through the database and export it to Pandas/NumPy for analyzes.

4. BUT

Both codes for parameter manipulation and for storing big chunks of data are combined in one simulation toolkit simtoolkit. Unfortunately, all of this isn’t in great shape. It needs lots of development and helps with bug fixing. However, whenever I’m trying to finish documentation and create a GitHub page for this project, I’m sinking into words and postponing the GitHub repository. At this point, it works only in my hands, and I’m not sure if it will be helpful to anyone else.

I must spend a couple of days cleaning code and documenting this project at some point, but if you find it interesting to consider as an alternative, I’ll commit to doing this work asap.

RTH

Thanks for the detailed response!

It sounds like your experiments are at a much larger scale than mine, but a lot of needs for data management overlap.

Some of the hierarchical parameter description is quite similar to the custom classes I ended up putting together to describe my simulations. I usually divide things into specifications for:

  • the time-span / structure of the run
  • the network
    • the neurons
  • any interventions / stimulation

and then try to have similarly structured specifications for any analysis / post-processing. I’ve found this to be quite effective at the small-to-medium scale of simulation both for reminding me the details of the simulation, and acting as a container for often-used post-processing steps. This also allows for easily encoding default parameter values and passing in only the modifications you want to sweep across. It seems like this functionality is also handled well with your command line options to the script!

Your custom database solution sounds like it has a number of useful features, and I would be interested in trying it out! But for now it looks like a pandas dataframe for summary features + a reference to a filename containing the full timeseries (saved currently using numpy’s savez format ) get’s the job done for me. So I don’t think it’s worth dramatically reprioritizing your time, but please keep me updated on progress!

Some further scattered thoughts as I’ve tried to cobble together a solution:

“Wide” vs “Long”/“Tidy” data formats

I’ve been working less with imperative graphing libraries (like matplotlib) where I have low-level control over the pipeline from file/database → extracting arrays → plotting arrays to declarative visualization frameworks based around “grammar of graphics” (altair / vega-lite / ggplot) where the inputs are usually a dataframe and a column name, and the library then handles extracting the data.

This means there are stricter requirements on what structure the data takes by the time it gets visualized. Specifically altair / vega-lite seem to have a strong preference for a “tidy” format, i.e. one row per observation, rather than grid-like layouts.
I’ve been wrestling with pandas melt() function to handle this conversion but I’m currently finding it a little unintuitive.

@rth if there was a simple way to ensure a “tidy-format” export / rearrangement from your database that would be great.

Misc. considerations a good storage system

  • plain-text “human-readable” formats v.s. binary
    • this has an impact on the ability to visually inspect the integrity of the data which is very nice
    • but also seems to impact git’s ability to diff files. Binary files seem very slow to add to git because of this. The standard solution to this seems to be simple to avoid version-controlling binary data.
  • ability to load partial chunks of data
    • this seems to be one of the primary selling points of hdf5
  • easy of integration with summary visualization tools
    • also integration with data-analysis tools like numpy / pandas
  • tight connection between metadata / summary data and primary timeseries

Additional considerations (which are good to have, but don’t strongly impact my user experience at the moment)

  • memory efficiency of storage
  • speed of saving/loading to/from files
  • portability across platforms (i.e. python versions)
1 Like

RTH, a colleague who works on machine learning suggested I look into weights and biases which has

I think the biggest downside is it’s optimized for machine-learning applications, but it seems flexible enough to be able to record biophysical network sims for instance.

This might be worth looking into as a related solution to compare features to.

1 Like

That was precisely my starting point. I used git as an ‘experimental’ journal for many years to keep model parameters and some messages (like old school experimentalists, who had a barn book-keeping journal to write every preparation and manipulation). BTW, the git tag system is very handy to mark some parameters that produce a particular figure, so you can return and reproduce it (or put instruction on ModelDB). However, I got more sophisticated and diverse models after a while, so git failed to keep all I wanted. Two of the critical demands: that parameters should be annotated somehow and that parameters quite often depend on each other, drove me to develop this database.

My practice (maybe quite unhealthy) is to develop analytic code and visualization tools for each project I’m working on. This is almost because it isn’t possible to predict what kind of results you will have and the direction the project will go. So from this perspective, simtoolkit is universal without any specificity, even for computational neuroscience.

That is a good point, @adam. Because in DB, all parameter names are virtually in the tree structure, there are no reasons that extraction of tables of parameters with parameter names as the names of columns should be a problem. It’s SQL DB, and therefore, this must be only one query internally and should be quite fast (I guess). However, it isn’t implemented (yet) almost because I never use simtoolkit in this way.

I should be more explicit here. There are two file types that stktoolkit will offer: stkdb for parameters and lightweight statistics (SQL like DB internally) and stkdata for chunked bulky data.

stkdata won’t offer parallel writing but parallel reading for sure. Variables in stkdata also virtually in the tree structure. So yes, the data/parameters labeling and representation are consistent through the whole module.

@adam, you are asking good questions. Please continue.

1 Like

wonderful! sounds like there’s a lot of overlap between how we’re thinking about these issues.

Your discussion of git tags reminds me of an idea I had a while ago, and partially implemented:
automatically stamping a snapshot of the script name and commit id that generates a figure into the background as a “watermark.”
Obviously you’d take this out of production figures. But the idea of having a default behavior of linking all figures back to their generating code (especially with minimal need for manual intervention) seems very useful

Another thought:
Are there any ways such a system could be well-integrated with brian? i.e. an automated network specification export and import. Some version of this may already exist that I’m not aware of, but if it doesn’t it seems like it would be useful, but potentially a lot of work

1 Like

Now I feel shame that I haven’t githubed the project yet :open_mouth: !

I don’t think so! Note, that stkmethods convert everything in parameters into python objects. So …

stkdefaults="""
/a = 1       ; integer
/b = 1.2     ; float
/c = [1,2,4] ; list
/d =(1,2,3)  ;tuple
/e = "Hello STK" ;string
"""
mth = methods(stkdefaults, 'mth', locals(), argvs = args )
mth.generate()
# Now
for n in mth:
   print (n, type(mth[n]))

and the result is

/a <class 'int'>
/b <class 'float'>
/c <class 'list'>
/d <class 'tuple'>
/e <class 'str'>

So one can put all Brian’s equations directly into the model parameter, although it would look cumbersome.

Moreover, one can address a whole subtree, which can be seen as a dictionary.
This also works very well, and I use it a lot!

syn_equ="""
da1/dt = -a1/{/AMPA/tua1}/ms :1
da2/dt = -a2/{/AMPA/tua2}/ms :1
dm1/dt = -m1/{/NMDA/tua1}/ms :1
dm2/dt = -m2/{/NMDA/tua2}/ms :1
""".format(**mth["/network/syn"])

Finally, and that is pretty handy, methods can do calculations on the fly and refer to each other.

>>> stkdefaults="""
... /a = 3 ; just a number
... /b=@/a@*12; just 12 of them
... """
>>> mth = methods(stkdefaults, 'mth', locals(), argvs = None )
>>> mth.generate()
False
>>> for n in mth: print(n, "=", mth[n])
... 
/a = 3
/b = 36

For example, I compute synaptic amplitude normalization in this way and can switch between peak and space-under-the-curve normalizations in one CLI argument /network/syn/AMPA/norm=@/network/syn/AMPA/peack@ This normalization will be recalculated if I add /network/syn/AMPA/tau2=1000.

I think a more in-depth integration with brian is possible but will make simtoolkit a more niche module. Maybe simtoolkit may have submodules that will tune it to a specific simulator more tightly: for example, import unites into defaults :smiley: .

P.S. I’m going to travel next few days, but I may have a bit of time to work on simtoolkit and push it towards github a little bit

1 Like

This is an amazing thread, thanks for sharing @adam and @rth ! I’ll move it to the Science, Projects, and Showcases category for now, simply because we’d like to reserve the Support section for things that can be answered (which makes it easier for us to keep track of open support section). But maybe we should have something like a “best practice” section for this kind of discussion?

It’s been awhile since I did large-scale explorations/simulations, so in my case I can usually “get away” with a relatively straightforward directory structure storing all the raw data in a directory per simulation, and having a single csv file with all summary statistics, etc. Regarding the initial question, I’d probably simply have a unique name for each simulation stored in the data frame and have that translate to a directory and/or file name. I’ve never actually tried to put something like raw StateMonitor data into a pandas data frame.

Some more random remarks:

Hmm, interesting thought. I wonder why it is not more common to use a figure’s metadata for it! E.g. with matplotlib’s savefig you can do something like plt.savefig('myfig.png', metadata={'script_name': os.path.abspath(__file__)) – there are some standard meta data entries (creator, etc.), but you can add arbitrary keys.

2 Likes

Thanks, there’s lots of great advice / resources here. It seems like the simplest, robust-enough solution currently is indeed to have summary stats in plaintext / csv / yaml then reference the time series as filename to say a numpy binary.

In putting together the notebook for the gaussian delay issue I finally bit the bullet on figuring out how pandas hierarchical indexing works. Here’s a table where the highest index is the neuron group name, and the lower index is the neuron id within that group:

this is particularly convenient for plotting libraries that like DataFrames (plotly, altair etc.)
where you can create all your subplots with:

fig = px.line(dfm, x='time [ms]' ,y='voltage', color='population, facet_row='total_neuron_idx')

(with the caveat that this requires “melting” the dataframe in a slightly unintuitive way)

same data plotted, faceting along different dimensions, only changing a single line of code:

fig = px.line(dfm, x='time [ms]', y='voltage', facet_row='population', color='neuron')

hierarchical_voltage2


This snippets for how I achieved this will be in the gaussian delay notebook, but I’m happy to turn them into a pull request if they end up being useful.

I saw that there’s been some prior work in exporting Brian traces to pandas dataframes:

but as far as I can tell this can only store 1D variables to dataframes, is this correct?

EDIT: now under version control at GitHub - awillats/brian_delayed_gaussian

This is correct, the pandas export only handles 1D arrays. In fact, it even fails with an ugly error message for 2D arrays :slightly_frowning_face: It would be great to support this somehow… The general architecture of the exporter is to deal with each variable individually (and you can ask get_states to only return a single variable). This approach does not quite work for recorded 2D array, since you need at least the t variable to interpret things. But maybe it is worth changing the export so that it handles variables like i and t in a special way?

yeah, for my example above, my wrapper function assigns the channels of the monitor output to the corresponding column of the dataframe, then appends the time [ms] column afterwards.

It makes sense to me for export methods to always add t or at least i as mandatory indexing variables. One minimal way to do this with pandas would be to assign the simulation index to the row index with the set_index() method. This would mean not having to add an additional column just to do this job.
You could then have an optional append_timestamp flag to include t as a new column

EDIT: I have my work-in-progress implementation for expanding voltage monitors to hierarchical dataframes here: