Best practices for storing time series + metrics across parameter sweeps

adam · 8 November 2021 18:13

(not sure whether Support is the right category since this is less about the features of Brian itself)

I’ve been using Pandas DataFrames lately, especially to store summary statistics from Brian runs.
As I vary multiple parameters of the simulation I record those across columns of a data frame, and then also record output metrics as columns. This “flattens” high-dimensional parameter sweeps so each row is a single simulation run, and then I can map these back out to say a 2D slice of parameter space afterwards with panda’s functions like groupby(). The result looks something like this:

Challenge

I’d like to tie the summary statistics to the primary time-series data that’s associated with it to more easily cross-reference results and verify the summary measures are quantifying what we’d expect. But this introduces time as a new index, which seems “clunky” to make work with dataframes.

Options I’m considering

storing the time series as it’s own DataFrame, with time as the “instance index”/row and channels as the “feature index”/column. Then embed this time series as a new column of the dataframe
a. either including the dataframe itself by nesting it
b. or referencing an index / filename for where this data is stored so the higher-level dataframe doesn’t get cluttered
using SKTime which is a machine learning library which seems to focus on time-series analysis and is built around extending pandas dataframes:

Sktime is designed to work with timeseries data stored as nested pandas DataFrame objects. Similar to working with pandas DataFrames with tabular data, this allows instances to be represented by rows and the feature data for each dimension of a problem (e.g. variables or features) to be stored in the DataFrame columns.
Creating a (more complicated) nested hierarchical data structure and saving it as hdf5
- incidentally, this is the storage format neurodata without borders uses
- in some ways this seems like the “right solution” but a potentially complicated one
What I used to do in MATLAB was to construct N-dimensional tensors and keep track of which dimension corresponded to which parameters / time index etc.
- this is closest to how I think of the data abstractly
- but seems potentially quite error prone as you have to keep track of indices. And lengths of dimensions need to be carefully compared across instances
- it’s also quite difficult to visually inspect the data file (in a text editor, excel etc.)
- but, 2D slices of parameter grid-sweeps can be indexed in straightforward ways

Feedback

Are there any commonly used ways to do this within the Brian community?
Which of the above approaches seems like the best path forward?
Are there other options worth considering?

rth · 10 November 2021 02:13

@adam, a while ago, I stumbled on a similar problem with a bit of complication, that my simulations are pretty long (up to a few days in model time ). Of course, no one memory can hold even spike times for a whole run, say nothing about time series of voltages or whatever one needs from simulations. The simulator must be stopped every while, intermediate results should be saved, and all buffers must be cleared. So for this kind of data, we need a suitable file to hold chunks of data and keep track of each fragment. I’ve tried Pandas and raw hd5 for this, but both weren’t up to the job. Specifically, hd5 is pretty fragile and doesn’t have enough redundancy to restore data structure when the simulation crashes with bugs and a hd-file doesn’t close properly. I lost 3 days run on this unfortunate ‘feature’ of hd5. I finally figured out that it isn’t a highly complex task and wrote my one python module to hold this kind of data.

But this isn’t the end. Sorry for the long post. My models have lots of parameters, so I thought it may be handy to save parameters for each simulation along with very general statistics of the result, some graphs (in the form of pictures), and my comment on the run. This may be an excellent job for Pandas frame or hd5, but after some tests, I landed to sqlight and general SQL queries. Again, I wrote a wrapper to have a tree-like structure to hold parameters. It got complicated here because I wanted to provide some help-line for each parameter and change parameters through CLI arguments (you see, I’m lazy)! So what I’m using now looks like that:

1. Default configuration.

stkdefault = """
/{
Some general words about the model
The model was coded by me
}

/network/{
    /geom/{
        /shape = 'H'  ; H-hexagonal or S-square
        /x     = 7    ; number of columns
        /y     = 16   ; number of raws
    }
    /gj/{
        /geom/{
            /p     = 0.3    ; gap junction probability
            /unit  = 'n'    ; if n - probability of connection per neuron. it 'c' - probability of connection per pear of neurons within maxd distance
            /maxd  = 1.5    ; gap junction max distance
            /dir   = None   ; gap junction direction (random if None)
        }
        /r     = 700.       ; gap junction resistance
    }
    /syn/{
        /gsyn      = None  \; synaptic conductance: 
                           \; if positive - set for everyone, 
                           \; if negative - set the portion of minimal synaptic conductance for each neuron,
                            ; if None set to 1/3 of the  minimal synaptic conductance for each neuron.
        /rand      = False  ; If ture - gsyn randomly init between 0 and /syn/gsyn value
        /prenorm   = True  \; If true the total synaptic conductance will be set to /syn/gsyn value for each neuron in the **beginning**! 
                            ; It is active only in model building. For dynamical normalization see /network/btdp
        /geom/{
            /pmax  = None  \; peak of synaptic probability for exp or exp2 distribution.
                            ; If None closest LGN cells are connected to each rGC
            /sig   = None   ; sigma for exponential distribution. Works only if /network/syn/geom/pmax is not None
            /sig2  = None   ; sigma for gaussian distribution. Works only if /network/syn/geom/pmax is not None and /sig is None
            /o2o   = False  ; one-to-one creates only one connection for each rGC to a closest LGN cell.
            /a2a   = False  ; all-to-all connections: each rGC connects all dLGN cells. It is the best configuration for btdp
        }
        /NMDA2AMPA  = 2.25  ; ratio between peak of NMDA current to peak of AMPA current
        /AMPA/{
            /tau1   = 1.    ; AMPA rising time constant
            /tau2   = 2.2   ; AMPA falling time contant From Chen & Regehr 2000 Table 1 P10-P14
        }
        /NMDA/{
            /tau1   = 1.    ; NMDA rising time constant
            /tau2   = 150.  ; NMDA falling time contant From Chen & Regehr 2000 Table 1 P10-P14
        }
        /ppr_u0     = 0.3  \; sets presynaptic single spike depression
                            ; it set to obtain paired-pulse ratio 0.73 From Chen & Regehr 2000 Table 1 P10-P14
       /toneExc     = None  ; Conductance for tonic excitation (None to block)
       /toneInh     = None  ; Conductance for GABA_B receptors (None to block)
    }
......
}
"""
#building model parameter tree and process a CLI arguments
mth = methods(stkdefault, 'mth', locals(), argvs = args )
#and using them 
syn_equ="""
da1/dt = -a1/{/AMPA/tua1}/ms :1
da2/dt = -a2/{/AMPA/tua2}/ms :1
dm1/dt = -m1/{/NMDA/tua1}/ms :1
dm2/dt = -m2/{/NMDA/tua2}/ms :1
""".format(**mth["/network/syn"])

2. Help generation.

 python mymodel.py --help

Some general words about the model
The model was coded by me

Usage: mymodel.py [options] [parameters]

Any model parameter can be altered by /parameter_name=parameter_value in command line

Parameters:
/network/geom/shape             = 'H'
                                : H-hexagonal or S-square

/network/geom/x                 = 7
                                : number of columns

/network/geom/y                 = 16
                                : number of raws

/network/gj/geom/p              = 0.3
                                : gap junction probability

/network/gj/geom/unit           = 'n'
                                : if n - probability of connection per neuron. it 'c' - probability of connection per
                                : pear of neurons within maxd distance

/network/gj/geom/maxd           = 1.5
                                : gap junction max distance

/network/gj/geom/dir            = None
                                : gap junction direction (random if None)

/network/gj/r                   = 700.
                                : gap junction resistance

/network/syn/gsyn               = None
                                : synaptic conductance:   if positive - set for everyone,   if negative - set the
                                : portion of minimal synaptic conductance for each neuron,  if None set to 1/3 of the
                                : minimal synaptic conductance for each neuron.

/network/syn/rand               = False
                                : If ture - gsyn randomly init between 0 and /syn/gsyn value

/network/syn/prenorm            = True
                                : If true the total synaptic conductance will be set to /syn/gsyn value for each neuron
                                : in the **beginning**!   It is active only in model building. For dynamical
                                : normalization see /network/btdp

/network/syn/geom/pmax          = None
                                : peak of synaptic probability for exp or exp2 distribution.  If None closest LGN cells
                                : are connected to each rGC

/network/syn/geom/sig           = None
                                : sigma for exponential distribution. Works only if /network/syn/geom/pmax is not None

/network/syn/geom/sig2          = None
                                : sigma for gaussian distribution. Works only if /network/syn/geom/pmax is not None and
                                : /sig is None

/network/syn/geom/o2o           = False
                                : one-to-one creates only one connection for each rGC to a closest LGN cell.

/network/syn/geom/a2a           = False
                                : all-to-all connections: each rGC connects all dLGN cells. It is the best configuration
                                : for btdp

/network/syn/NMDA2AMPA          = 2.25
                                : ratio between peak of NMDA current to peak of AMPA current

/network/syn/AMPA/tau1          = 1.
                                : AMPA rising time constant

/network/syn/AMPA/tau2          = 2.2
                                : AMPA falling time contant From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/NMDA/tau1          = 1.
                                : NMDA rising time constant

/network/syn/NMDA/tau2          = 150.
                                : NMDA falling time contant From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/ppr_u0             = 0.3
                                : sets presynaptic single spike depression  it set to obtain paired-pulse ratio 0.73
                                : From Chen & Regehr 2000 Table 1 P10-P14

/network/syn/toneExc            = None
                                : Conductance for tonic excitation (None to block)

/network/syn/toneInh            = None
                                : Conductance for GABA_B receptors (None to block)

3. Changes in parameters.

Try model with different AMPA time constant as simple as

 python mymodel.py /network/syn/AMPA/tau2=30.5

3. All parameters are saved in a single database

even with some preview

and diff between model parameters

Of course, one can put lightweight statistics into a “parameter file” and analyze it later. I usually compute mean firing rates or R2 synchronization index at the end of the run, so I can also see the diff in results with the diff in parameters. Moreover, I can collect parameters/data through the database and export it to Pandas/NumPy for analyzes.

4. BUT

Both codes for parameter manipulation and for storing big chunks of data are combined in one simulation toolkit simtoolkit. Unfortunately, all of this isn’t in great shape. It needs lots of development and helps with bug fixing. However, whenever I’m trying to finish documentation and create a GitHub page for this project, I’m sinking into words and postponing the GitHub repository. At this point, it works only in my hands, and I’m not sure if it will be helpful to anyone else.

I must spend a couple of days cleaning code and documenting this project at some point, but if you find it interesting to consider as an alternative, I’ll commit to doing this work asap.

RTH

adam · 10 November 2021 14:59

Thanks for the detailed response!

It sounds like your experiments are at a much larger scale than mine, but a lot of needs for data management overlap.

Some of the hierarchical parameter description is quite similar to the custom classes I ended up putting together to describe my simulations. I usually divide things into specifications for:

the time-span / structure of the run
the network
- the neurons
any interventions / stimulation

and then try to have similarly structured specifications for any analysis / post-processing. I’ve found this to be quite effective at the small-to-medium scale of simulation both for reminding me the details of the simulation, and acting as a container for often-used post-processing steps. This also allows for easily encoding default parameter values and passing in only the modifications you want to sweep across. It seems like this functionality is also handled well with your command line options to the script!

Your custom database solution sounds like it has a number of useful features, and I would be interested in trying it out! But for now it looks like a pandas dataframe for summary features + a reference to a filename containing the full timeseries (saved currently using numpy’s savez format ) get’s the job done for me. So I don’t think it’s worth dramatically reprioritizing your time, but please keep me updated on progress!

adam · 10 November 2021 15:40

Some further scattered thoughts as I’ve tried to cobble together a solution:

“Wide” vs “Long”/“Tidy” data formats

I’ve been working less with imperative graphing libraries (like matplotlib) where I have low-level control over the pipeline from file/database → extracting arrays → plotting arrays to declarative visualization frameworks based around “grammar of graphics” (altair / vega-lite / ggplot) where the inputs are usually a dataframe and a column name, and the library then handles extracting the data.

This means there are stricter requirements on what structure the data takes by the time it gets visualized. Specifically altair / vega-lite seem to have a strong preference for a “tidy” format, i.e. one row per observation, rather than grid-like layouts.
I’ve been wrestling with pandas melt() function to handle this conversion but I’m currently finding it a little unintuitive.

@rth if there was a simple way to ensure a “tidy-format” export / rearrangement from your database that would be great.

Misc. considerations a good storage system

plain-text “human-readable” formats v.s. binary
- this has an impact on the ability to visually inspect the integrity of the data which is very nice
- but also seems to impact git’s ability to diff files. Binary files seem very slow to add to git because of this. The standard solution to this seems to be simple to avoid version-controlling binary data.
ability to load partial chunks of data
- this seems to be one of the primary selling points of hdf5
easy of integration with summary visualization tools
- also integration with data-analysis tools like numpy / pandas
tight connection between metadata / summary data and primary timeseries

Additional considerations (which are good to have, but don’t strongly impact my user experience at the moment)

memory efficiency of storage
speed of saving/loading to/from files
portability across platforms (i.e. python versions)

adam · 10 November 2021 18:01

RTH, a colleague who works on machine learning suggested I look into weights and biases which has

incremental logging of diverse data-types
- could fit with your use-case of having to stop mid-simulation to offload data
wrapper functions for managing hyperparameter sweeps
tools for visualizing results in progress

I think the biggest downside is it’s optimized for machine-learning applications, but it seems flexible enough to be able to record biophysical network sims for instance.

This might be worth looking into as a related solution to compare features to.

rth · 10 November 2021 19:33

That was precisely my starting point. I used git as an ‘experimental’ journal for many years to keep model parameters and some messages (like old school experimentalists, who had a barn book-keeping journal to write every preparation and manipulation). BTW, the git tag system is very handy to mark some parameters that produce a particular figure, so you can return and reproduce it (or put instruction on ModelDB). However, I got more sophisticated and diverse models after a while, so git failed to keep all I wanted. Two of the critical demands: that parameters should be annotated somehow and that parameters quite often depend on each other, drove me to develop this database.

My practice (maybe quite unhealthy) is to develop analytic code and visualization tools for each project I’m working on. This is almost because it isn’t possible to predict what kind of results you will have and the direction the project will go. So from this perspective, simtoolkit is universal without any specificity, even for computational neuroscience.

That is a good point, @adam. Because in DB, all parameter names are virtually in the tree structure, there are no reasons that extraction of tables of parameters with parameter names as the names of columns should be a problem. It’s SQL DB, and therefore, this must be only one query internally and should be quite fast (I guess). However, it isn’t implemented (yet) almost because I never use simtoolkit in this way.

I should be more explicit here. There are two file types that stktoolkit will offer: stkdb for parameters and lightweight statistics (SQL like DB internally) and stkdata for chunked bulky data.

stkdata won’t offer parallel writing but parallel reading for sure. Variables in stkdata also virtually in the tree structure. So yes, the data/parameters labeling and representation are consistent through the whole module.

@adam, you are asking good questions. Please continue.

adam · 11 November 2021 15:36

wonderful! sounds like there’s a lot of overlap between how we’re thinking about these issues.

Your discussion of git tags reminds me of an idea I had a while ago, and partially implemented:
automatically stamping a snapshot of the script name and commit id that generates a figure into the background as a “watermark.”
Obviously you’d take this out of production figures. But the idea of having a default behavior of linking all figures back to their generating code (especially with minimal need for manual intervention) seems very useful

Another thought:
Are there any ways such a system could be well-integrated with brian? i.e. an automated network specification export and import. Some version of this may already exist that I’m not aware of, but if it doesn’t it seems like it would be useful, but potentially a lot of work

rth · 11 November 2021 21:37

Now I feel shame that I haven’t githubed the project yet !

I don’t think so! Note, that stkmethods convert everything in parameters into python objects. So …

stkdefaults="""
/a = 1       ; integer
/b = 1.2     ; float
/c = [1,2,4] ; list
/d =(1,2,3)  ;tuple
/e = "Hello STK" ;string
"""
mth = methods(stkdefaults, 'mth', locals(), argvs = args )
mth.generate()
# Now
for n in mth:
   print (n, type(mth[n]))

and the result is

/a <class 'int'>
/b <class 'float'>
/c <class 'list'>
/d <class 'tuple'>
/e <class 'str'>

So one can put all Brian’s equations directly into the model parameter, although it would look cumbersome.

Moreover, one can address a whole subtree, which can be seen as a dictionary.
This also works very well, and I use it a lot!

syn_equ="""
da1/dt = -a1/{/AMPA/tua1}/ms :1
da2/dt = -a2/{/AMPA/tua2}/ms :1
dm1/dt = -m1/{/NMDA/tua1}/ms :1
dm2/dt = -m2/{/NMDA/tua2}/ms :1
""".format(**mth["/network/syn"])

Finally, and that is pretty handy, methods can do calculations on the fly and refer to each other.

>>> stkdefaults="""
... /a = 3 ; just a number
... /b=@/a@*12; just 12 of them
... """
>>> mth = methods(stkdefaults, 'mth', locals(), argvs = None )
>>> mth.generate()
False
>>> for n in mth: print(n, "=", mth[n])
... 
/a = 3
/b = 36

For example, I compute synaptic amplitude normalization in this way and can switch between peak and space-under-the-curve normalizations in one CLI argument /network/syn/AMPA/norm=@/network/syn/AMPA/peack@ This normalization will be recalculated if I add /network/syn/AMPA/tau2=1000.

I think a more in-depth integration with brian is possible but will make simtoolkit a more niche module. Maybe simtoolkit may have submodules that will tune it to a specific simulator more tightly: for example, import unites into defaults .

P.S. I’m going to travel next few days, but I may have a bit of time to work on simtoolkit and push it towards github a little bit

mstimberg · 15 November 2021 15:38

This is an amazing thread, thanks for sharing @adam and @rth ! I’ll move it to the Science, Projects, and Showcases category for now, simply because we’d like to reserve the Support section for things that can be answered (which makes it easier for us to keep track of open support section). But maybe we should have something like a “best practice” section for this kind of discussion?

It’s been awhile since I did large-scale explorations/simulations, so in my case I can usually “get away” with a relatively straightforward directory structure storing all the raw data in a directory per simulation, and having a single csv file with all summary statistics, etc. Regarding the initial question, I’d probably simply have a unique name for each simulation stored in the data frame and have that translate to a directory and/or file name. I’ve never actually tried to put something like raw StateMonitor data into a pandas data frame.

Some more random remarks:

@rth: Your system seems to have some similarities with sumatra (code, paper). I don’t think it has seen much development lately, though…
Regarding Brian integration: Please have a look at the export facilities in the brian2tools package: Base exporter — brian2tools documentation The idea behind this was to make things available in a format that is easy to extend for more specific export facilities. The is currently a “proof-of-concept” exporter that uses this facility to export to a human-readable description in Markdown: Markdown exporter — brian2tools documentation (shoutout to @Vigneswaran who authored all this during his Google Summer of Code).
Regarding HDF5: opinions differ on this, a lot also depends on the size of files. Here are blog posts arguing for the different sides: “Against HDF5”: Cyrille Rossant - Moving away from HDF5 and Cyrille Rossant - Should you use HDF5? , “In favour of HDF5”: http://blog.khinsen.net/posts/2016/01/07/on-hdf5-and-the-future-of-data-management/
I have to say I’m rather in the “anti-HDF5” camp, I quite like the directory-based approach, e.g. the one detailed here: https://www.frontiersin.org/articles/10.3389/fninf.2018.00016/full
Potentially interesting to some of you: The pypet package helps to organize parameter explorations and has some built-in support for Brian: Welcome to pypet’s documentation — pypet 0.5.0 documentation
Regarding versioning big binary data: This is typically handled via git lfs or git annex (basically it stores the data elsewhere, and just puts a text file with the link to the data into the git repo). There are also some more sophisticated solutions built on top of this, e.g. datalad.

Hmm, interesting thought. I wonder why it is not more common to use a figure’s metadata for it! E.g. with matplotlib’s savefig you can do something like plt.savefig('myfig.png', metadata={'script_name': os.path.abspath(__file__)) – there are some standard meta data entries (creator, etc.), but you can add arbitrary keys.

adam · 15 November 2021 18:52

Thanks, there’s lots of great advice / resources here. It seems like the simplest, robust-enough solution currently is indeed to have summary stats in plaintext / csv / yaml then reference the time series as filename to say a numpy binary.

In putting together the notebook for the gaussian delay issue I finally bit the bullet on figuring out how pandas hierarchical indexing works. Here’s a table where the highest index is the neuron group name, and the lower index is the neuron id within that group:

this is particularly convenient for plotting libraries that like DataFrames (plotly, altair etc.)
where you can create all your subplots with:

fig = px.line(dfm, x='time [ms]' ,y='voltage', color='population, facet_row='total_neuron_idx')

(with the caveat that this requires “melting” the dataframe in a slightly unintuitive way)

same data plotted, faceting along different dimensions, only changing a single line of code:

fig = px.line(dfm, x='time [ms]', y='voltage', facet_row='population', color='neuron')

hierarchical_voltage2

This snippets for how I achieved this will be in the gaussian delay notebook, but I’m happy to turn them into a pull request if they end up being useful.

I saw that there’s been some prior work in exporting Brian traces to pandas dataframes:

storing current state to dataframe using get_states()
related github issue Exporting/importing Brian data from/to other formats#306
PandasImportExport class docs

but as far as I can tell this can only store 1D variables to dataframes, is this correct?

EDIT: now under version control at GitHub - awillats/brian_delayed_gaussian

mstimberg · 16 November 2021 20:49

This is correct, the pandas export only handles 1D arrays. In fact, it even fails with an ugly error message for 2D arrays It would be great to support this somehow… The general architecture of the exporter is to deal with each variable individually (and you can ask get_states to only return a single variable). This approach does not quite work for recorded 2D array, since you need at least the t variable to interpret things. But maybe it is worth changing the export so that it handles variables like i and t in a special way?

adam · 16 November 2021 20:58

yeah, for my example above, my wrapper function assigns the channels of the monitor output to the corresponding column of the dataframe, then appends the time [ms] column afterwards.

It makes sense to me for export methods to always add t or at least i as mandatory indexing variables. One minimal way to do this with pandas would be to assign the simulation index to the row index with the set_index() method. This would mean not having to add an additional column just to do this job.
You could then have an optional append_timestamp flag to include t as a new column

EDIT: I have my work-in-progress implementation for expanding voltage monitors to hierarchical dataframes here:

github.com

awillats/brian_delayed_gaussian/blob/main/brian_to_dataframe.py

'''
Open issues:
- Does hierarchical table construction depend on an equal number of neurons per group?
    - it shouldn't have to
'''
#%%

import pandas as pd


#%%
def volt_monitors_to_hier_df(all_monitors, group_names, neuron_names):
    dfh = hier_df_from_lists(group_names, neuron_names)
    for idx,mon in enumerate(all_monitors):
        data_dict = mon.get_states(['t','v'], units=False)
        dfh = expand_volt_monitor_to_hier_df(data_dict, group_name=group_names[idx], df=dfh)

    dfh['time [ms]'] = data_dict['t']
    return dfh

This file has been truncated. show original

kjohnsen · 20 January 2025 20:10

Reviving this thread since I’m doing a large scale exploration of parameters myself now and am running into the unfortunate fact that I can’t just pickle the whole Brian Network…

I’m experiencing some remorse at not having read this thread sooner, but the solution I’ve landed on fares comparably well with these, I think. I approached this problem from the “workflow engine” perspective. The reason is, these systems (like make) can cache intermediate results and rerun only what needs to be run. So no need to manually walk through part of a complex pipeline after changing parameter X.

My pipeline with Snakemake looks like this: I define the step (“rule”) of my workflow that produces a figure, which depends on a table of results tabulated across conditions. This table is defined in another rule of my workflow and is put together from the results of individual conditions. When I ask for a given figure, conditions that haven’t been run or need to be rerun (due to changes in parameters or input files) and the data I need is saved (spike monitor data, summary statistics, etc.). The post-processing rules put together the table and the figure. Each condition’s data is saved according to a programmatically generated filename, so I can go back and explore individual runs. Snakemake even has a “report” generation feature that lets you browse results.

For why I chose Snakemake (there are too many workflow engines out there), here are the 3 top candidates I found:

redun - looks like a great design, automatically caches outputs of functions and tracks dependence on data, but not great adoption
prefect - similar to redun, came earlier and has much bigger adoption. seems to be big in ML data engineering
snakemake - built for bioinformaticists, pretty great for scientific computing projects with file intermediates, more tailored to academic research than data engineering

I ended up going with Snakemake since it had SLURM submission support, though I later found that through Dask you can accomplish the same with Prefect. I would try Prefect if I did it over again.

mstimberg · 22 January 2025 15:52

Hi @kjohnsen. Thanks for sharing your experience. I’ve never used a full workflow engine in this context, I’ve only used functions decorated with joblib’s Memory class in order to not re-calculate values for known parameter values. But this is obviously less powerful and only works for relatively simple setups. I’ve used snakemake a bit and also heard good things about it in general. I am curious: what do you think is missing from it that leads you to say “I would try Prefect if I did it over again”?

As an aside: a long time ago, Andrew Davison created a quite nice tool called sumatra which was specifically tailored to the use case of running simulations with different parameters. It does not seem to be developed any more, though: GitHub - open-research/sumatra

kjohnsen · 22 January 2025 16:34

I did see sumatra from the thread above, but that it wasn’t developed anymore. Like I said, there are way, way, way too many workflow engines out there that can do similar things, so I tried to narrow down choices to the biggest/best ones. Prefect has two main advantages over Snakemake from what I can tell: wider adoption (~17k vs. ~3k GitHub stars) and caching task results automatically (under the hood serializing and storing in a DB). This is much easier for a pure Python workflow. By contrast, Snakemake works with file intermediates between steps, so if you’re using Python (like I am with Brian), you need to go through the extra trouble of serializing and unserializing between steps. This makes more sense for bioinformatics where you’re using a bunch of (other people’s) command line tools that use files for input and output. I chose Snakemake mainly because it had support for SLURM execution, but I since found that this is possible (if not as fully featured) on Prefect through its Dask executor. In general, Snakemake is more tailored to research while Prefect is more tailored to industry ML/data engineering.

mstimberg · 22 January 2025 16:44

Many thanks for the clarification, I definitely need to look into Prefect more.

On a slightly unrelated note : Do you know about the OCNS/INCF Software Working group (disclosure: @sanjayankur31 and I are co-chairing it ) Home | INCF/OCNS Software WG ? This kind of topic would be interesting to present/discuss in that context, I think.

kjohnsen · 22 January 2025 19:52

I would be happy to contribute to a WG discussion on this, though I will probably be more useful in a few months once I have this Snakemake project running end-to-end—this is the first time I’ve used it.

Topic		Replies	Views
Optimization of simulation time using TimedArray Support	1	19	2 June 2025
how to get all data from all monitors? Science, Projects, and Showcases	3	699	19 February 2021
Spiking neural network for time-series prediction Science, Projects, and Showcases	0	387	30 March 2022
Exploring STDP parameters through one network Support performance , synapses	2	597	22 April 2022
Semi-explicit equation Support	5	192	26 June 2023