Data persistence between re-calculations

Hi there, I’m wondering if there is any way to maintain data persistence in either “Python in Excel” or Anaconda Code. To be clear, I don’t mean within the execution lifecycle (e.g., instantiating “foo” in cell A1 and then using it in cell B2 when in “linked mode”). I mean within the Excel session.

For example, upon first execution, I’d like to instantiate “foo” in cell A1. But then I’d like to be able to do something like this, in cell A1:
if “foo” doesn’t exist, create and populate it
else if “foo” exists and is empty or if “forceRefresh” is true, populate it
else use it as-is.
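In plain Python, the pattern I’m after would look roughly like this (a sketch only; `load_data()` and `force_refresh` are placeholders for my actual expensive computation and refresh flag):

```python
import pandas as pd

def load_data():
    # Stand-in for the expensive computation that builds "foo".
    return pd.DataFrame({"A": [1, 2, 3]})

force_refresh = False

try:
    foo  # does "foo" survive from an earlier calculation?
except NameError:
    foo = load_data()          # first run: create and populate
else:
    if foo.empty or force_refresh:
        foo = load_data()      # repopulate if empty or a refresh is forced
    # otherwise: use foo as-is

foo
```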

The goal would be to avoid recalculating large datasets every time the workbook recalculates. Given the cloud nature of “Python in Excel” I suspect this wouldn’t be possible, but perhaps it would be in Anaconda Code? Or is it already possible somehow?

There are two options within Anaconda Code.

  1. Define foo in Imports and Definitions. This is run only once each time you modify the Imports and Definitions.
  2. Use Isolated Mode and define foo with Code Object output. Then use a REF() to grab a reference to the Python object.

Both of these will persist foo across executions of other Anaconda Code cells.

Thanks @Jim_Kitchen - I’ll give that a try.

In the meantime, I have a related issue (which may be solved by what you describe). But basically I’m trying to organize my workbook like this:
I have one “init” cell run in linked mode which loads all the reference data used in all my subsequent calculations. The idea is that this data is largely static and ideally would calculate only once, when the sheet is opened. Then I have other cells that perform calculations using that data.

So in this following example, I have my “init” cell in B2, and then I have two subsequent cells in D4 and D5 that call a function defined in B2, and make use of global variables defined in B2. My output is this:

Cell B2: 2025-01-22 12:00:25.252000
After Loading Data: 0.54 seconds
After Filtering: 32.88 seconds
After parsing Timestamp col (1): 43.01 seconds
After parsing Timestamp col (2): 47.96 seconds
Source Data Parsing Complete: 48.03 seconds
Created Merged Data: 50.78 seconds

Cell B2: 2025-01-22 12:01:16.036000
After Loading Data: 0.57 seconds
After Filtering: 34.55 seconds
After parsing Timestamp (1): 46.49 seconds
After parsing Timestamp (2): 52.02 seconds
Source Data Parsing Complete: 52.09 seconds
Created Merged Data: 55.10 seconds

Cell B2: 2025-01-22 12:02:30.600000
After Loading Data: 0.61 seconds
After Filtering: 36.39 seconds
After parsing Timestamp (1): 47.42 seconds
After parsing Timestamp (2): 52.50 seconds
Source Data Parsing Complete: 52.58 seconds
Created Merged Data: 55.58 seconds

Cell B2: 2025-01-22 12:03:26.186000
After Loading Data: 0.60 seconds
After Filtering Forwards: 34.12 seconds
After parsing LMP Timestamp: 44.60 seconds
After parsing Production Timestamp: 49.69 seconds
Source Data Parsing Complete: 49.76 seconds
Created Merged Data: 52.50 seconds

Cell D4: 2025-01-22 12:04:19.986000
Sim Started 2025/1/1: 53.80 seconds
Sim Data Filtered: 53.88 seconds
Sim Data Complete: 53.89 seconds

Cell D5: 2025-01-22 12:04:20.213000
Sim Started 2025/1/0: 54.03 seconds
Sim Data Filtered: 54.11 seconds
Sim Data Complete: 54.11 seconds

What I’m trying to identify (and prevent) is why cell B2 executed four times before D4 and D5 executed - clearly the recalculation overhead is significant.

I’m not sure if it’s something in the way Excel determines when a recalc is required, or something in how I have linked the cells.

If I go with your 2nd suggestion (running in isolated mode and referencing that cell), hopefully I don’t end up in the same boat where it keeps recalculating that independent cell repeatedly.

Also, when trying #2, I’m still running into a #VALUE! error, similar to:
https://community.anaconda.cloud/t/value-error-for-cells-with-a-dict-over-a-certain-size

However, in this case I’m now also getting an error in the log:

---------------------------------
Traceback (most recent call last):
  File "/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 1693, in as_array
    arr = self._interleave(dtype=dtype, na_value=na_value)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 1734, in _interleave
    arr = blk.get_values(dtype)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 2244, in get_values
    values = values.astype(object)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/core/arrays/datetimes.py", line 739, in astype
    return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/core/arrays/datetimelike.py", line 458, in astype
    converted = ints_to_pydatetime(
                ^^^^^^^^^^^^^^^^^^^
MemoryError

My code is roughly:

merged_data = []
for year in historical_data:
    for month in historical_data[year]:
        for period in historical_data[year][month]:
            act_group = historical_data[year][month][period].get('ACT')
            fc_group = historical_data[year][month][period].get('FC')
            if act_group is not None and fc_group is not None:
                merged_df = pd.merge(act_group, fc_group, on=['Year', 'Month', 'Period'], how='inner', suffixes=('_ACT', '_FC'))
                merged_df['Relative_Delta'] = (merged_df['ActVal'] - merged_df['FCVal']) / merged_df['FCVal']
                merged_df = merged_df.assign(Year=year, Month=month, Period=period)
                merged_data.append(merged_df)

merged_data = pd.concat(merged_data, ignore_index=True)

merged_data

(This results in a #VALUE! error, with the log output above.)

However, if I change that last line to

"merged_data"

so that it just returns a string then it works fine without errors.

But the fact that that works, while “returning” the data structure (to be used as a reference) fails with an error, tells me it’s something associated with how the data is represented in the spreadsheet.
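If the datetime conversion is indeed the culprit, one workaround I may try (purely a guess based on the traceback, which dies inside `ints_to_pydatetime` while turning the datetime64 columns into Python objects) is to stringify the Timestamp columns before returning the frame:

```python
import pandas as pd

# Small stand-in for the real merged_data frame.
merged_data = pd.DataFrame(
    {"ts": pd.to_datetime(["2025-01-01", "2025-01-02"]), "val": [1.0, 2.0]}
)

# Convert datetime64 columns to ISO strings so the spreadsheet layer does
# not have to materialize every Timestamp as a Python object.
out = merged_data.copy()
for col in out.select_dtypes(include=["datetime64[ns]"]).columns:
    out[col] = out[col].dt.strftime("%Y-%m-%d %H:%M:%S")

out
```

This trades the datetime dtype for plain strings, so any downstream consumer would need to re-parse them, but it should avoid the per-element object conversion that appears to be exhausting memory.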

EDIT:
I also tried implementing your first suggestion, by defining ‘foo’ in “Imports and Definitions”. The problem I run into here is that I use the REF() function to define the source for the data that goes into ‘foo’, but REF() is not available yet.

I tried simply defining ‘foo’ in there, eg

foo = pd.DataFrame()

Then populating it in cell A1, for example. But although the variable itself persists, the data does not - every time A1 recalculates, ‘foo’ starts empty. I thought perhaps adding

global foo

at the top might solve that, but it does not.
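A pattern that might sidestep this (assuming the Imports and Definitions namespace is shared with the cells, which I haven’t confirmed): mutate a container in place instead of rebinding the name. Rebinding `foo` in a cell creates a fresh binding, but filling a dict defined once in Imports and Definitions should persist:

```python
import pandas as pd

# --- in "Imports and Definitions" (runs once) ---
_cache = {}

def get_foo(loader):
    """Return the cached DataFrame, building it on first use.

    `loader` is a zero-argument callable that produces the data; it is
    only invoked when the cache is missing or empty.
    """
    if "foo" not in _cache or _cache["foo"].empty:
        _cache["foo"] = loader()
    return _cache["foo"]

# --- in a cell such as A1 (may recalculate many times) ---
foo = get_foo(lambda: pd.DataFrame({"A": [1, 2, 3]}))
```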

import pandas as pd
import sqlite3

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Save to Excel
df.to_excel('data.xlsx', index=False)
df_loaded = pd.read_excel('data.xlsx')

# Save to CSV
df.to_csv('data.csv', index=False)

# Load from CSV
df_loaded = pd.read_csv('data.csv')

If your data is more complex or you need to persist larger datasets, consider using a SQLite database or another database supported in Python.

# Connect to SQLite database
conn = sqlite3.connect('data.db')

# Save DataFrame to SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)

# Load DataFrame from SQL
df_loaded = pd.read_sql('SELECT * FROM table_name', conn)