Best practices for large, shared Python Datasets?

I’m building a model that uses historical data, which is then applied forward to future periods using Monte Carlo simulations - let’s say 3 future years’ worth of monthly data, based on ~8 years of historical data (organized into a large table hard-coded in my workbook).

The monthly calculations are somewhat adjustable: I can tweak which percentiles represent outliers in the data, or even which years’ worth of historical data to include in the projection of future data. So I essentially have a large table with all of the months I’m projecting data for, with various toggles/adjustment cells for each month.
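For illustration, here’s a minimal sketch of the kind of percentile toggle described above - this is not the actual workbook code; the function name and the 5th/95th cutoffs are made up for the example:

```python
import numpy as np

def trim_outliers(values, lower_pct=5, upper_pct=95):
    """Keep only values between the given percentiles (hypothetical toggle)."""
    arr = np.asarray(values, dtype=float)
    lo, hi = np.percentile(arr, [lower_pct, upper_pct])
    return arr[(arr >= lo) & (arr <= hi)]

data = np.arange(1, 101)              # 1..100
trimmed = trim_outliers(data, 5, 95)  # drops the extreme tails
```

The two percentile cells would just become spreadsheet inputs read via REF()/xl().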

Originally I built the code in the Microsoft implementation of Python in Excel as essentially one big script using global variables that would parse all the historical data, read in the parameters for the future months, and do all the calculations. My output tab would then reference those global variables by year/month to display the associated statistics and outputs. This worked for a single month or two of data, but would time out before completing the calculation if I extended it any further.

Next I tried splitting the code up so that each step was broken into individual functions, and each month’s worth of data was filtered out and returned as Python objects that got referenced in the next cell - so my table of monthly parameters now also included all of the outputs used in subsequent steps. This worked, but was very slow, and even with manual calculation mode enabled, there were times when Excel, in its infinite wisdom, would decide to recalculate everything from the start even though only half the cells had completed their calculation. I also kept running out of cloud compute time - even after I upgraded to a paid license (apparently even paid licenses don’t have unlimited compute).

So that brought me to Anaconda Code and an attempt to move this workbook to local computation. I tried simply migrating the working code above into Anaconda Code, but I still kept running into issues with Excel triggering a recalculation before all of the cells had finished calculating - and, in fact, unlike the cloud version, Excel would eventually hang and the calculation would never complete no matter how long I left it. Now I’m debating moving back to my initial plan of having all the code in one cell with global variables, but before spinning my wheels on that, I thought I’d see if anyone here had advice or suggestions on how best to approach this problem.

Am I better off breaking this down into independent blocks that run in isolated mode and simply reference each other? Should I keep a large set of cells running in linked mode, all referencing each other? Or does having everything calculate in a single cell, with all output cells referencing that one cell, make more sense?

Open to any thoughts or ideas!

Thanks

Hi Michael

It’s an interesting case. Some things to try:

  1. Extend the timeout for Python in Excel:

  2. Regarding this:

What is your data source? Is it a Power Query query? If it is, could you use Power Query to do the heavy lifting of parsing and data transformation and leave the modeling to Python? If it’s not a query, could you create one from your Table of input historical data and do similar, then connect Python to the connection-only query?

  3. Optimize the code itself. I appreciate this is a bit of a nebulous answer, but often a new approach can yield significant improvements. If you’re willing to share the code (or some version that doesn’t expose sensitive information) in this thread, I could see if anything jumps out in terms of performance.

Suggestion 1 applies to Python in Excel, but 2 and 3 are applicable to both Python in Excel and Anaconda Code, with the small addendum that Anaconda Code does not yet connect to connection-only queries (they must be loaded to a spreadsheet first).

In Anaconda Code, are you using Linked Mode or Isolated Mode for the cells?

Linked Mode uses the same behavior as Python in Excel, allowing you to reference previously defined local variables, but also requiring all Python cells to be recomputed every time any cell changes.

Isolated Mode with Code Object output will avoid the recompute issue and might allow your calculations to finish.

@OwenPrice Good to know on the timeout setting - I didn’t realize it existed - I’ll look into it, though it won’t solve the issue of running out of compute time. As for your second question, for now the data source is just a hard-coded table on one of the sheets in the workbook. I figured that would be easiest/fastest, and in fairness the problem doesn’t seem to be crunching the historical data - that still takes only a couple of seconds. As for optimizing the code, I’ll see if I can distill what I’ve got down to something I can attach in a subsequent reply. ChatGPT (lol) has helped a fair bit already and, in fact, the code runs quite efficiently in the command-line version I created.

@Jim_Kitchen So far I have used linked mode, because I originally coded it in =PY(), where there’s no option to run in isolated mode. The problem, as I noted before, is that Excel sometimes decides it needs to recalculate all cells while it’s still calculating the first set (i.e. even without being triggered by a cell change), and this happens with both Anaconda Code and =PY(). My next step is to try running it in isolated mode, but then I run into the same issue I posted about here: #VALUE! error for cells with a dict over a certain size - Python in Excel / Anaconda Code - Anaconda Community

@OwenPrice, I pulled together a sample of the code - this section is just the first step, prepping the data for later calculation steps.

To begin with, I have two tables in Excel, one called “Actual_Data” containing 113,232 rows with the following structure:

Timestamp                  Period  Value
2012-02-01T00:00:00-05:00  0       284.07
2012-02-01T01:00:00-05:00  0       206.01
2012-02-01T02:00:00-05:00  0       221.21
2024-12-31T21:00:00-05:00  1       292.37
2024-12-31T22:00:00-05:00  1       38.16
2024-12-31T23:00:00-05:00  0       213.26

“Period” will only ever be a 0 or a 1, and the timezone offset is Eastern Time, so it will be either -05:00 or -04:00.
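As an aside, a mix of -05:00 and -04:00 offsets parses cleanly in pandas by reading as UTC first and then converting to Eastern Time. A minimal sketch with two made-up rows in that timestamp format:

```python
import pandas as pd

# Two made-up rows in the Actual_Data timestamp format (EST and EDT offsets)
ts = pd.Series(["2012-02-01T00:00:00-05:00", "2012-07-01T00:00:00-04:00"])

# utc=True handles the mixed fixed offsets; tz_convert restores Eastern wall time
parsed = pd.to_datetime(ts, utc=True).dt.tz_convert("America/New_York")
```

This yields a single tz-aware column that tracks EST/EDT automatically.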

The second table is called “Forecast_Data” containing 322,450 rows with the following structure:

PubDate    Period  Value   FCMonth
1/3/2012   0       220.14  2/1/2012
1/3/2012   0       277.41  3/1/2012
1/3/2012   0       115.61  4/1/2012
1/3/2012   0       192.88  5/1/2012
1/3/2012   0       226.09  6/1/2012
1/10/2025  1       61.18   10/1/2028
1/10/2025  1       199.91  11/1/2028
1/10/2025  1       273.88  12/1/2028

Again, “Period” is a 0 or a 1. “PubDate” is the publish date of the forecast, and “FCMonth” is the month (and period) for which the forecast applies.

My code loads the data and makes all datetimes timezone-aware. One step to reduce the dataset is to filter the forecast data to only forecasts generated on a certain date, plus or minus “deltaDays”, in any year. When I run the code locally in Python 3.13.0 (IDLE Shell) I get the following output:

After Loading Data: 0.17 seconds
After Filtering: 3.33 seconds
After parsing Timestamp: 3.76 seconds
Complete: 3.86 seconds
                         PubDate  Period   Value                   FCMonth  Year  Month         PubDate_month_day
123646 2017-01-04 00:00:00-05:00       1   83.51 2020-12-01 00:00:00-05:00  2020     12 2024-01-04 00:00:00-05:00
123742 2017-01-05 00:00:00-05:00       1   93.51 2020-12-01 00:00:00-05:00  2020     12 2024-01-05 00:00:00-05:00
123838 2017-01-06 00:00:00-05:00       1   92.36 2020-12-01 00:00:00-05:00  2020     12 2024-01-06 00:00:00-05:00
...
198970 2020-01-21 00:00:00-05:00       1   48.42 2020-12-01 00:00:00-05:00  2020     12 2024-01-21 00:00:00-05:00
199066 2020-01-22 00:00:00-05:00       1  119.32 2020-12-01 00:00:00-05:00  2020     12 2024-01-22 00:00:00-05:00
199162 2020-01-23 00:00:00-05:00       1  282.00 2020-12-01 00:00:00-05:00  2020     12 2024-01-23 00:00:00-05:00

(57 lines)

Running that same code in Anaconda Code looks more like this:

After Loading Data: 0.53 seconds
After Filtering: 672.27 seconds
After parsing Timestamp: 697.88 seconds
Complete: 697.98 seconds

though it’s worth noting that at least the result is correct - but the time required for the filtering, and even just the timestamp-parsing step, is crazy. I’m not sure how it works under the hood, but I’d be interested in any ideas you might have.

Code for the above is as follows:

import time
from collections import defaultdict
from datetime import timedelta

import pandas as pd
import pytz

# REF() and to_df() are provided by the Anaconda Code environment

# Start benchmark timing
start_time = time.time()

testDate = REF("HEDGE_DATE")   # datetime(2025, 1, 14)
deltaDays = REF("DELTA_DAYS")  # 10 

testDate = testDate.replace(tzinfo=pytz.timezone('America/New_York'))

historical_data = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

act_df = to_df(REF("Actual_Data"))  # pd.read_csv('C:/Temp/ActData.csv')
fc_df = to_df(REF("Forecast_Data")) # pd.read_csv('C:/Temp/FCData.csv')

elapsed_time = time.time() - start_time
print(f"After Loading Data: {elapsed_time:.2f} seconds")

fc_df = fc_df.drop_duplicates(subset=['PubDate', 'Period', 'FCMonth'])

# Convert PubDate and FCMonth to datetime
fc_df['PubDate'] = pd.to_datetime(fc_df['PubDate'], format='%m/%d/%Y')
fc_df['FCMonth'] = pd.to_datetime(fc_df['FCMonth'], format='%m/%d/%Y')

fc_df['PubDate'] = fc_df['PubDate'].dt.tz_localize('America/New_York')
fc_df['FCMonth'] = fc_df['FCMonth'].dt.tz_localize('America/New_York')

fc_df['Year'] = fc_df['FCMonth'].dt.year
fc_df['Month'] = fc_df['FCMonth'].dt.month

# Filter forecasts to include PubDate within testDate +/- deltaDays for any year
fc_df['PubDate_month_day'] = fc_df['PubDate'].apply(lambda x: x.replace(year=2024)) # Year is arbitrary as long as it is a leap year
start_date = testDate.replace(year=2024) - timedelta(days=deltaDays)
end_date = testDate.replace(year=2024) + timedelta(days=deltaDays)
fc_df = fc_df[(fc_df['PubDate_month_day'] >= start_date) & (fc_df['PubDate_month_day'] <= end_date)]

elapsed_time = time.time() - start_time
print(f"After Filtering: {elapsed_time:.2f} seconds")

# Parse the 'Timestamp' column and convert to America/New_York timezone
act_df['Timestamp'] = pd.to_datetime(act_df['Timestamp'], utc=True)
act_df['Timestamp'] = act_df['Timestamp'].dt.tz_convert('America/New_York')

elapsed_time = time.time() - start_time
print(f"After parsing Timestamp: {elapsed_time:.2f} seconds")

# Extract Year and Month based on the localized time
act_df['Year'] = act_df['Timestamp'].dt.year
act_df['Month'] = act_df['Timestamp'].dt.month

for (year, month, Period), group in act_df.groupby(['Year', 'Month', 'Period']):
    historical_data[year][month][Period]['ACT'] = group

for (year, month, Period), group in fc_df.groupby(['Year', 'Month', 'Period']):
    historical_data[year][month][Period]['FC'] = group

elapsed_time = time.time() - start_time
print(f"Complete: {elapsed_time:.2f} seconds")

print(historical_data[2020][12][1]['FC'])

I think the slowdown is caused by this line:

fc_df['PubDate_month_day'] = fc_df['PubDate'].apply(lambda x: x.replace(year=2024))

datetime objects are immutable, so this line creates a new datetime with the replaced year, row by row, over your 320k-row DataFrame.

I’m sure there are lots of other ways to do it, but here’s one:

fc_df['PubDate_month_day'] = pd.to_datetime(
    '2024-' + 
    fc_df['PubDate'].dt.strftime('%m-%d')
).dt.tz_localize('America/New_York')

This should be faster because it’s using vectorized operations instead of apply, which is row by row.

Edit:
I just tested this revised method using Python in Excel - the speed is now comparable to your IDLE results.

This code:

import time
import pytz
from datetime import timedelta
# pandas is preloaded as pd, and xl() is built in, in Python in Excel

start_time = time.time()
fc_df = xl("Forecast_Data", headers=True) 
print(f"{fc_df.shape=}") 
fc_df = fc_df.drop_duplicates(subset=['PubDate', 'Period', 'FCMonth'])
fc_df['PubDate'] = pd.to_datetime(fc_df['PubDate'], format='%m/%d/%Y')
fc_df['FCMonth'] = pd.to_datetime(fc_df['FCMonth'], format='%m/%d/%Y')
fc_df['PubDate'] = fc_df['PubDate'].dt.tz_localize('America/New_York')
fc_df['FCMonth'] = fc_df['FCMonth'].dt.tz_localize('America/New_York')
fc_df['Year'] = fc_df['FCMonth'].dt.year
fc_df['Month'] = fc_df['FCMonth'].dt.month
testDate = xl("HEDGE_DATE")   # datetime(2025, 1, 14)
deltaDays = xl("DELTA_DAYS")  # 10 
testDate = testDate.replace(tzinfo=pytz.timezone('America/New_York'))
fc_df['PubDate_month_day'] = pd.to_datetime(
    '2024-' + 
    fc_df['PubDate'].dt.strftime('%m-%d')
).dt.tz_localize('America/New_York')

elapsed_time = time.time() - start_time
print(f"After replacing year: {elapsed_time:.2f} seconds")

start_date = testDate.replace(year=2024) - timedelta(days=deltaDays)
end_date = testDate.replace(year=2024) + timedelta(days=deltaDays)
fc_df = fc_df[(fc_df['PubDate_month_day'] >= start_date) & (fc_df['PubDate_month_day'] <= end_date)]

elapsed_time = time.time() - start_time
print(f"After filtering: {elapsed_time:.2f} seconds")
fc_df

Gives this:

fc_df.shape=(334600, 4)
After replacing year: 3.61 seconds
After filtering: 3.62 seconds