Forecasting US Cocoa Bean Imports

Data: Census Data 2020-2024: US Cocoa Bean Imports

Goal: Forecast the next 3 months of US Cocoa Bean Imports (the first 3 months of 2025) in terms of weight, which would ultimately be used to estimate revenue.


Below is a preview of the results of the forecasting models that predict the next 3 months of US Cocoa Bean Imports. They were evaluated on MAE (mean absolute error) and MAPE (mean absolute percentage error), which average the absolute errors and absolute percentage errors of the forecasts across all the horizon/cutoff splits. H1, H2, H3 are horizons 1, 2, 3 - the number of months after the cutoff date (the point where the data is cut before forecasting). The Triple Exponential Smoothing model on the raw (non-log-transformed) data in the 1st row performed the best overall (see results at end for detail):
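
For reference, here is a minimal sketch (not part of the original pipeline) of how MAE and MAPE are computed for each horizon from a table of cross-validation results; the column names mirror the CV tables shown later:

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Illustrative rows taken from the tuned model's CV table further below
cv = pd.DataFrame({
    "horizon": [1, 1, 2, 2],
    "y":    [15_172_388, 11_741_002, 17_305_596, 64_702_033],   # actuals
    "yhat": [19_758_979, 17_314_661, 18_983_433, 39_890_376],   # forecasts
})

# Average the absolute (and absolute percentage) errors within each horizon
for h, grp in cv.groupby("horizon"):
    mae = mean_absolute_error(grp["y"], grp["yhat"])
    mape = mean_absolute_percentage_error(grp["y"], grp["yhat"])
    print(f"H{h}: MAE = {mae:,.0f} | MAPE = {mape:.0%}")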

In [1]:
from IPython.display import Image, display
display(Image(url="https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/main/1_projects/machine_learning/time_series_and_survival/cocoa_imports/cocoa_imports_final_model_performance.png",
    width=800))

Final Model Fit & Forecast:

In [2]:
display(Image(url="https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/main/1_projects/machine_learning/time_series_and_survival/cocoa_imports/cocoa_imports_final_model_plot.png",
    width=800))


Install Packages:

In [ ]:
!pip install pmdarima

Import Libraries:

In [4]:
import itertools
import logging
import os
import time
import warnings

import numpy as np
import pandas as pd
import requests
from tabulate import tabulate

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.ticker as ticker
from matplotlib.ticker import FuncFormatter
import plotly.express as px
import seaborn as sns

import cmdstanpy
import prophet
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

import statsmodels.api as sm
import pmdarima as pm
from pmdarima.arima import auto_arima
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot, quarter_plot
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

logging.getLogger('cmdstanpy').setLevel(logging.WARNING)
os.environ["TQDM_NOTEBOOK"] = "false"

warnings.filterwarnings("ignore")

Custom Palette:

In [5]:
custom_palette = [
    '#cb95f9',
    '#FAC287',
    '#895b3a',
    '#8383F7',
    '#C2C8FF',
    '#CACACC',
    '#C0F0EF',
    '#101827',
    '#1bc2bb'
]
In [6]:
sns.palplot(custom_palette)
plt.show()


I pulled the data from the US census.gov API for 2020-2024. It includes the total estimate of US Cocoa Bean Imports (weight in lbs, spend in $s), broken out by region/country:

Example URL:

https://api.census.gov/data/timeseries/intltrade/imports/enduse?get=CTY_CODE,CTY_NAME,I_ENDUSE,I_ENDUSE_LDESC,GEN_VAL_MO,CON_VAL_MO&I_ENDUSE=00010&YEAR=2013&MONTH=02

In [7]:
base_url = "https://api.census.gov/data/timeseries/intltrade/imports/enduse"
get_fields = "CTY_CODE,CTY_NAME,I_ENDUSE,I_ENDUSE_LDESC,GEN_VAL_MO,CON_VAL_MO,AIR_WGT_MO,CNT_WGT_MO,VES_WGT_MO"
I_ENDUSE = "00010"
start_year = 2020
end_year = 2024
months = range(1, 13)

all_data = []

for year in range(start_year, end_year + 1):
    for month in months:
        time_str = f"{year}-{month:02}"
        url = f"{base_url}?get={get_fields}&I_ENDUSE={I_ENDUSE}&time={time_str}"
        response = requests.get(url)
        if response.status_code == 200:
            data = response.json()
            headers = data[0]
            values = data[1:]
            all_data.extend(values)
        else:
            print(f"Error fetching {time_str}: {response.status_code}")

df = pd.DataFrame(all_data, columns=headers)
print(df)
     CTY_CODE            CTY_NAME I_ENDUSE I_ENDUSE_LDESC GEN_VAL_MO  \
0        6040    PAPUA NEW GUINEA    00010    COCOA BEANS    1825500   
1        2489             GRENADA    00010    COCOA BEANS       4172   
2        3010            COLOMBIA    00010    COCOA BEANS     254616   
3        3070           VENEZUELA    00010    COCOA BEANS     281799   
4        3310             ECUADOR    00010    COCOA BEANS    9281115   
...       ...                 ...      ...            ...        ...   
3018     2470  DOMINICAN REPUBLIC    00010    COCOA BEANS    8116783   
3019     7830            TANZANIA    00010    COCOA BEANS     105958   
3020     7880          MADAGASCAR    00010    COCOA BEANS     142291   
3021     2XXX     CENTRAL AMERICA    00010    COCOA BEANS    8116783   
3022     3XXX       SOUTH AMERICA    00010    COCOA BEANS   77767295   

     CON_VAL_MO AIR_WGT_MO CNT_WGT_MO VES_WGT_MO I_ENDUSE     time  
0       1825500          0     731520     731520    00010  2020-01  
1          4172       1090          0          0    00010  2020-01  
2        254616          0      75950      75950    00010  2020-01  
3        281799          0     113401     113401    00010  2020-01  
4       9281115          0    3359617    3359617    00010  2020-01  
...         ...        ...        ...        ...      ...      ...  
3018    8116783          0    1086180    1086180    00010  2024-12  
3019     105958          0      12826      12826    00010  2024-12  
3020     142291        210      13219      13219    00010  2024-12  
3021    8116783          0    1086180    1086180    00010  2024-12  
3022   77767295       5188    9285205    9285205    00010  2024-12  

[3023 rows x 11 columns]
In [8]:
df.to_csv("us_cocoa_imports_2020_2024.csv", index=False)

Variables:

https://api.census.gov/data/timeseries/intltrade/imports/enduse/variables.html

Variable Dictionary:

In [9]:
variable_dict = {
  "CTY_CODE": "Country Code",
  "CTY_NAME": "Country Name",
  "I_ENDUSE": "Import End Use Code",
  "I_ENDUSE_LDESC": "Import End Use Long Description",
  "GEN_VAL_MO": "General Imports Total Value",
  "CON_VAL_MO": "Imports for Consumption, Total Value",
  "AIR_WGT_MO": "Air Shipping Weight",
  "CNT_WGT_MO": "Containerized Vessel Shipping Weight",
  "VES_WGT_MO": "Vessel Shipping Weight"}
In [10]:
print(tabulate(variable_dict.items(), headers=["Variable", "Description"]))
Variable        Description
--------------  ------------------------------------
CTY_CODE        Country Code
CTY_NAME        Country Name
I_ENDUSE        Import End Use Code
I_ENDUSE_LDESC  Import End Use Long Description
GEN_VAL_MO      General Imports Total Value
CON_VAL_MO      Imports for Consumption, Total Value
AIR_WGT_MO      Air Shipping Weight
CNT_WGT_MO      Containerized Vessel Shipping Weight
VES_WGT_MO      Vessel Shipping Weight


The data was complete and there were no missing values:

There are ~3K monthly records for the timeframe that I pulled.

In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3023 entries, 0 to 3022
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   CTY_CODE        3023 non-null   object
 1   CTY_NAME        3023 non-null   object
 2   I_ENDUSE        3023 non-null   object
 3   I_ENDUSE_LDESC  3023 non-null   object
 4   GEN_VAL_MO      3023 non-null   object
 5   CON_VAL_MO      3023 non-null   object
 6   AIR_WGT_MO      3023 non-null   object
 7   CNT_WGT_MO      3023 non-null   object
 8   VES_WGT_MO      3023 non-null   object
 9   I_ENDUSE        3023 non-null   object
 10  time            3023 non-null   object
dtypes: object(11)
memory usage: 259.9+ KB

Add "weight" variable, which is a combination of air, container, and vessel weights:

In [12]:
df["Weight"] = df["AIR_WGT_MO"].astype(int) + df["CNT_WGT_MO"].astype(int) + df["VES_WGT_MO"].astype(int)

Drop unneeded variables: the country code, the cocoa beans end-use code & description, the consumption value (CON_VAL_MO, imports for consumption only, vs. GEN_VAL_MO, which covers all imports), and the individual weight fields:

In [13]:
df.drop(columns=["CTY_CODE","I_ENDUSE","I_ENDUSE_LDESC","CON_VAL_MO","AIR_WGT_MO","CNT_WGT_MO","VES_WGT_MO"], inplace=True)

Rename the country, spend, and date columns:

In [14]:
df.rename(columns={'CTY_NAME':'Country'}, inplace=True)
df.rename(columns={'GEN_VAL_MO':'Spend'}, inplace=True)
df.rename(columns={'time':'Import Date'}, inplace=True)

Change datatypes:

In [15]:
df['Import Date'] = pd.to_datetime(
    df['Import Date'],
    format='%Y-%m',
    errors='raise'
)
In [16]:
df['Year'] = df['Import Date'].dt.year
In [17]:
df['Month'] = df['Import Date'].dt.month
In [18]:
df['Spend']=df['Spend'].astype(int)

Add field for price per pound:

In [19]:
df['Price per Pound'] = df['Spend'] / df['Weight']

Field creation introduced NAs (from dividing by zero weights); fill NAs with 0:

In [20]:
df['Price per Pound']=df['Price per Pound'].fillna(0)

View country names:

In [21]:
df['Country'].unique()
Out[21]:
array(['PAPUA NEW GUINEA', 'GRENADA', 'COLOMBIA', 'VENEZUELA', 'ECUADOR',
       'PERU', 'BRAZIL', 'FIJI', "COTE D'IVOIRE", 'GHANA',
       'DEMOCRATIC REPUBLIC OF THE CONGO', 'TOTAL FOR ALL COUNTRIES',
       'NORTH AMERICA', 'AFRICA', 'PACIFIC RIM COUNTRIES', 'CAFTA-DR',
       'AUSTRALIA AND OCEANIA', 'USMCA (NAFTA)',
       'TWENTY LATIN AMERICAN REPUBLICS', 'OECD', 'LAFTA', 'APEC', 'CACM',
       'MEXICO', 'GUATEMALA', 'COSTA RICA', 'DOMINICAN REPUBLIC',
       'TANZANIA', 'MADAGASCAR', 'CENTRAL AMERICA', 'SOUTH AMERICA',
       'CAMEROON', 'HAITI', 'INDIA', 'BOLIVIA', 'FRANCE', 'SWITZERLAND',
       'EUROPEAN UNION', 'EUROPE', 'ASIA', 'NATO', 'EURO AREA',
       'NICARAGUA', 'INDONESIA', 'PHILIPPINES', 'NIGERIA', 'CONGO',
       'ASEAN', 'NETHERLANDS', 'UGANDA', 'BELIZE', 'EL SALVADOR',
       'JAMAICA', 'TRINIDAD AND TOBAGO', 'HONDURAS', 'VIETNAM', 'TURKEY',
       'ITALY', 'CANADA', 'THAILAND', 'TOGO', 'BELGIUM', 'SIERRA LEONE',
       'MALAYSIA', 'GERMANY', 'CZECH REPUBLIC', 'PANAMA', 'SPAIN',
       'VANUATU', 'SAMOA'], dtype=object)
In [22]:
df['Country'].nunique()
Out[22]:
70

Identify groups of countries to remove, to keep only individual countries:

In [23]:
groups_to_remove = [
    "TOTAL FOR ALL COUNTRIES", "USMCA (NAFTA)", "OECD", "LAFTA", "APEC", "CACM",
    "EUROPEAN UNION", "EUROPE", "EURO AREA", "NATO", "ASEAN",
    "AFRICA", "ASIA", "NORTH AMERICA", "CENTRAL AMERICA", "SOUTH AMERICA",
    "AUSTRALIA AND OCEANIA", "TWENTY LATIN AMERICAN REPUBLICS",
    "PACIFIC RIM COUNTRIES"
]

df = df[~df["Country"].isin(groups_to_remove)]

Assign continent group to countries:

In [24]:
continent_map = {
    # Asia
    'INDONESIA': 'Asia', 'VIETNAM': 'Asia',
    'INDIA': 'Asia', 'THAILAND': 'Asia', 'PHILIPPINES': 'Asia', 'MALAYSIA': 'Asia',
    'TURKEY': 'Asia', #Turkey is 97% in Asia - 3% in Europe

    # Africa
    'CAMEROON': 'Africa', "COTE D'IVOIRE": 'Africa', 'GHANA': 'Africa', 'NIGERIA': 'Africa',
    'DEMOCRATIC REPUBLIC OF THE CONGO': 'Africa', 'CONGO': 'Africa', 'TANZANIA': 'Africa', 'UGANDA': 'Africa',
    'TOGO': 'Africa', 'SIERRA LEONE': 'Africa', 'MADAGASCAR': 'Africa',

    # Europe
    'NETHERLANDS': 'Europe', 'FRANCE': 'Europe', 'SWITZERLAND': 'Europe',
    'ITALY': 'Europe', 'BELGIUM': 'Europe', 'GERMANY': 'Europe', 'CZECH REPUBLIC': 'Europe', 'SPAIN': 'Europe',

    # North America
    'EL SALVADOR': 'North America', 'GRENADA': 'North America', 'PANAMA': 'North America',
    'CANADA': 'North America', 'MEXICO': 'North America', 'GUATEMALA': 'North America', 'HONDURAS': 'North America',
    'COSTA RICA': 'North America', 'HAITI': 'North America', 'DOMINICAN REPUBLIC': 'North America',
    'JAMAICA': 'North America',
    'TRINIDAD AND TOBAGO': 'North America', 'BELIZE': 'North America', 'NICARAGUA': 'North America',
    'CAFTA-DR': 'North America',

    # South America
    'ECUADOR': 'South America', 'PERU': 'South America', 'BRAZIL': 'South America', 'COLOMBIA': 'South America',
    'BOLIVIA': 'South America', 'VENEZUELA': 'South America',

    # Oceania
    'FIJI': 'Oceania', 'VANUATU': 'Oceania', 'SAMOA': 'Oceania', 'PAPUA NEW GUINEA': 'Oceania'

}

df['Continent'] = df['Country'].map(continent_map).fillna('Other')

print(df)
                 Country    Spend Import Date   Weight  Year  Month  \
0       PAPUA NEW GUINEA  1825500  2020-01-01  1463040  2020      1   
1                GRENADA     4172  2020-01-01     1090  2020      1   
2               COLOMBIA   254616  2020-01-01   151900  2020      1   
3              VENEZUELA   281799  2020-01-01   226802  2020      1   
4                ECUADOR  9281115  2020-01-01  6719234  2020      1   
...                  ...      ...         ...      ...   ...    ...   
3016          COSTA RICA        0  2024-12-01        0  2024     12   
3017               HAITI        0  2024-12-01        0  2024     12   
3018  DOMINICAN REPUBLIC  8116783  2024-12-01  2172360  2024     12   
3019            TANZANIA   105958  2024-12-01    25652  2024     12   
3020          MADAGASCAR   142291  2024-12-01    26648  2024     12   

      Price per Pound      Continent  
0            1.247744        Oceania  
1            3.827523  North America  
2            1.676208  South America  
3            1.242489  South America  
4            1.381276  South America  
...               ...            ...  
3016         0.000000  North America  
3017         0.000000  North America  
3018         3.736389  North America  
3019         4.130594         Africa  
3020         5.339650         Africa  

[1906 rows x 8 columns]
In [25]:
new_order = ['Continent','Country', 'Import Date', 'Month','Year', 'Weight', 'Price per Pound', 'Spend']

df = df[new_order]
In [26]:
df.head()
Out[26]:
Continent Country Import Date Month Year Weight Price per Pound Spend
0 Oceania PAPUA NEW GUINEA 2020-01-01 1 2020 1463040 1.247744 1825500
1 North America GRENADA 2020-01-01 1 2020 1090 3.827523 4172
2 South America COLOMBIA 2020-01-01 1 2020 151900 1.676208 254616
3 South America VENEZUELA 2020-01-01 1 2020 226802 1.242489 281799
4 South America ECUADOR 2020-01-01 1 2020 6719234 1.381276 9281115
In [27]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1906 entries, 0 to 3020
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Continent        1906 non-null   object        
 1   Country          1906 non-null   object        
 2   Import Date      1906 non-null   datetime64[ns]
 3   Month            1906 non-null   int32         
 4   Year             1906 non-null   int32         
 5   Weight           1906 non-null   int64         
 6   Price per Pound  1906 non-null   float64       
 7   Spend            1906 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int32(2), int64(2), object(2)
memory usage: 119.1+ KB

There are ~1.9K records now, covering 51 countries:

In [28]:
df['Country'].nunique()
Out[28]:
51


Total US Cocoa Bean Imports | Spend by Year:

Spend in 2024 (1.2B) was a close 2nd to the highest which was in 2021 (1.3B):

In [29]:
spend_by_year = df.groupby(['Year'])['Spend'].sum().reset_index()

def axis_format(x, pos=None):
    if x >= 1e12:
        return f'{x/1e12:.1f}T'
    elif x >= 1e9:
        return f'{x/1e9:.1f}B'
    elif x >= 1e6:
        return f'{x/1e6:.1f}M'
    else:
        return f'{x:.0f}'

ax = sns.barplot(
    data=spend_by_year,
    x='Year',
    y='Spend',
    hue='Year',
    palette=custom_palette[:5],
    legend=False
)

for container in ax.containers:
    labels = [axis_format(v) for v in container.datavalues]
    ax.bar_label(container, labels=labels, padding=3)

ax.yaxis.set_major_formatter(mtick.FuncFormatter(axis_format))

plt.title('Total Spend by Year')
plt.ylabel('Total Spend')
plt.xlabel('Year')
plt.xticks(rotation=45)
plt.show()

Although spend was high, that was driven by high prices. The volume (weight) had actually been trending down and in 2024 was at its lowest in 5 years:

In [30]:
weight_by_year = df.groupby(['Year'])['Weight'].sum().reset_index()

def axis_format(x, pos=None):
    if x >= 1e12:
        return f'{x/1e12:.1f}T'
    elif x >= 1e9:
        return f'{x/1e9:.1f}B'
    elif x >= 1e6:
        return f'{x/1e6:.1f}M'
    else:
        return f'{x:.0f}'

ax = sns.barplot(
    data=weight_by_year,
    x='Year',
    y='Weight',
    hue='Year',
    palette=custom_palette[:5],
    legend=False
)

for container in ax.containers:
    labels = [axis_format(v) for v in container.datavalues]
    ax.bar_label(container, labels=labels, padding=3)

ax.yaxis.set_major_formatter(mtick.FuncFormatter(axis_format))

plt.title('Total Weight by Year')
plt.ylabel('Total Weight')
plt.xlabel('Year')
plt.xticks(rotation=45)
plt.show()

Monthly aggregated chart of Spend, Price per Pound, & Weight:

In [31]:
monthly_df = (
    df
    .groupby('Import Date', as_index=False)
    .agg({
        'Weight': 'sum',
        'Spend': 'sum'
    })
)

monthly_df['Price per Pound'] = monthly_df['Spend'] / monthly_df['Weight']
monthly_df['Year'] = monthly_df['Import Date'].dt.year
monthly_df['Month'] = monthly_df['Import Date'].dt.strftime('%b')
In [32]:
monthly_df.head()
Out[32]:
Import Date Weight Spend Price per Pound Year Month
0 2020-01-01 77863938 110827113 1.423343 2020 Jan
1 2020-02-01 118967413 183396139 1.541566 2020 Feb
2 2020-03-01 117594405 160633331 1.365995 2020 Mar
3 2020-04-01 72523979 109295892 1.507031 2020 Apr
4 2020-05-01 58180114 71059481 1.221371 2020 May

US Cocoa Bean Imports YOY | Spend:

In [33]:
plt.figure(figsize=(12, 8))

for i, year in enumerate(monthly_df['Year'].unique()):
    year_data = monthly_df[monthly_df['Year'] == year]
    plt.plot(year_data['Month'], year_data['Spend'],
             marker='o', linewidth=2.5, markersize=6,
             color=custom_palette[i % len(custom_palette)],
             label=str(year), markerfacecolor='white',
             markeredgewidth=2)

plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.title('Cocoa Bean US Imports: Spend (YOY)', pad=20)
plt.xlabel('Month')
plt.ylabel('Total Spend')
plt.legend(title='Year', loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.subplots_adjust(right=0.85)
plt.show()

US Cocoa Bean Imports YOY | Price per Pound:

In [34]:
plt.figure(figsize=(12, 8))

for i, year in enumerate(monthly_df['Year'].unique()):
    year_data = monthly_df[monthly_df['Year'] == year]
    plt.plot(year_data['Month'], year_data['Price per Pound'],
             marker='o', linewidth=2.5, markersize=6,
             color=custom_palette[i % len(custom_palette)],
             label=str(year), markerfacecolor='white',
             markeredgewidth=2)

plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.2f}'))
plt.title('Cocoa Bean US Imports: Price per Pound (YOY)', pad=20)
plt.xlabel('Month')
plt.ylabel('Price per Pound')
plt.legend(title='Year', loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.subplots_adjust(right=0.85)
plt.show()

US Cocoa Bean Imports YOY | Weight:

In [35]:
plt.figure(figsize=(12, 8))

for i, year in enumerate(monthly_df['Year'].unique()):
    year_data = monthly_df[monthly_df['Year'] == year]
    plt.plot(year_data['Month'], year_data['Weight'],
             marker='o', linewidth=2.5, markersize=6,
             color=custom_palette[i % len(custom_palette)],
             label=str(year), markerfacecolor='white',
             markeredgewidth=2)
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.title('Cocoa Bean US Imports: Weight (YOY)', pad=20)
plt.xlabel('Month')
plt.ylabel('Total Weight')
plt.legend(title='Year', loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.subplots_adjust(right=0.85)
plt.show()


I first performed decomposition on the data to get an idea of its components (trend, seasonality, residuals):

Run sequence plots:

These charts plot the series as one continuous sequence (rather than split YOY, etc.). They are a good visual check for trend, seasonality, and other patterns.

In [36]:
def run_sequence_plot(x, y, title, xlabel="time", ylabel="series", width=13, height=4):
    plt.figure(figsize=(width, height))
    plt.plot(x, y, 'k-')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.grid(alpha=0.3)
    plt.show()

The Weight and Spend show strong seasonal patterns:

In [37]:
run_sequence_plot(monthly_df['Import Date'], monthly_df['Weight'],
                  title="US Cocoa Bean Imports by Month (Weight lbs)")
In [38]:
run_sequence_plot(monthly_df['Import Date'], monthly_df['Spend'],
                  title="US Cocoa Bean Imports by Month (Spend)")

The Price shows a strong upward trend:

In [39]:
run_sequence_plot( monthly_df['Import Date'],  monthly_df['Price per Pound'],
                  title="US Cocoa Bean Imports by Month (Price per Pound)")

Transformations (to remove trend and seasonality):

Full decomposition for Weight (breaks out run sequence, trend, seasonality, residuals):

In [40]:
data = monthly_df.set_index('Import Date')['Weight']
In [41]:
display(data.head())
Import Date
2020-01-01     77863938
2020-02-01    118967413
2020-03-01    117594405
2020-04-01     72523979
2020-05-01     58180114
Name: Weight, dtype: int64

Period: 12 for yearly seasonality:

In [42]:
ss_decomposition = seasonal_decompose(x=data, model='additive', period=12)
estimated_trend = ss_decomposition.trend
estimated_seasonal = ss_decomposition.seasonal
estimated_residual = ss_decomposition.resid
In [43]:
fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)

axes[0].plot(data, label='Original', color='black')
axes[0].legend(loc='upper left')

axes[1].plot(estimated_trend, label='Trend', color='black')
axes[1].legend(loc='upper left')

axes[2].plot(estimated_seasonal, label='Seasonality', color='black')
axes[2].legend(loc='upper left')

axes[3].plot(estimated_residual, label='Residuals', color='black')
axes[3].legend(loc='upper left')
Out[43]:
<matplotlib.legend.Legend at 0x12dc59a60>

(The first chart in the group is the run sequence again)

Trend: The trend is an increasing or decreasing slope over time. This shows a downward trend in cocoa bean imports over the period.

Seasonality: Seasonality is a repeated, periodic pattern over time. This chart shows that, for the most part, cocoa bean imports peak at the beginning of each year and drop off throughout the year.

Residuals: The left-over fluctuations after removing the trend & seasonality. These appear mostly random and uncorrelated. When the residuals are random, uncorrelated, roughly normally distributed, and have a mean around 0, it means the trend and seasonality have effectively been identified and removed from the data. Here the mean is around 0, it is constant, and there is no trend or seasonality left.
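
A quick numeric check of those residual properties (a minimal sketch, run after the decomposition above; the NaN edge values are dropped first):

# Sanity-check the decomposition residuals
resid = estimated_residual.dropna()

print(f"mean: {resid.mean():,.0f}")   # should be near 0 relative to the series scale
print(f"std:  {resid.std():,.0f}")
print(f"skew: {resid.skew():.2f}")    # near 0 for a roughly symmetric distribution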

Plotting the residuals in a histogram shows that they roughly follow a normal distribution:

In [44]:
plt.figure(figsize=(10, 5))
plt.hist(estimated_residual, bins=30, color=custom_palette[0], edgecolor='gray')
plt.title('Histogram of Residuals')
plt.xlabel('Residual Value')
plt.ylabel('Frequency')
plt.show()

The first and last 6 values of the residuals are NaN and are removed:

In [45]:
print(estimated_residual[-6:])
Import Date
2024-07-01   NaN
2024-08-01   NaN
2024-09-01   NaN
2024-10-01   NaN
2024-11-01   NaN
2024-12-01   NaN
Name: resid, dtype: float64
In [46]:
print(estimated_residual[:6])
Import Date
2020-01-01   NaN
2020-02-01   NaN
2020-03-01   NaN
2020-04-01   NaN
2020-05-01   NaN
2020-06-01   NaN
Name: resid, dtype: float64

Autocorrelation: Autocorrelation shows whether the series is correlated with its previous periods (lags). Since the data is monthly, lag 12 corresponds to one year. There is significant (outside the blue band) correlation at the first few lags and at lag 12.

(Ignore lag 0: it always appears and is equal to 1, because it is the correlation of the current value with itself.)

Lags 1-2: this month's values are correlated with those from 1 and 2 months ago. Lag 12: this month's values are correlated with those from 12 months ago. High autocorrelation at small lags with slow decay is characteristic of a trend.

I plotted 24 lags to see if there is a seasonality cycle longer than 12, but there is not. 12 is the last significant lag, so the period for the decomposition should remain at 12 for annual seasonality.

Lags 1, 2 & 12 are significant in the autocorrelation (ACF), but in the partial autocorrelation (PACF), which removes the dependence explained by earlier lags, only lags 1 and 12 are significant. This shows the current value is correlated with the previous value (an AR(1) term) and with the value 12 months ago (a seasonal AR term, since the data is monthly).

In [47]:
plot_acf(data, lags=24)
plot_pacf(data, lags=24)
plt.show()
In [48]:
residuals_final = estimated_residual[6:-6]
#same as estimated_residual.dropna()

The decomposition residuals show autocorrelation at lag 4. I gave SARIMA the option to model this later (stepwise search turned off, orders allowed up to 4), but a model including this lag was not the best performer, so the effect was not that important:

In [49]:
plot_acf(residuals_final)
plot_pacf(residuals_final)
plt.show()

The Augmented Dickey-Fuller test confirms the original data is non-stationary: we cannot reject the null hypothesis that the data is non-stationary (p ≈ 0.96):

(Stationary data would have a constant mean & variance, no autocorrelation, and no periodic component)

In [50]:
adf_b4, pvalue_b4, usedlag_, nobs_, critical_values_, icbest_ = adfuller(data)
print(f"ADF: {adf_b4:.6f}")
print(f"p-value: {pvalue_b4:.9f}")
ADF: 0.021728
p-value: 0.960292504

Test the residuals for stationarity (after removing the first and last 6 values, since they are NaN):

After removing the trend and seasonality, the residuals are stationary (p-value < 0.05) and no further differencing is needed. This decomposition therefore adequately captures the patterns in the data and is a good way to understand it:

In [51]:
adf_b4, pvalue_b4, usedlag_, nobs_, critical_values_, icbest_ = adfuller(residuals_final)
print(f"ADF: {adf_b4:.6f}")
print(f"p-value: {pvalue_b4:.9f}")
ADF: -4.862790
p-value: 0.000041182


Model 1: Smoothing

I first performed Smoothing as a model option/baseline. I chose Triple Exponential Smoothing, as that is the best option when there is both trend and seasonality.

Smoothing calculates a moving average in order to reduce noise. Exponential Smoothing weights the more recent lags more heavily. Triple Exponential Smoothing (Holt-Winters) uses 3 components to smooth the level, the trend, and the seasonality.
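
To illustrate the recency weighting, below is a minimal sketch of simple (single) exponential smoothing, the building block that Triple Exponential Smoothing extends with trend and seasonal components; the alpha value here is purely illustrative:

# Each smoothed value is a weighted average of the current observation and the
# previous smoothed value, so older observations decay geometrically:
#   s_t = alpha * y_t + (1 - alpha) * s_{t-1}
def simple_exp_smooth(values, alpha=0.5):
    smoothed = [values[0]]            # initialize with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

print(simple_exp_smooth([10, 12, 8, 11, 9], alpha=0.5))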

In [52]:
monthly_df.head()
Out[52]:
Import Date Weight Spend Price per Pound Year Month
0 2020-01-01 77863938 110827113 1.423343 2020 Jan
1 2020-02-01 118967413 183396139 1.541566 2020 Feb
2 2020-03-01 117594405 160633331 1.365995 2020 Mar
3 2020-04-01 72523979 109295892 1.507031 2020 Apr
4 2020-05-01 58180114 71059481 1.221371 2020 May
In [53]:
data = pd.DataFrame({
    'Import Date': monthly_df['Import Date'],
    'Weight': monthly_df['Weight']
})
In [54]:
data.head()
Out[54]:
Import Date Weight
0 2020-01-01 77863938
1 2020-02-01 118967413
2 2020-03-01 117594405
3 2020-04-01 72523979
4 2020-05-01 58180114
In [55]:
train = data[:-5]
test = data[-5:]
In [56]:
triple = ExponentialSmoothing(train['Weight'],
                              trend="additive",damped_trend=True,
                              seasonal="additive",
                              seasonal_periods=12).fit(optimized=True)
triple_preds = triple.forecast(len(test['Weight']))

The untuned Exponential Smoothing performs very poorly, greatly overestimating the downward trend:

In [57]:
plt.plot(train['Import Date'], train['Weight'], color=custom_palette[2], label="train")
plt.plot(test['Import Date'], test['Weight'], color=custom_palette[1], linestyle="--", label="test")
plt.plot(test['Import Date'], triple_preds, color=custom_palette[0], linestyle="--", label="predictions")

def millions(x, pos):
    return f'{int(x/1e6)}M'

plt.gca().yaxis.set_major_formatter(FuncFormatter(millions))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing")
plt.grid(alpha=0.3)
plt.show()

Below is a rolling CV for the model. It forecasts a 3-month horizon (add one month of data, predict the next 3 months, add another month, predict again, etc.), starting with a cutoff at month 45 (Sep '23) and predicting through the 60th month of data (Dec '24):

Reported are the forecast vs. actual (yhat vs. y), the absolute error (the difference between the two), and the absolute percentage error (error/actual):

In [58]:
horizon = 3
exp_ts_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train_series = data['Weight'].iloc[:cutoff + 1]

    triple = ExponentialSmoothing(
        train_series,
        trend="additive", damped_trend=True,
        seasonal="additive", seasonal_periods=12
    ).fit(optimized=True)

    yhat = triple.forecast(steps=horizon)
    y_true = data['Weight'].iloc[cutoff + 1 : cutoff + 1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        exp_ts_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

exp_ts_cv_df = pd.DataFrame(exp_ts_results)

exp_ts_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[58]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 6,902,291 8,270,097 55%
1 44 2 09-01-2023 11-01-2023 11,741,002 3,755,247 7,985,755 68%
2 44 3 09-01-2023 12-01-2023 17,305,596 16,758,376 547,220 3%
3 45 1 10-01-2023 11-01-2023 11,741,002 6,045,882 5,695,120 49%
4 45 2 10-01-2023 12-01-2023 17,305,596 19,019,260 1,713,664 10%
5 45 3 10-01-2023 01-01-2024 64,702,033 66,344,501 1,642,468 3%
6 46 1 11-01-2023 12-01-2023 17,305,596 20,419,934 3,114,338 18%
7 46 2 11-01-2023 01-01-2024 64,702,033 67,744,089 3,042,056 5%
8 46 3 11-01-2023 02-01-2024 65,375,575 44,310,438 21,065,137 32%
9 47 1 12-01-2023 01-01-2024 64,702,033 65,781,660 1,079,627 2%
10 47 2 12-01-2023 02-01-2024 65,375,575 49,667,363 15,708,212 24%
11 47 3 12-01-2023 03-01-2024 62,364,919 83,586,898 21,221,979 34%
12 48 1 01-01-2024 02-01-2024 65,375,575 49,337,538 16,038,037 25%
13 48 2 01-01-2024 03-01-2024 62,364,919 83,257,077 20,892,158 33%
14 48 3 01-01-2024 04-01-2024 34,933,233 88,121,981 53,188,748 152%
15 49 1 02-01-2024 03-01-2024 62,364,919 88,140,823 25,775,904 41%
16 49 2 02-01-2024 04-01-2024 34,933,233 93,005,732 58,072,499 166%
17 49 3 02-01-2024 05-01-2024 35,504,037 69,119,308 33,615,271 95%
18 50 1 03-01-2024 04-01-2024 34,933,233 85,707,057 50,773,824 145%
19 50 2 03-01-2024 05-01-2024 35,504,037 61,820,637 26,316,600 74%
20 50 3 03-01-2024 06-01-2024 26,935,635 52,160,066 25,224,431 94%
21 51 1 04-01-2024 05-01-2024 35,504,037 43,120,813 7,616,776 21%
22 51 2 04-01-2024 06-01-2024 26,935,635 33,690,321 6,754,686 25%
23 51 3 04-01-2024 07-01-2024 19,175,494 854,284 18,321,210 96%
24 52 1 05-01-2024 06-01-2024 26,935,635 29,169,740 2,234,105 8%
25 52 2 05-01-2024 07-01-2024 19,175,494 -3,643,501 22,818,995 119%
26 52 3 05-01-2024 08-01-2024 21,449,146 -6,523,088 27,972,234 130%
27 53 1 06-01-2024 07-01-2024 19,175,494 -4,737,622 23,913,116 125%
28 53 2 06-01-2024 08-01-2024 21,449,146 -7,620,189 29,069,335 136%
29 53 3 06-01-2024 09-01-2024 23,828,291 2,650,308 21,177,983 89%
30 54 1 07-01-2024 08-01-2024 21,449,146 1,861,780 19,587,366 91%
31 54 2 07-01-2024 09-01-2024 23,828,291 12,006,067 11,822,224 50%
32 54 3 07-01-2024 10-01-2024 20,121,692 -3,091,509 23,213,201 115%
33 55 1 08-01-2024 09-01-2024 23,828,291 23,815,662 12,629 0%
34 55 2 08-01-2024 10-01-2024 20,121,692 9,571,620 10,550,072 52%
35 55 3 08-01-2024 11-01-2024 23,311,088 5,205,149 18,105,939 78%
36 56 1 09-01-2024 10-01-2024 20,121,692 9,577,936 10,543,756 52%
37 56 2 09-01-2024 11-01-2024 23,311,088 5,211,465 18,099,623 78%
38 56 3 09-01-2024 12-01-2024 31,759,202 12,984,396 18,774,806 59%

The MAPEs (mean absolute percentage error) are very high and increase from the one month out prediction to the 3 month out prediction:

In [59]:
exp_ts_cv = exp_ts_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

exp_ts_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[59]:
  MAE MAPE
horizon    
1 13,434,977 49%
2 17,911,221 65%
3 21,851,587 75%

The overall MAPE for the whole model is 63%, which is very high:

In [60]:
exp_ts_overall_MAE  = exp_ts_cv_df['AE'].mean()
exp_ts_overall_MAPE = exp_ts_cv_df['APE'].mean()

print(f"MAE: {exp_ts_overall_MAE:,.0f}")
print(f"MAPE (%): {exp_ts_overall_MAPE:.0%}")
MAE: 17,732,595
MAPE (%): 63%

Plot forecast (yhat) vs. actuals (y) for ALL data (not just a train/test split like above) to see model fit:

In [61]:
triple = ExponentialSmoothing(
    data['Weight'],
    trend="additive", damped_trend=True,
    seasonal="additive", seasonal_periods=12
).fit(optimized=True)

fitted = triple.fittedvalues

forecast_steps = 3
forecast = triple.forecast(steps=forecast_steps)

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(
    start=last_date + pd.offsets.MonthBegin(1),
    periods=forecast_steps,
    freq='MS'
)

combined_preds = np.concatenate([fitted.values, forecast.values])
combined_dates = pd.concat([data['Import Date'], pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))

plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")

plt.plot(combined_dates, combined_preds, color=custom_palette[0], linestyle='--', label="model fit + forecast")

plt.axvspan(future_dates[0], future_dates[-1], color=custom_palette[0], alpha=0.1)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()

Next I tuned the Triple Exponential Smoothing by making the seasonality multiplicative. The data does look like it has a fair multiplicative effect, in that the seasonal jumps decrease in magnitude as the overall level falls. This adjustment improves the model greatly:
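
Before the full rolling CV, a quick (rough) way to sanity-check additive vs. multiplicative seasonality is to fit both specifications on the training data and compare in-sample AIC; this is only a sketch and a rough guide, not a substitute for the CV below:

# Compare information criteria for the two seasonal specifications
for seas in ["additive", "multiplicative"]:
    fit = ExponentialSmoothing(
        train['Weight'],
        trend="additive", damped_trend=True,
        seasonal=seas, seasonal_periods=12
    ).fit(optimized=True)
    print(f"{seas:15s} AIC: {fit.aic:,.1f}")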

In [62]:
triple_tuned = ExponentialSmoothing(train['Weight'],
                              trend="additive",damped_trend=True,
                              seasonal="multiplicative",
                              seasonal_periods=12).fit(optimized=True)
triple_tuned_preds = triple_tuned.forecast(len(test['Weight']))

The forecast for the last 3 months greatly improves vs. the previous untuned version:

In [63]:
plt.plot(train['Import Date'], train['Weight'], color=custom_palette[2], label="train")
plt.plot(test['Import Date'], test['Weight'], color=custom_palette[1], linestyle="--", label="test")
plt.plot(test['Import Date'], triple_tuned_preds, color=custom_palette[0], linestyle="--", label="predictions")

def millions(x, pos):
    return f'{int(x/1e6)}M'

plt.gca().yaxis.set_major_formatter(FuncFormatter(millions))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (tuned)")
plt.grid(alpha=0.3)
plt.show()
In [64]:
horizon = 3
exp_ts_tuned_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data['Weight'].iloc[:cutoff + 1]

    triple_tuned = ExponentialSmoothing(
        train,
        trend="additive", damped_trend=True,
        seasonal="multiplicative", seasonal_periods=12
    ).fit(optimized=True)

    yhat = triple_tuned.forecast(steps=horizon)
    y_true = data['Weight'].iloc[cutoff + 1 : cutoff + 1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        exp_ts_tuned_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

exp_ts_tuned_cv_df = pd.DataFrame(exp_ts_tuned_results)
exp_ts_tuned_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[64]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 19,758,979 4,586,591 30%
1 44 2 09-01-2023 11-01-2023 11,741,002 18,827,434 7,086,432 60%
2 44 3 09-01-2023 12-01-2023 17,305,596 20,571,930 3,266,334 19%
3 45 1 10-01-2023 11-01-2023 11,741,002 17,314,661 5,573,659 47%
4 45 2 10-01-2023 12-01-2023 17,305,596 18,983,433 1,677,837 10%
5 45 3 10-01-2023 01-01-2024 64,702,033 44,986,142 19,715,891 30%
6 46 1 11-01-2023 12-01-2023 17,305,596 16,869,278 436,318 3%
7 46 2 11-01-2023 01-01-2024 64,702,033 39,890,376 24,811,657 38%
8 46 3 11-01-2023 02-01-2024 65,375,575 39,351,743 26,023,832 40%
9 47 1 12-01-2023 01-01-2024 64,702,033 40,252,571 24,449,462 38%
10 47 2 12-01-2023 02-01-2024 65,375,575 39,708,749 25,666,826 39%
11 47 3 12-01-2023 03-01-2024 62,364,919 52,317,225 10,047,694 16%
12 48 1 01-01-2024 02-01-2024 65,375,575 47,579,349 17,796,226 27%
13 48 2 01-01-2024 03-01-2024 62,364,919 62,858,839 493,920 1%
14 48 3 01-01-2024 04-01-2024 34,933,233 54,889,734 19,956,501 57%
15 49 1 02-01-2024 03-01-2024 62,364,919 70,265,094 7,900,175 13%
16 49 2 02-01-2024 04-01-2024 34,933,233 61,276,429 26,343,196 75%
17 49 3 02-01-2024 05-01-2024 35,504,037 49,089,874 13,585,837 38%
18 50 1 03-01-2024 04-01-2024 34,933,233 59,021,731 24,088,498 69%
19 50 2 03-01-2024 05-01-2024 35,504,037 47,261,433 11,757,396 33%
20 50 3 03-01-2024 06-01-2024 26,935,635 43,538,778 16,603,143 62%
21 51 1 04-01-2024 05-01-2024 35,504,037 41,299,719 5,795,682 16%
22 51 2 04-01-2024 06-01-2024 26,935,635 38,068,462 11,132,827 41%
23 51 3 04-01-2024 07-01-2024 19,175,494 21,628,102 2,452,608 13%
24 52 1 05-01-2024 06-01-2024 26,935,635 36,359,652 9,424,017 35%
25 52 2 05-01-2024 07-01-2024 19,175,494 20,666,188 1,490,694 8%
26 52 3 05-01-2024 08-01-2024 21,449,146 20,135,102 1,314,044 6%
27 53 1 06-01-2024 07-01-2024 19,175,494 18,916,032 259,462 1%
28 53 2 06-01-2024 08-01-2024 21,449,146 18,452,365 2,996,781 14%
29 53 3 06-01-2024 09-01-2024 23,828,291 22,018,223 1,810,068 8%
30 54 1 07-01-2024 08-01-2024 21,449,146 18,532,698 2,916,448 14%
31 54 2 07-01-2024 09-01-2024 23,828,291 22,114,177 1,714,114 7%
32 54 3 07-01-2024 10-01-2024 20,121,692 16,034,955 4,086,737 20%
33 55 1 08-01-2024 09-01-2024 23,828,291 23,203,068 625,223 3%
34 55 2 08-01-2024 10-01-2024 20,121,692 16,811,767 3,309,925 16%
35 55 3 08-01-2024 11-01-2024 23,311,088 16,044,099 7,266,989 31%
36 56 1 09-01-2024 10-01-2024 20,121,692 16,951,574 3,170,118 16%
37 56 2 09-01-2024 11-01-2024 23,311,088 16,176,222 7,134,866 31%
38 56 3 09-01-2024 12-01-2024 31,759,202 17,721,125 14,038,077 44%

The MAPEs drastically decrease and fall in an acceptable range (20-30%) for volatile, seasonal data:

In [65]:
exp_ts_tuned_cv = exp_ts_tuned_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

exp_ts_tuned_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[65]:
  MAE MAPE
horizon    
1 8,232,452 24%
2 9,662,806 29%
3 10,782,135 30%

The overall MAPE is 27%:

In [66]:
exp_ts_tuned_overall_MAE  = exp_ts_tuned_cv_df['AE'].mean()
exp_ts_tuned_overall_MAPE = exp_ts_tuned_cv_df['APE'].mean()
print(f"MAE: {exp_ts_tuned_overall_MAE:,.0f}")
print(f"MAPE: {exp_ts_tuned_overall_MAPE:.0%}")
MAE: 9,559,131
MAPE: 27%

I want to see the full fitted model (fit vs. actual) and the forecast (shaded region) together, so I plot them below:

Fitted values: “What would I have predicted historically?”

Forecasts: “What do I predict now?”

The model fit looks very good given the volatility:

In [67]:
triple_tuned = ExponentialSmoothing(
    data['Weight'],
    trend="additive",
    damped_trend=True,
    seasonal="multiplicative",
    seasonal_periods=12
).fit(optimized=True)

fitted = triple_tuned.fittedvalues

forecast_steps = 3
forecast = triple_tuned.forecast(steps=forecast_steps)

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(
    start=last_date + pd.offsets.MonthBegin(1),
    periods=forecast_steps,
    freq='MS'
)

combined_preds = np.concatenate([fitted.values, forecast.values])
combined_dates = pd.concat([data['Import Date'], pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))

plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")

plt.plot(combined_dates, combined_preds, color=custom_palette[0], linestyle='--', label="model fit + forecast")

plt.axvspan(future_dates[0], future_dates[-1], color=custom_palette[0], alpha=0.1)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (Tuned)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()

The Exponential Smoothing model does not provide confidence intervals, so I have manually bootstrapped them below (and later switch to ETS to replicate the model and report its built-in CIs as another option):

Bootstrapped CI:

Note: I use an 80% CI to get a narrower, more practical forecast range, since the 95% CIs were very wide and uninformative. This is also consistent with the default interval from Prophet.
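
The ETS option mentioned above would look roughly like the sketch below, using statsmodels' ETSModel (imported earlier) with the same damped-additive-trend, multiplicative-seasonal structure; the error form chosen here and the exact prediction-interval API/column names are assumptions and can vary by statsmodels version, so treat this as an outline rather than the final implementation:

# ETS state-space version of the Holt-Winters model with built-in prediction intervals
ets = ETSModel(
    data['Weight'].astype(float),
    error="mul",                      # multiplicative error assumed for this sketch
    trend="add", damped_trend=True,
    seasonal="mul", seasonal_periods=12
).fit()

pred = ets.get_prediction(start=len(data), end=len(data) + 2)  # next 3 months
print(pred.summary_frame(alpha=0.2))  # point forecast with ~80% interval bounds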

In [68]:
def bootstrap_hw_intervals(model,train_values, horizon, n_boot=500):
    residuals = model.resid
    forecasts = np.zeros((n_boot, horizon))

    for i in range(n_boot):
        boot_sample = train_values + np.random.choice(residuals, size=len(train_values))
        boot_sample = np.maximum(boot_sample, 1e-6)

        boot_model = ExponentialSmoothing(
            boot_sample,
            trend="additive",
            damped_trend=True,
            seasonal="multiplicative",
            seasonal_periods=12
        ).fit(optimized=True)

        forecasts[i, :] = boot_model.forecast(steps=horizon)

    lower = np.percentile(forecasts, 10, axis=0)
    upper = np.percentile(forecasts, 90, axis=0)

    return pd.DataFrame({
        'horizon': range(1, horizon + 1),
        'point_forecast': model.forecast(steps=horizon),
        'ci_lower': lower,
        'ci_upper': upper
    })

Fit Holt-Winters Exponential Smoothing on full history:

In [69]:
np.random.seed(42)

train_values = data['Weight'].values
final_hw = ExponentialSmoothing(
    train_values,
    trend="additive", damped_trend=True,
    seasonal="multiplicative", seasonal_periods=12
).fit(optimized=True)

ci_df = bootstrap_hw_intervals(final_hw, train_values, horizon=3, n_boot=500)
In [70]:
ci_df.style.format("{:,.0f}")
Out[70]:
  horizon point_forecast ci_lower ci_upper
0 1 59,915,877 31,652,931 67,112,959
1 2 59,477,191 30,411,924 65,507,608
2 3 75,993,847 38,341,421 84,995,670
In [71]:
future_dates = pd.date_range(
    start=data['Import Date'].iloc[-1] + pd.offsets.MonthBegin(1),
    periods=3,
    freq='MS'
)

forecast_table = pd.DataFrame({
    'Date': future_dates,
    'Forecast': ci_df['point_forecast'].round().astype(int),
    'Lower_80%_CI': np.floor(ci_df['ci_lower']).astype(int),
    'Upper_80%_CI': np.ceil(ci_df['ci_upper']).astype(int)
})
print(forecast_table)
        Date  Forecast  Lower_80%_CI  Upper_80%_CI
0 2025-01-01  59915877      31652931      67112960
1 2025-02-01  59477191      30411924      65507609
2 2025-03-01  75993847      38341421      84995671
In [72]:
triple_tuned = ExponentialSmoothing(
    data['Weight'],
    trend="additive",
    damped_trend=True,
    seasonal="multiplicative",
    seasonal_periods=12
).fit(optimized=True)

fitted = triple_tuned.fittedvalues
forecast_steps = 3
forecast = triple_tuned.forecast(steps=forecast_steps)

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(
    start=last_date + pd.offsets.MonthBegin(1),
    periods=forecast_steps,
    freq='MS'
)

combined_preds = np.concatenate([fitted.values, forecast.values])
combined_dates = pd.concat([data['Import Date'], pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))
plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")
plt.plot(combined_dates, combined_preds, color=custom_palette[0], linestyle='--',
         label="model fit + forecast")

plt.fill_between(
    future_dates,
    ci_df['ci_lower'].values,
    ci_df['ci_upper'].values,
    color=custom_palette[0],
    alpha=0.2,
    label='80% Confidence Interval'
)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))
plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (Tuned)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()

Log transformation:

Since the data is volatile, I also try out log transforming the data:

This showed no improvement over the Tuned Triple Exponential Smoothing with original data:

In [73]:
train = data['Weight'][:-5]
test = data['Weight'][-5:]

triple_tuned_log = ExponentialSmoothing(np.log1p(train),
                              trend="additive",damped_trend=True,
                              seasonal="multiplicative",
                              seasonal_periods=12).fit(optimized=True)
triple_tuned_log_preds = np.expm1(triple_tuned_log.forecast(len(test)))
In [74]:
train = data[:-5]
test = data[-5:]
In [75]:
plt.plot(train['Import Date'], train['Weight'], color=custom_palette[2], label="train")
plt.plot(test['Import Date'], test['Weight'], color=custom_palette[1], linestyle="--", label="test")
plt.plot(test['Import Date'], triple_tuned_log_preds, color=custom_palette[0], linestyle="--", label="predictions")

def millions(x, pos):
    return f'{int(x/1e6)}M'

plt.gca().yaxis.set_major_formatter(FuncFormatter(millions))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (tuned log data)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()
In [76]:
horizon = 3
exp_ts_tuned_log_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train_log = np.log1p(data['Weight'].iloc[:cutoff + 1])

    triple_tuned_log = ExponentialSmoothing(
        train_log,
        trend="additive", damped_trend=True,
        seasonal="multiplicative", seasonal_periods=12
    ).fit(optimized=True)

    yhat_log = triple_tuned_log.forecast(steps=horizon)
    yhat = np.expm1(yhat_log)

    y_true = data['Weight'].iloc[cutoff + 1 : cutoff + 1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        exp_ts_tuned_log_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

exp_ts_tuned_log_cv_df = pd.DataFrame(exp_ts_tuned_log_results)
exp_ts_tuned_log_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[76]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 18,717,066 3,544,678 23%
1 44 2 09-01-2023 11-01-2023 11,741,002 16,070,641 4,329,639 37%
2 44 3 09-01-2023 12-01-2023 17,305,596 18,858,823 1,553,227 9%
3 45 1 10-01-2023 11-01-2023 11,741,002 14,724,691 2,983,689 25%
4 45 2 10-01-2023 12-01-2023 17,305,596 17,270,935 34,661 0%
5 45 3 10-01-2023 01-01-2024 64,702,033 39,771,856 24,930,177 39%
6 46 1 11-01-2023 12-01-2023 17,305,596 15,722,629 1,582,967 9%
7 46 2 11-01-2023 01-01-2024 64,702,033 36,045,965 28,656,068 44%
8 46 3 11-01-2023 02-01-2024 65,375,575 33,847,619 31,527,956 48%
9 47 1 12-01-2023 01-01-2024 64,702,033 37,501,335 27,200,698 42%
10 47 2 12-01-2023 02-01-2024 65,375,575 35,207,647 30,167,928 46%
11 47 3 12-01-2023 03-01-2024 62,364,919 45,811,770 16,553,149 27%
12 48 1 01-01-2024 02-01-2024 65,375,575 42,546,741 22,828,834 35%
13 48 2 01-01-2024 03-01-2024 62,364,919 55,507,300 6,857,619 11%
14 48 3 01-01-2024 04-01-2024 34,933,233 53,048,403 18,115,170 52%
15 49 1 02-01-2024 03-01-2024 62,364,919 65,023,435 2,658,516 4%
16 49 2 02-01-2024 04-01-2024 34,933,233 62,077,162 27,143,929 78%
17 49 3 02-01-2024 05-01-2024 35,504,037 47,850,759 12,346,722 35%
18 50 1 03-01-2024 04-01-2024 34,933,233 61,005,118 26,071,885 75%
19 50 2 03-01-2024 05-01-2024 35,504,037 47,037,305 11,533,268 32%
20 50 3 03-01-2024 06-01-2024 26,935,635 41,935,796 15,000,161 56%
21 51 1 04-01-2024 05-01-2024 35,504,037 39,710,590 4,206,553 12%
22 51 2 04-01-2024 06-01-2024 26,935,635 35,457,392 8,521,757 32%
23 51 3 04-01-2024 07-01-2024 19,175,494 20,560,354 1,384,860 7%
24 52 1 05-01-2024 06-01-2024 26,935,635 34,303,281 7,367,646 27%
25 52 2 05-01-2024 07-01-2024 19,175,494 19,913,667 738,173 4%
26 52 3 05-01-2024 08-01-2024 21,449,146 19,408,741 2,040,405 10%
27 53 1 06-01-2024 07-01-2024 19,175,494 18,493,004 682,490 4%
28 53 2 06-01-2024 08-01-2024 21,449,146 18,030,222 3,418,924 16%
29 53 3 06-01-2024 09-01-2024 23,828,291 19,633,621 4,194,670 18%
30 54 1 07-01-2024 08-01-2024 21,449,146 18,245,027 3,204,119 15%
31 54 2 07-01-2024 09-01-2024 23,828,291 19,868,026 3,960,265 17%
32 54 3 07-01-2024 10-01-2024 20,121,692 14,249,175 5,872,517 29%
33 55 1 08-01-2024 09-01-2024 23,828,291 20,858,982 2,969,309 12%
34 55 2 08-01-2024 10-01-2024 20,121,692 14,943,318 5,178,374 26%
35 55 3 08-01-2024 11-01-2024 23,311,088 12,556,075 10,755,013 46%
36 56 1 09-01-2024 10-01-2024 20,121,692 15,513,082 4,608,610 23%
37 56 2 09-01-2024 11-01-2024 23,311,088 13,028,289 10,282,799 44%
38 56 3 09-01-2024 12-01-2024 31,759,202 16,112,426 15,646,776 49%
In [77]:
exp_ts_tuned_log_cv = exp_ts_tuned_log_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

exp_ts_tuned_log_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[77]:
  MAE MAPE
horizon    
1 8,454,615 24%
2 10,832,569 30%
3 12,301,600 33%

The performance is slightly worse than without the log transformation:

In [78]:
exp_ts_tuned_log_overall_MAE  = exp_ts_tuned_log_cv_df['AE'].mean()
exp_ts_tuned_log_overall_MAPE = exp_ts_tuned_log_cv_df['APE'].mean()

print(f"MAE: {exp_ts_tuned_log_overall_MAE:,.0f}")
print(f"MAPE (%): {exp_ts_tuned_log_overall_MAPE:.0%}")
MAE: 10,529,595
MAPE (%): 29%
In [79]:
triple_tuned_log = ExponentialSmoothing(
    np.log1p(data['Weight']),
    trend="additive",
    damped_trend=True,
    seasonal="multiplicative",
    seasonal_periods=12
).fit(optimized=True)

fitted_log = np.expm1(triple_tuned_log.fittedvalues)

forecast_steps = 3
forecast_log = np.expm1(triple_tuned_log.forecast(steps=forecast_steps))

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(
    start=last_date + pd.offsets.MonthBegin(1),
    periods=forecast_steps,
    freq='MS'
)

combined_preds_log = np.concatenate([fitted_log.values, forecast_log.values])
combined_dates_log = pd.concat([data['Import Date'], pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))

plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")

plt.plot(combined_dates_log, combined_preds_log, color=custom_palette[0], linestyle='--', label="model fit + forecast")

plt.axvspan(future_dates[0], future_dates[-1], color=custom_palette[0], alpha=0.1)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))

plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (Tuned - Log Data)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()

Add CI:

In [80]:
def bootstrap_hw_intervals(model,train_values, horizon, n_boot=500):
    residuals = model.resid
    forecasts = np.zeros((n_boot, horizon))

    for i in range(n_boot):
        boot_sample = train_values + np.random.choice(residuals, size=len(train_values))
        boot_sample = np.maximum(boot_sample, 1e-6)

        boot_model = ExponentialSmoothing(
            boot_sample,
            trend="additive",
            damped_trend=True,
            seasonal="multiplicative",
            seasonal_periods=12
        ).fit(optimized=True)

        forecasts[i, :] = np.expm1(boot_model.forecast(steps=horizon))

    lower = np.percentile(forecasts, 10, axis=0)
    upper = np.percentile(forecasts, 90, axis=0)

    return pd.DataFrame({
        'horizon': range(1, horizon + 1),
        'point_forecast': np.expm1(model.forecast(steps=horizon)),
        'ci_lower': lower,
        'ci_upper': upper
    })
In [81]:
np.random.seed(42)

train_values = np.log1p(data['Weight'].values)
final_hw = ExponentialSmoothing(
    train_values,
    trend="additive", damped_trend=True,
    seasonal="multiplicative", seasonal_periods=12
).fit(optimized=True)

ci_df = bootstrap_hw_intervals(final_hw, train_values, horizon=3, n_boot=500)
In [82]:
ci_df.style.format("{:.0f}")
Out[82]:
  horizon point_forecast ci_lower ci_upper
0 1 69648880 42252001 73726954
1 2 66175597 40121527 69833012
2 3 81596208 50389155 86459868
In [83]:
future_dates = pd.date_range(
    start=data['Import Date'].iloc[-1] + pd.offsets.MonthBegin(1),
    periods=3,
    freq='MS'
)

forecast_table = pd.DataFrame({
    'Date': future_dates,
    'Forecast': ci_df['point_forecast'].round().astype(int),
    'Lower_80_CI': np.floor(ci_df['ci_lower']).astype(int),
    'Upper_80_CI': np.ceil(ci_df['ci_upper']).astype(int)
})
print(forecast_table)
        Date  Forecast  Lower_80_CI  Upper_80_CI
0 2025-01-01  69648880     42252001     73726955
1 2025-02-01  66175597     40121527     69833012
2 2025-03-01  81596208     50389155     86459868
In [84]:
triple_tuned_log = ExponentialSmoothing(
    np.log1p(data['Weight']),
    trend="additive",
    damped_trend=True,
    seasonal="multiplicative",
    seasonal_periods=12
).fit(optimized=True)

fitted = np.expm1(triple_tuned_log.fittedvalues)
forecast_steps = 3
forecast = np.expm1(triple_tuned_log.forecast(steps=forecast_steps))

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(
    start=last_date + pd.offsets.MonthBegin(1),
    periods=forecast_steps,
    freq='MS'
)

combined_preds_log = np.concatenate([fitted.values, forecast.values])
combined_dates_log = pd.concat([data['Import Date'], pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))
plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")
plt.plot(combined_dates_log, combined_preds_log, color=custom_palette[0], linestyle='--',
         label="model fit + forecast")

plt.fill_between(
    future_dates,
    ci_df['ci_lower'].values,
    ci_df['ci_upper'].values,
    color=custom_palette[0],
    alpha=0.2,
    label='80% Confidence Interval'
)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))
plt.legend(loc='upper left')
plt.title("Triple Exponential Smoothing (Tuned - Log Data)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()
[Plot: actuals vs. model fit and forecast with 80% confidence interval]

Model 2: SARIMA

I next chose the SARIMA model to see if performance improved. It seemed like a good option given that the data is nonstationary and there is seasonality:

From the earlier plots (non-log data): the ACF has significant spikes at lags 1, 2, and 12, and the PACF at lags 1 and 12.

Model | Plot | Observation
AR(p) | PACF | Cuts off after lag p
MA(q) | ACF  | Cuts off after lag q

PACF lag 1 → p = 1

ACF lag 1–2 → q = 2

Non-seasonal parameters: (p, q) = (1, 2). d in (p,d,q) will be determined from differencing below.

PACF at lag 12 → seasonal AR → P = 1

ACF at lag 12 → seasonal MA → Q = 1

There were no spikes at lags 24 or 36, so higher seasonal orders are not considered.

Seasonal differencing D: set D = 1, given the strong spike in the ACF at lag 12.

The results (including the differencing tests below) point to (1,1,2)(1,1,1,12) as the ideal orders for the non-log data.
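
To double-check these lag readings numerically rather than by eye, the ACF/PACF values can be printed directly (a minimal sketch using statsmodels' acf/pacf; the set of lags inspected here is my own choice):

from statsmodels.tsa.stattools import acf, pacf

# numeric ACF/PACF values to confirm which lags stand out
acf_vals = acf(data['Weight'], nlags=24)
pacf_vals = pacf(data['Weight'], nlags=24)
for lag in [1, 2, 12, 24]:
    print(f"lag {lag:>2}: ACF={acf_vals[lag]:.2f}  PACF={pacf_vals[lag]:.2f}")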

In [85]:
data.head()
Out[85]:
Import Date Weight
0 2020-01-01 77863938
1 2020-02-01 118967413
2 2020-03-01 117594405
3 2020-04-01 72523979
4 2020-05-01 58180114
In [86]:
plot_acf(data['Weight'], lags=24)
plot_pacf(data['Weight'], lags=24)
plt.show()
[Plots: ACF and PACF of Weight (24 lags)]
In [87]:
data['y_diff'] = data['Weight'].diff()
In [88]:
diff_for_df_test = pd.DataFrame(data, columns=['Import Date','y_diff'])
In [89]:
diff_for_df_test.head()
Out[89]:
Import Date y_diff
0 2020-01-01 NaN
1 2020-02-01 41103475.0
2 2020-03-01 -1373008.0
3 2020-04-01 -45070426.0
4 2020-05-01 -14343865.0

The series passes the Augmented Dickey-Fuller test after a single difference (it is now stationary), so no further differencing is needed. Set d = 1.

In [90]:
adf_b4, pvalue_b4, usedlag_, nobs_, critical_values_, icbest_ = adfuller(diff_for_df_test['y_diff'].dropna())

print(f"ADF: {adf_b4:.2f}")
print(f"p-value: {pvalue_b4:.8f}")
ADF: -5.83
p-value: 0.00000040

The ideal parameters appear to be (2,1,1) (1,1,1,12) for log data:

In [91]:
plot_acf(np.log1p(data['Weight']), lags=24)
plot_pacf(np.log1p(data['Weight']), lags=24)
plt.show()
[Plots: ACF and PACF of log-transformed Weight (24 lags)]

Differencing with Log (to validate d=1):

In [92]:
data_log = data.copy()
In [93]:
data_log['y_diff'] = np.log1p(data_log['Weight']).diff()
In [94]:
diff_for_df_test_log = pd.DataFrame(data_log, columns=['Import Date','y_diff'])
In [95]:
diff_for_df_test_log.head()
Out[95]:
Import Date y_diff
0 2020-01-01 NaN
1 2020-02-01 0.423887
2 2020-03-01 -0.011608
3 2020-04-01 -0.483324
4 2020-05-01 -0.220374
In [96]:
adf_b4, pvalue_b4, usedlag_, nobs_, critical_values_, icbest_ = adfuller(diff_for_df_test_log['y_diff'].dropna())

print(f"ADF: {adf_b4:.2f}")
print(f"p-value: {pvalue_b4:.4f}")
ADF: -3.08
p-value: 0.0282

d = 1, since one difference was enough to reach stationarity.
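
As a cross-check on d and D, pmdarima's built-in differencing estimators can be run directly (a small sketch; the choice of tests here simply mirrors common defaults rather than anything tuned for this series):

from pmdarima.arima.utils import ndiffs, nsdiffs

# estimate the required regular and seasonal differencing orders
print("suggested d:", ndiffs(data['Weight'], test='adf'))
print("suggested D:", nsdiffs(data['Weight'], m=12, test='ocsb'))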

Manual model with parameters from above:

In [97]:
sar_manual = sm.tsa.statespace.SARIMAX(data['Weight'],
                                order=(1,1,2),
                                seasonal_order=(1,1,1,12),
                                trend='c').fit()
In [98]:
sm.tsa.graphics.plot_acf(sar_manual.resid[sar_manual.loglikelihood_burn:]);
[Plot: ACF of manual SARIMA residuals]
In [99]:
sm.tsa.graphics.plot_pacf(sar_manual.resid[sar_manual.loglikelihood_burn:]);
[Plot: PACF of manual SARIMA residuals]

Diagnostics are pretty good: the residuals look like white noise, the histogram of the residuals shows only a slight right skew, and the Q-Q plot deviates somewhat at the upper right, but there is no residual autocorrelation (a numeric check follows the diagnostics plot below):

In [100]:
sar_manual.plot_diagnostics(figsize = (15,8),lags = 12);
[Plot: SARIMA residual diagnostics (standardized residuals, histogram, Q-Q plot, correlogram)]
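
As a numeric complement to the diagnostic plots, a Ljung-Box test on the residuals quantifies the "no autocorrelation" read (a sketch; the lag choices are my own):

from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box on the SARIMA residuals (skipping the burn-in period);
# large p-values indicate no evidence of remaining autocorrelation
resid = sar_manual.resid[sar_manual.loglikelihood_burn:]
print(acorr_ljungbox(resid, lags=[6, 12], return_df=True))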

Rolling-origin cross-validation of the manual parameter choice (based on the visuals/tests above):

In [101]:
horizon = 3
manual_results = []

# Rolling-origin cross-validation: expand the training window one cutoff (month)
# at a time and forecast the next `horizon` months from each cutoff.
for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sar_manual = sm.tsa.SARIMAX(
        train['Weight'],
        order=(1,1,2),
        seasonal_order=(1,1,1,12),
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = sar_manual.forecast(steps=horizon)
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        manual_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

sarima_manual_cv_df = pd.DataFrame(manual_results)
sarima_manual_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[101]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 -902,180 16,074,568 106%
1 44 2 09-01-2023 11-01-2023 11,741,002 -7,499,335 19,240,337 164%
2 44 3 09-01-2023 12-01-2023 17,305,596 -8,421,289 25,726,885 149%
3 45 1 10-01-2023 11-01-2023 11,741,002 -1,016,610 12,757,612 109%
4 45 2 10-01-2023 12-01-2023 17,305,596 -6,792,540 24,098,136 139%
5 45 3 10-01-2023 01-01-2024 64,702,033 41,323,640 23,378,393 36%
6 46 1 11-01-2023 12-01-2023 17,305,596 -3,223,811 20,529,407 119%
7 46 2 11-01-2023 01-01-2024 64,702,033 40,478,792 24,223,241 37%
8 46 3 11-01-2023 02-01-2024 65,375,575 30,235,118 35,140,457 54%
9 47 1 12-01-2023 01-01-2024 64,702,033 55,283,347 9,418,686 15%
10 47 2 12-01-2023 02-01-2024 65,375,575 47,623,246 17,752,329 27%
11 47 3 12-01-2023 03-01-2024 62,364,919 63,009,160 644,241 1%
12 48 1 01-01-2024 02-01-2024 65,375,575 52,854,378 12,521,197 19%
13 48 2 01-01-2024 03-01-2024 62,364,919 66,474,374 4,109,455 7%
14 48 3 01-01-2024 04-01-2024 34,933,233 65,836,481 30,903,248 88%
15 49 1 02-01-2024 03-01-2024 62,364,919 72,268,490 9,903,571 16%
16 49 2 02-01-2024 04-01-2024 34,933,233 67,093,854 32,160,621 92%
17 49 3 02-01-2024 05-01-2024 35,504,037 38,321,713 2,817,676 8%
18 50 1 03-01-2024 04-01-2024 34,933,233 60,309,139 25,375,906 73%
19 50 2 03-01-2024 05-01-2024 35,504,037 34,852,914 651,123 2%
20 50 3 03-01-2024 06-01-2024 26,935,635 23,152,994 3,782,641 14%
21 51 1 04-01-2024 05-01-2024 35,504,037 21,708,625 13,795,412 39%
22 51 2 04-01-2024 06-01-2024 26,935,635 13,630,593 13,305,042 49%
23 51 3 04-01-2024 07-01-2024 19,175,494 -18,494,471 37,669,965 196%
24 52 1 05-01-2024 06-01-2024 26,935,635 23,568,084 3,367,551 13%
25 52 2 05-01-2024 07-01-2024 19,175,494 -12,111,752 31,287,246 163%
26 52 3 05-01-2024 08-01-2024 21,449,146 -14,159,671 35,608,817 166%
27 53 1 06-01-2024 07-01-2024 19,175,494 -8,456,164 27,631,658 144%
28 53 2 06-01-2024 08-01-2024 21,449,146 -10,905,403 32,354,549 151%
29 53 3 06-01-2024 09-01-2024 23,828,291 -10,683,873 34,512,164 145%
30 54 1 07-01-2024 08-01-2024 21,449,146 3,732,004 17,717,142 83%
31 54 2 07-01-2024 09-01-2024 23,828,291 -3,279,741 27,108,032 114%
32 54 3 07-01-2024 10-01-2024 20,121,692 -15,329,868 35,451,560 176%
33 55 1 08-01-2024 09-01-2024 23,828,291 2,210,775 21,617,516 91%
34 55 2 08-01-2024 10-01-2024 20,121,692 -15,338,954 35,460,646 176%
35 55 3 08-01-2024 11-01-2024 23,311,088 -20,710,154 44,021,242 189%
36 56 1 09-01-2024 10-01-2024 20,121,692 -3,428,949 23,550,641 117%
37 56 2 09-01-2024 11-01-2024 23,311,088 -15,752,620 39,063,708 168%
38 56 3 09-01-2024 12-01-2024 31,759,202 -12,801,763 44,560,965 140%
In [102]:
sarima_manual_cv = sarima_manual_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_manual_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[102]:
  MAE MAPE
horizon    
1 16,481,605 72%
2 23,139,574 99%
3 27,247,558 105%

The MAPE is extremely high, which is surprising given that this was the intuitively ideal combination of parameters. However, this is on the non-log-transformed data.

In [103]:
sarima_manual_overall_MAE  = sarima_manual_cv_df['AE'].mean()
sarima_manual_overall_MAPE = sarima_manual_cv_df['APE'].mean()

print(f"MAE: {sarima_manual_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_manual_overall_MAPE:.0%}")
MAE: 22,289,579
MAPE (%): 92%
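
The same rolling-origin CV loop will be reused for each SARIMA variant below, so it could be factored into a small helper; this is a sketch under my own naming (the log_transform flag mirrors the np.log1p/np.expm1 wrapping used for the log variants later on):

def rolling_sarimax_cv(series, dates, order, seasonal_order,
                       cutoffs=range(44, 60), horizon=3, log_transform=False):
    # expand the training window one cutoff (month) at a time,
    # forecast `horizon` steps ahead, and collect the errors
    rows = []
    for cutoff in cutoffs:
        if cutoff + horizon >= len(series):
            break
        y_train = series.iloc[:cutoff+1]
        if log_transform:
            y_train = np.log1p(y_train)
        fit = sm.tsa.SARIMAX(y_train, order=order, seasonal_order=seasonal_order,
                             trend='c', use_exact_diffuse=True).fit(disp=False)
        yhat = fit.forecast(steps=horizon)
        if log_transform:
            yhat = np.expm1(yhat)
        y_true = series.iloc[cutoff+1: cutoff+1 + horizon]
        for h in range(horizon):
            ae = abs(yhat.iloc[h] - y_true.iloc[h])
            rows.append({'cutoff': cutoff, 'horizon': h + 1,
                         'month': dates.iloc[cutoff + h + 1],
                         'y': y_true.iloc[h], 'yhat': yhat.iloc[h],
                         'AE': ae, 'APE': ae / max(abs(y_true.iloc[h]), 1e-8)})
    return pd.DataFrame(rows)

# e.g. rolling_sarimax_cv(data['Weight'], data['Import Date'], (1,1,2), (1,1,1,12))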

Run auto-ARIMA to either verify the selected parameters or identify more optimal ones:

This "auto" option iterates through candidate orders, with min/max bounds on the parameters based on the earlier decomposition/visuals/tests:

Forcing d and D to be 1 and letting the rest vary:

In [104]:
sarima_auto_model = pm.auto_arima(data['Weight'],
                          start_p=0,
                          d=1,
                          start_q=0,
                          max_p=5,
                          max_d=1,
                          max_q=5,
                          start_P=0,
                          D=1,
                          start_Q=0,
                          max_P=2,
                          max_D=1,
                          max_Q=2,
                          max_order=5,
                          m=12,
                          seasonal=True,
                          stepwise=False,
                          suppress_warnings=True,
                          error_action='trace',
                          trace=True,
                          random_state=42
                          )
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=1739.523, Time=0.00 sec
 ARIMA(0,1,0)(0,1,1)[12]             : AIC=inf, Time=0.03 sec
 ARIMA(0,1,0)(0,1,2)[12]             : AIC=1727.007, Time=0.06 sec
 ARIMA(0,1,0)(1,1,0)[12]             : AIC=1738.266, Time=0.01 sec
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=inf, Time=0.05 sec
 ARIMA(0,1,0)(1,1,2)[12]             : AIC=1727.890, Time=0.08 sec
 ARIMA(0,1,0)(2,1,0)[12]             : AIC=1736.318, Time=0.05 sec
 ARIMA(0,1,0)(2,1,1)[12]             : AIC=1736.740, Time=0.07 sec
 ARIMA(0,1,0)(2,1,2)[12]             : AIC=1727.517, Time=0.12 sec
 ARIMA(0,1,1)(0,1,0)[12]             : AIC=1735.337, Time=0.01 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=1721.028, Time=0.03 sec
 ARIMA(0,1,1)(0,1,2)[12]             : AIC=1720.941, Time=0.08 sec
 ARIMA(0,1,1)(1,1,0)[12]             : AIC=1733.606, Time=0.02 sec
 ARIMA(0,1,1)(1,1,1)[12]             : AIC=1722.198, Time=0.05 sec
 ARIMA(0,1,1)(1,1,2)[12]             : AIC=1721.538, Time=0.10 sec
 ARIMA(0,1,1)(2,1,0)[12]             : AIC=1721.445, Time=0.05 sec
 ARIMA(0,1,1)(2,1,1)[12]             : AIC=1719.400, Time=0.08 sec
 ARIMA(0,1,1)(2,1,2)[12]             : AIC=1721.047, Time=0.18 sec
 ARIMA(0,1,2)(0,1,0)[12]             : AIC=1735.734, Time=0.01 sec
 ARIMA(0,1,2)(0,1,1)[12]             : AIC=1721.761, Time=0.04 sec
 ARIMA(0,1,2)(0,1,2)[12]             : AIC=1721.263, Time=0.05 sec
 ARIMA(0,1,2)(1,1,0)[12]             : AIC=1734.642, Time=0.03 sec
 ARIMA(0,1,2)(1,1,1)[12]             : AIC=1722.765, Time=0.06 sec
 ARIMA(0,1,2)(1,1,2)[12]             : AIC=1721.810, Time=0.08 sec
 ARIMA(0,1,2)(2,1,0)[12]             : AIC=1722.370, Time=0.07 sec
 ARIMA(0,1,2)(2,1,1)[12]             : AIC=1719.626, Time=0.11 sec
 ARIMA(0,1,3)(0,1,0)[12]             : AIC=1738.216, Time=0.01 sec
 ARIMA(0,1,3)(0,1,1)[12]             : AIC=1723.205, Time=0.05 sec
 ARIMA(0,1,3)(0,1,2)[12]             : AIC=inf, Time=0.13 sec
 ARIMA(0,1,3)(1,1,0)[12]             : AIC=1736.785, Time=0.03 sec
 ARIMA(0,1,3)(1,1,1)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(0,1,3)(2,1,0)[12]             : AIC=1724.118, Time=0.08 sec
 ARIMA(0,1,4)(0,1,0)[12]             : AIC=1752.785, Time=0.02 sec
 ARIMA(0,1,4)(0,1,1)[12]             : AIC=inf, Time=0.20 sec
 ARIMA(0,1,4)(1,1,0)[12]             : AIC=1749.054, Time=0.05 sec
 ARIMA(0,1,5)(0,1,0)[12]             : AIC=1749.277, Time=0.03 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=1737.477, Time=0.01 sec
 ARIMA(1,1,0)(0,1,1)[12]             : AIC=1722.759, Time=0.02 sec
 ARIMA(1,1,0)(0,1,2)[12]             : AIC=1723.188, Time=0.08 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=1735.043, Time=0.02 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=1724.120, Time=0.04 sec
 ARIMA(1,1,0)(1,1,2)[12]             : AIC=1723.898, Time=0.10 sec
 ARIMA(1,1,0)(2,1,0)[12]             : AIC=1723.241, Time=0.05 sec
 ARIMA(1,1,0)(2,1,1)[12]             : AIC=1721.606, Time=0.08 sec
 ARIMA(1,1,0)(2,1,2)[12]             : AIC=1723.274, Time=0.18 sec
 ARIMA(1,1,1)(0,1,0)[12]             : AIC=1734.593, Time=0.01 sec
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=1722.593, Time=0.03 sec
 ARIMA(1,1,1)(0,1,2)[12]             : AIC=1722.358, Time=0.09 sec
 ARIMA(1,1,1)(1,1,0)[12]             : AIC=1733.523, Time=0.03 sec
 ARIMA(1,1,1)(1,1,1)[12]             : AIC=1723.656, Time=0.06 sec
 ARIMA(1,1,1)(1,1,2)[12]             : AIC=1723.142, Time=0.16 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=1723.147, Time=0.10 sec
 ARIMA(1,1,1)(2,1,1)[12]             : AIC=1721.285, Time=0.15 sec
 ARIMA(1,1,2)(0,1,0)[12]             : AIC=1737.860, Time=0.02 sec
 ARIMA(1,1,2)(0,1,1)[12]             : AIC=1723.370, Time=0.05 sec
 ARIMA(1,1,2)(0,1,2)[12]             : AIC=1722.906, Time=0.08 sec
 ARIMA(1,1,2)(1,1,0)[12]             : AIC=1736.527, Time=0.04 sec
 ARIMA(1,1,2)(1,1,1)[12]             : AIC=1724.354, Time=0.09 sec
 ARIMA(1,1,2)(2,1,0)[12]             : AIC=1724.165, Time=0.09 sec
 ARIMA(1,1,3)(0,1,0)[12]             : AIC=1740.450, Time=0.03 sec
 ARIMA(1,1,3)(0,1,1)[12]             : AIC=1724.894, Time=0.07 sec
 ARIMA(1,1,3)(1,1,0)[12]             : AIC=1738.903, Time=0.06 sec
 ARIMA(1,1,4)(0,1,0)[12]             : AIC=1754.413, Time=0.05 sec
 ARIMA(2,1,0)(0,1,0)[12]             : AIC=1738.596, Time=0.01 sec
 ARIMA(2,1,0)(0,1,1)[12]             : AIC=1722.994, Time=0.03 sec
 ARIMA(2,1,0)(0,1,2)[12]             : AIC=1723.012, Time=0.10 sec
 ARIMA(2,1,0)(1,1,0)[12]             : AIC=1736.581, Time=0.02 sec
 ARIMA(2,1,0)(1,1,1)[12]             : AIC=1724.229, Time=0.06 sec
 ARIMA(2,1,0)(1,1,2)[12]             : AIC=1723.460, Time=0.11 sec
 ARIMA(2,1,0)(2,1,0)[12]             : AIC=1723.305, Time=0.07 sec
 ARIMA(2,1,0)(2,1,1)[12]             : AIC=1721.126, Time=0.11 sec
 ARIMA(2,1,1)(0,1,0)[12]             : AIC=1736.721, Time=0.02 sec
 ARIMA(2,1,1)(0,1,1)[12]             : AIC=1724.152, Time=0.06 sec
 ARIMA(2,1,1)(0,1,2)[12]             : AIC=1723.896, Time=0.15 sec
 ARIMA(2,1,1)(1,1,0)[12]             : AIC=1735.717, Time=0.06 sec
 ARIMA(2,1,1)(1,1,1)[12]             : AIC=1725.208, Time=0.09 sec
 ARIMA(2,1,1)(2,1,0)[12]             : AIC=1724.794, Time=0.18 sec
 ARIMA(2,1,2)(0,1,0)[12]             : AIC=inf, Time=0.07 sec
 ARIMA(2,1,2)(0,1,1)[12]             : AIC=1725.331, Time=0.08 sec
 ARIMA(2,1,2)(1,1,0)[12]             : AIC=inf, Time=0.25 sec
 ARIMA(2,1,3)(0,1,0)[12]             : AIC=inf, Time=0.11 sec
 ARIMA(3,1,0)(0,1,0)[12]             : AIC=1739.770, Time=0.01 sec
 ARIMA(3,1,0)(0,1,1)[12]             : AIC=1724.591, Time=0.05 sec
 ARIMA(3,1,0)(0,1,2)[12]             : AIC=1724.467, Time=0.13 sec
 ARIMA(3,1,0)(1,1,0)[12]             : AIC=1738.185, Time=0.03 sec
 ARIMA(3,1,0)(1,1,1)[12]             : AIC=1725.766, Time=0.08 sec
 ARIMA(3,1,0)(2,1,0)[12]             : AIC=1724.963, Time=0.05 sec
 ARIMA(3,1,1)(0,1,0)[12]             : AIC=1738.339, Time=0.02 sec
 ARIMA(3,1,1)(0,1,1)[12]             : AIC=1726.101, Time=0.07 sec
 ARIMA(3,1,1)(1,1,0)[12]             : AIC=1737.398, Time=0.06 sec
 ARIMA(3,1,2)(0,1,0)[12]             : AIC=inf, Time=0.09 sec
 ARIMA(4,1,0)(0,1,0)[12]             : AIC=1738.851, Time=0.02 sec
 ARIMA(4,1,0)(0,1,1)[12]             : AIC=1723.257, Time=0.06 sec
 ARIMA(4,1,0)(1,1,0)[12]             : AIC=1737.427, Time=0.05 sec
 ARIMA(4,1,1)(0,1,0)[12]             : AIC=1740.138, Time=0.03 sec
 ARIMA(5,1,0)(0,1,0)[12]             : AIC=1740.185, Time=0.02 sec

Best model:  ARIMA(0,1,1)(2,1,1)[12]          
Total fit time: 6.406 seconds

Best model parameters:

In [105]:
print('order: ',sarima_auto_model.order)
print('seasonal order: ',sarima_auto_model.seasonal_order)
order:  (0, 1, 1)
seasonal order:  (2, 1, 1, 12)
In [106]:
horizon = 3
auto_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sar_auto = sm.tsa.SARIMAX(
        train['Weight'],
        order=sarima_auto_model.order,
        seasonal_order=sarima_auto_model.seasonal_order,
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = sar_auto.forecast(steps=horizon)
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        auto_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

sarima_auto_cv_df = pd.DataFrame(auto_results)
sarima_auto_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[106]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 9,171,902 6,000,486 40%
1 44 2 09-01-2023 11-01-2023 11,741,002 -6,703,573 18,444,575 157%
2 44 3 09-01-2023 12-01-2023 17,305,596 21,158 17,284,438 100%
3 45 1 10-01-2023 11-01-2023 11,741,002 -4,836,305 16,577,307 141%
4 45 2 10-01-2023 12-01-2023 17,305,596 2,482,824 14,822,772 86%
5 45 3 10-01-2023 01-01-2024 64,702,033 53,761,074 10,940,959 17%
6 46 1 11-01-2023 12-01-2023 17,305,596 9,707,022 7,598,574 44%
7 46 2 11-01-2023 01-01-2024 64,702,033 61,134,415 3,567,618 6%
8 46 3 11-01-2023 02-01-2024 65,375,575 46,124,995 19,250,580 29%
9 47 1 12-01-2023 01-01-2024 64,702,033 65,319,948 617,915 1%
10 47 2 12-01-2023 02-01-2024 65,375,575 52,069,485 13,306,090 20%
11 47 3 12-01-2023 03-01-2024 62,364,919 86,732,571 24,367,652 39%
12 48 1 01-01-2024 02-01-2024 65,375,575 52,436,703 12,938,872 20%
13 48 2 01-01-2024 03-01-2024 62,364,919 87,391,728 25,026,809 40%
14 48 3 01-01-2024 04-01-2024 34,933,233 74,850,371 39,917,138 114%
15 49 1 02-01-2024 03-01-2024 62,364,919 92,159,107 29,794,188 48%
16 49 2 02-01-2024 04-01-2024 34,933,233 75,064,196 40,130,963 115%
17 49 3 02-01-2024 05-01-2024 35,504,037 46,796,129 11,292,092 32%
18 50 1 03-01-2024 04-01-2024 34,933,233 60,741,215 25,807,982 74%
19 50 2 03-01-2024 05-01-2024 35,504,037 31,769,359 3,734,678 11%
20 50 3 03-01-2024 06-01-2024 26,935,635 12,552,359 14,383,276 53%
21 51 1 04-01-2024 05-01-2024 35,504,037 11,965,763 23,538,274 66%
22 51 2 04-01-2024 06-01-2024 26,935,635 -10,041,439 36,977,074 137%
23 51 3 04-01-2024 07-01-2024 19,175,494 -37,034,263 56,209,757 293%
24 52 1 05-01-2024 06-01-2024 26,935,635 4,928,421 22,007,214 82%
25 52 2 05-01-2024 07-01-2024 19,175,494 -22,805,022 41,980,516 219%
26 52 3 05-01-2024 08-01-2024 21,449,146 -26,290,054 47,739,200 223%
27 53 1 06-01-2024 07-01-2024 19,175,494 -12,536,504 31,711,998 165%
28 53 2 06-01-2024 08-01-2024 21,449,146 -15,708,953 37,158,099 173%
29 53 3 06-01-2024 09-01-2024 23,828,291 -12,987,106 36,815,397 155%
30 54 1 07-01-2024 08-01-2024 21,449,146 -249,107 21,698,253 101%
31 54 2 07-01-2024 09-01-2024 23,828,291 1,400,654 22,427,637 94%
32 54 3 07-01-2024 10-01-2024 20,121,692 -15,332,896 35,454,588 176%
33 55 1 08-01-2024 09-01-2024 23,828,291 15,193,058 8,635,233 36%
34 55 2 08-01-2024 10-01-2024 20,121,692 -1,390,429 21,512,121 107%
35 55 3 08-01-2024 11-01-2024 23,311,088 -5,421,165 28,732,253 123%
36 56 1 09-01-2024 10-01-2024 20,121,692 4,311,734 15,809,958 79%
37 56 2 09-01-2024 11-01-2024 23,311,088 759,819 22,551,269 97%
38 56 3 09-01-2024 12-01-2024 31,759,202 8,134,898 23,624,304 74%
In [107]:
sarima_auto_cv = sarima_auto_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_auto_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[107]:
  MAE MAPE
horizon    
1 17,133,558 69%
2 23,203,094 97%
3 28,154,741 110%

Overall, the MAPE is essentially unchanged with this model (still 92%, with a slightly higher MAE), although horizons 1 and 2 improve modestly:

In [108]:
sarima_auto_overall_MAE  = sarima_auto_cv_df['AE'].mean()
sarima_auto_overall_MAPE = sarima_auto_cv_df['APE'].mean()

print(f"MAE: {sarima_auto_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_auto_overall_MAPE:.0%}")
MAE: 22,830,464
MAPE (%): 92%

The AIC search favors a simple non-seasonal component (0,1,1) paired with a stronger seasonal component (2,1,1,12), suggesting that seasonality drives most of the pattern; the cross-validated error, however, is essentially on par with the manual model.
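
One way to quantify how much seasonality dominates is the seasonal-strength measure F_s = max(0, 1 - Var(remainder) / Var(seasonal + remainder)), where values near 1 mean the seasonal component explains most of the detrended variation. A minimal sketch (using an additive decomposition purely for the variance calculation; the framing as a "strength" score is my own addition):

decomp_strength = seasonal_decompose(data['Weight'], model='additive', period=12)

# share of detrended variance explained by the seasonal component
remainder = decomp_strength.resid.dropna()
seas = decomp_strength.seasonal[remainder.index]
f_s = max(0.0, 1 - remainder.var() / (seas + remainder).var())
print(f"seasonal strength: {f_s:.2f}")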

Flexible full search:

This run does not fix any parameters except the 12-month seasonality (no setting for d or D), so it is a very flexible, wide grid search:

In [109]:
sarima_flex_full_auto_model = pm.auto_arima(data['Weight'], start_p=0, start_q=0,
                           max_p=5, max_q=5, m=12,
                           start_P=0, seasonal=True,
                           trace=True,
                           error_action='trace',
                           suppress_warnings=True,
                           stepwise=False)
 ARIMA(0,1,0)(0,0,0)[12] intercept   : AIC=2177.775, Time=0.00 sec
 ARIMA(0,1,0)(0,0,1)[12] intercept   : AIC=2165.766, Time=0.02 sec
 ARIMA(0,1,0)(0,0,2)[12] intercept   : AIC=2167.578, Time=0.03 sec
 ARIMA(0,1,0)(1,0,0)[12] intercept   : AIC=2165.753, Time=0.01 sec
 ARIMA(0,1,0)(1,0,1)[12] intercept   : AIC=2167.895, Time=0.02 sec
 ARIMA(0,1,0)(1,0,2)[12] intercept   : AIC=2168.258, Time=0.10 sec
 ARIMA(0,1,0)(2,0,0)[12] intercept   : AIC=2167.717, Time=0.03 sec
 ARIMA(0,1,0)(2,0,1)[12] intercept   : AIC=2169.315, Time=0.06 sec
 ARIMA(0,1,0)(2,0,2)[12] intercept   : AIC=2169.717, Time=0.08 sec
 ARIMA(0,1,1)(0,0,0)[12] intercept   : AIC=2179.894, Time=0.00 sec
 ARIMA(0,1,1)(0,0,1)[12] intercept   : AIC=2166.436, Time=0.01 sec
 ARIMA(0,1,1)(0,0,2)[12] intercept   : AIC=2168.188, Time=0.03 sec
 ARIMA(0,1,1)(1,0,0)[12] intercept   : AIC=2166.356, Time=0.01 sec
 ARIMA(0,1,1)(1,0,1)[12] intercept   : AIC=2168.056, Time=0.04 sec
 ARIMA(0,1,1)(1,0,2)[12] intercept   : AIC=2168.931, Time=0.16 sec
 ARIMA(0,1,1)(2,0,0)[12] intercept   : AIC=2168.324, Time=0.03 sec
 ARIMA(0,1,1)(2,0,1)[12] intercept   : AIC=2170.126, Time=0.08 sec
 ARIMA(0,1,1)(2,0,2)[12] intercept   : AIC=2170.268, Time=0.11 sec
 ARIMA(0,1,2)(0,0,0)[12] intercept   : AIC=2181.685, Time=0.01 sec
 ARIMA(0,1,2)(0,0,1)[12] intercept   : AIC=2169.206, Time=0.02 sec
 ARIMA(0,1,2)(0,0,2)[12] intercept   : AIC=2170.851, Time=0.04 sec
 ARIMA(0,1,2)(1,0,0)[12] intercept   : AIC=2167.322, Time=0.03 sec
 ARIMA(0,1,2)(1,0,1)[12] intercept   : AIC=2169.289, Time=0.06 sec
 ARIMA(0,1,2)(1,0,2)[12] intercept   : AIC=2170.894, Time=0.19 sec
 ARIMA(0,1,2)(2,0,0)[12] intercept   : AIC=2169.232, Time=0.08 sec
 ARIMA(0,1,2)(2,0,1)[12] intercept   : AIC=2170.714, Time=0.23 sec
 ARIMA(0,1,3)(0,0,0)[12] intercept   : AIC=inf, Time=0.02 sec
 ARIMA(0,1,3)(0,0,1)[12] intercept   : AIC=2163.536, Time=0.04 sec
 ARIMA(0,1,3)(0,0,2)[12] intercept   : AIC=2165.030, Time=0.09 sec
 ARIMA(0,1,3)(1,0,0)[12] intercept   : AIC=2163.944, Time=0.03 sec
 ARIMA(0,1,3)(1,0,1)[12] intercept   : AIC=2165.261, Time=0.08 sec
 ARIMA(0,1,3)(2,0,0)[12] intercept   : AIC=2165.940, Time=0.08 sec
 ARIMA(0,1,4)(0,0,0)[12] intercept   : AIC=2190.432, Time=0.01 sec
 ARIMA(0,1,4)(0,0,1)[12] intercept   : AIC=2166.588, Time=0.03 sec
 ARIMA(0,1,4)(1,0,0)[12] intercept   : AIC=2170.921, Time=0.02 sec
 ARIMA(0,1,5)(0,0,0)[12] intercept   : AIC=2207.147, Time=0.01 sec
 ARIMA(1,1,0)(0,0,0)[12] intercept   : AIC=2179.898, Time=0.00 sec
 ARIMA(1,1,0)(0,0,1)[12] intercept   : AIC=2166.756, Time=0.01 sec
 ARIMA(1,1,0)(0,0,2)[12] intercept   : AIC=2168.507, Time=0.03 sec
 ARIMA(1,1,0)(1,0,0)[12] intercept   : AIC=2166.761, Time=0.01 sec
 ARIMA(1,1,0)(1,0,1)[12] intercept   : AIC=2168.404, Time=0.04 sec
 ARIMA(1,1,0)(1,0,2)[12] intercept   : AIC=2169.811, Time=0.09 sec
 ARIMA(1,1,0)(2,0,0)[12] intercept   : AIC=2168.726, Time=0.03 sec
 ARIMA(1,1,0)(2,0,1)[12] intercept   : AIC=2170.041, Time=0.14 sec
 ARIMA(1,1,0)(2,0,2)[12] intercept   : AIC=2171.055, Time=0.10 sec
 ARIMA(1,1,1)(0,0,0)[12] intercept   : AIC=2181.893, Time=0.01 sec
 ARIMA(1,1,1)(0,0,1)[12] intercept   : AIC=2165.653, Time=0.06 sec
 ARIMA(1,1,1)(0,0,2)[12] intercept   : AIC=2167.068, Time=0.12 sec
 ARIMA(1,1,1)(1,0,0)[12] intercept   : AIC=2166.422, Time=0.04 sec
 ARIMA(1,1,1)(1,0,1)[12] intercept   : AIC=2167.478, Time=0.09 sec
 ARIMA(1,1,1)(1,0,2)[12] intercept   : AIC=2170.507, Time=0.18 sec
 ARIMA(1,1,1)(2,0,0)[12] intercept   : AIC=2168.412, Time=0.12 sec
 ARIMA(1,1,1)(2,0,1)[12] intercept   : AIC=2171.732, Time=0.08 sec
 ARIMA(1,1,2)(0,0,0)[12] intercept   : AIC=2176.453, Time=0.02 sec
 ARIMA(1,1,2)(0,0,1)[12] intercept   : AIC=2166.681, Time=0.03 sec
 ARIMA(1,1,2)(0,0,2)[12] intercept   : AIC=2168.227, Time=0.09 sec
 ARIMA(1,1,2)(1,0,0)[12] intercept   : AIC=2167.104, Time=0.04 sec
 ARIMA(1,1,2)(1,0,1)[12] intercept   : AIC=2168.535, Time=0.06 sec
 ARIMA(1,1,2)(2,0,0)[12] intercept   : AIC=2169.104, Time=0.08 sec
 ARIMA(1,1,3)(0,0,0)[12] intercept   : AIC=2178.578, Time=0.01 sec
 ARIMA(1,1,3)(0,0,1)[12] intercept   : AIC=2165.220, Time=0.03 sec
 ARIMA(1,1,3)(1,0,0)[12] intercept   : AIC=2166.425, Time=0.04 sec
 ARIMA(1,1,4)(0,0,0)[12] intercept   : AIC=2194.552, Time=0.02 sec
 ARIMA(2,1,0)(0,0,0)[12] intercept   : AIC=2181.819, Time=0.01 sec
 ARIMA(2,1,0)(0,0,1)[12] intercept   : AIC=2168.990, Time=0.01 sec
 ARIMA(2,1,0)(0,0,2)[12] intercept   : AIC=2170.728, Time=0.04 sec
 ARIMA(2,1,0)(1,0,0)[12] intercept   : AIC=2168.859, Time=0.02 sec
 ARIMA(2,1,0)(1,0,1)[12] intercept   : AIC=2170.606, Time=0.05 sec
 ARIMA(2,1,0)(1,0,2)[12] intercept   : AIC=2172.016, Time=0.10 sec
 ARIMA(2,1,0)(2,0,0)[12] intercept   : AIC=2170.848, Time=0.04 sec
 ARIMA(2,1,0)(2,0,1)[12] intercept   : AIC=2172.192, Time=0.15 sec
 ARIMA(2,1,1)(0,0,0)[12] intercept   : AIC=2178.184, Time=0.01 sec
 ARIMA(2,1,1)(0,0,1)[12] intercept   : AIC=2166.520, Time=0.03 sec
 ARIMA(2,1,1)(0,0,2)[12] intercept   : AIC=2168.028, Time=0.07 sec
 ARIMA(2,1,1)(1,0,0)[12] intercept   : AIC=2167.254, Time=0.03 sec
 ARIMA(2,1,1)(1,0,1)[12] intercept   : AIC=2168.418, Time=0.05 sec
 ARIMA(2,1,1)(2,0,0)[12] intercept   : AIC=2169.253, Time=0.08 sec
 ARIMA(2,1,2)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(2,1,2)(0,0,1)[12] intercept   : AIC=2168.122, Time=0.06 sec
 ARIMA(2,1,2)(1,0,0)[12] intercept   : AIC=2168.469, Time=0.07 sec
 ARIMA(2,1,3)(0,0,0)[12] intercept   : AIC=2181.756, Time=0.03 sec
 ARIMA(3,1,0)(0,0,0)[12] intercept   : AIC=2183.352, Time=0.01 sec
 ARIMA(3,1,0)(0,0,1)[12] intercept   : AIC=2169.259, Time=0.02 sec
 ARIMA(3,1,0)(0,0,2)[12] intercept   : AIC=2170.921, Time=0.05 sec
 ARIMA(3,1,0)(1,0,0)[12] intercept   : AIC=2168.899, Time=0.02 sec
 ARIMA(3,1,0)(1,0,1)[12] intercept   : AIC=2170.722, Time=0.09 sec
 ARIMA(3,1,0)(2,0,0)[12] intercept   : AIC=2170.885, Time=0.05 sec
 ARIMA(3,1,1)(0,0,0)[12] intercept   : AIC=2175.367, Time=0.02 sec
 ARIMA(3,1,1)(0,0,1)[12] intercept   : AIC=2166.695, Time=0.05 sec
 ARIMA(3,1,1)(1,0,0)[12] intercept   : AIC=2167.554, Time=0.05 sec
 ARIMA(3,1,2)(0,0,0)[12] intercept   : AIC=2172.477, Time=0.04 sec
 ARIMA(4,1,0)(0,0,0)[12] intercept   : AIC=2181.824, Time=0.01 sec
 ARIMA(4,1,0)(0,0,1)[12] intercept   : AIC=2168.351, Time=0.03 sec
 ARIMA(4,1,0)(1,0,0)[12] intercept   : AIC=2167.570, Time=0.03 sec
 ARIMA(4,1,1)(0,0,0)[12] intercept   : AIC=2173.096, Time=0.02 sec
 ARIMA(5,1,0)(0,0,0)[12] intercept   : AIC=2183.108, Time=0.01 sec

Best model:  ARIMA(0,1,3)(0,0,1)[12] intercept
Total fit time: 4.848 seconds
In [110]:
print('order: ',sarima_flex_full_auto_model.order)
print('seasonal order: ',sarima_flex_full_auto_model.seasonal_order)
order:  (0, 1, 3)
seasonal order:  (0, 0, 1, 12)
In [111]:
horizon = 3
flex_full_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sar_flex_full = sm.tsa.SARIMAX(
        train['Weight'],
        order=sarima_flex_full_auto_model.order,
        seasonal_order=sarima_flex_full_auto_model.seasonal_order,
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = sar_flex_full.forecast(steps=horizon)
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        flex_full_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

sarima_flex_full_cv_df = pd.DataFrame(flex_full_results)
sarima_flex_full_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[111]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 32,338,681 17,166,293 113%
1 44 2 09-01-2023 11-01-2023 11,741,002 25,880,186 14,139,184 120%
2 44 3 09-01-2023 12-01-2023 17,305,596 29,161,960 11,856,364 69%
3 45 1 10-01-2023 11-01-2023 11,741,002 16,121,887 4,380,885 37%
4 45 2 10-01-2023 12-01-2023 17,305,596 20,459,783 3,154,187 18%
5 45 3 10-01-2023 01-01-2024 64,702,033 46,031,411 18,670,622 29%
6 46 1 11-01-2023 12-01-2023 17,305,596 17,513,162 207,566 1%
7 46 2 11-01-2023 01-01-2024 64,702,033 43,713,798 20,988,235 32%
8 46 3 11-01-2023 02-01-2024 65,375,575 46,084,618 19,290,957 30%
9 47 1 12-01-2023 01-01-2024 64,702,033 44,154,691 20,547,342 32%
10 47 2 12-01-2023 02-01-2024 65,375,575 46,621,726 18,753,849 29%
11 47 3 12-01-2023 03-01-2024 62,364,919 51,571,601 10,793,318 17%
12 48 1 01-01-2024 02-01-2024 65,375,575 64,160,331 1,215,244 2%
13 48 2 01-01-2024 03-01-2024 62,364,919 72,742,594 10,377,675 17%
14 48 3 01-01-2024 04-01-2024 34,933,233 63,100,434 28,167,201 81%
15 49 1 02-01-2024 03-01-2024 62,364,919 73,723,208 11,358,289 18%
16 49 2 02-01-2024 04-01-2024 34,933,233 63,800,067 28,866,834 83%
17 49 3 02-01-2024 05-01-2024 35,504,037 48,208,784 12,704,747 36%
18 50 1 03-01-2024 04-01-2024 34,933,233 57,072,082 22,138,849 63%
19 50 2 03-01-2024 05-01-2024 35,504,037 42,185,222 6,681,185 19%
20 50 3 03-01-2024 06-01-2024 26,935,635 30,789,640 3,854,005 14%
21 51 1 04-01-2024 05-01-2024 35,504,037 29,899,045 5,604,992 16%
22 51 2 04-01-2024 06-01-2024 26,935,635 20,381,082 6,554,553 24%
23 51 3 04-01-2024 07-01-2024 19,175,494 27,261,872 8,086,378 42%
24 52 1 05-01-2024 06-01-2024 26,935,635 22,719,930 4,215,705 16%
25 52 2 05-01-2024 07-01-2024 19,175,494 28,662,119 9,486,625 49%
26 52 3 05-01-2024 08-01-2024 21,449,146 28,481,821 7,032,675 33%
27 53 1 06-01-2024 07-01-2024 19,175,494 29,251,524 10,076,030 53%
28 53 2 06-01-2024 08-01-2024 21,449,146 27,861,048 6,411,902 30%
29 53 3 06-01-2024 09-01-2024 23,828,291 24,449,049 620,758 3%
30 54 1 07-01-2024 08-01-2024 21,449,146 20,220,672 1,228,474 6%
31 54 2 07-01-2024 09-01-2024 23,828,291 16,707,720 7,120,571 30%
32 54 3 07-01-2024 10-01-2024 20,121,692 13,966,006 6,155,686 31%
33 55 1 08-01-2024 09-01-2024 23,828,291 18,496,186 5,332,105 22%
34 55 2 08-01-2024 10-01-2024 20,121,692 16,149,192 3,972,500 20%
35 55 3 08-01-2024 11-01-2024 23,311,088 18,531,665 4,779,423 21%
36 56 1 09-01-2024 10-01-2024 20,121,692 19,392,509 729,183 4%
37 56 2 09-01-2024 11-01-2024 23,311,088 21,489,295 1,821,793 8%
38 56 3 09-01-2024 12-01-2024 31,759,202 22,405,246 9,353,956 29%
In [112]:
sarima_flex_full_cv = sarima_flex_full_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_flex_full_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[112]:
  MAE MAPE
horizon    
1 8,015,458 29%
2 10,640,699 37%
3 10,874,315 33%

Performance improves significantly with the fully flexible auto-ARIMA grid search. This approaches the best Triple Exponential Smoothing model from earlier, which had a 27% overall MAPE:

In [113]:
sarima_flex_full_overall_MAE  = sarima_flex_full_cv_df['AE'].mean()
sarima_flex_full_overall_MAPE = sarima_flex_full_cv_df['APE'].mean()

print(f"MAE: {sarima_flex_full_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_flex_full_overall_MAPE:.0%}")
MAE: 9,843,491
MAPE (%): 33%

Again, I will also try a log transformation on the data and refit the models. First I look at the decomposition of the log-transformed data; it does not look materially different from the raw decomposition, so the intuitively ideal manual version should keep the same d and m (other orders will differ based on the autocorrelation of the log data):

In [114]:
decomp_raw = seasonal_decompose(data['Weight'], model='multiplicative', period=12)

data['Weight_log'] = np.log1p(data['Weight'])
decomp_log = seasonal_decompose(data['Weight_log'], model='additive', period=12)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

decomp_raw.trend.plot(ax=axes[0,0], color='black'); axes[0,0].set_title('Raw Trend')
decomp_log.trend.plot(ax=axes[0,1], color='black'); axes[0,1].set_title('Log Trend')
decomp_raw.seasonal.plot(ax=axes[1,0], color='black'); axes[1,0].set_title('Raw Seasonal')
decomp_log.seasonal.plot(ax=axes[1,1], color='black'); axes[1,1].set_title('Log Seasonal')

plt.tight_layout()
plt.show()
[Plots: trend and seasonal components, raw vs. log-transformed data]

Run auto-ARIMA on the log data, constraining d and D to 1 based on the earlier differencing tests:

In [115]:
sarima_log_auto = pm.auto_arima(np.log1p(data['Weight']),
                           start_p=0, start_q=0,
                           max_p=5, max_q=5, m=12,
                           start_P=0, seasonal=True,
                           d=1, D=1,
                           trace=True,
                           error_action='ignore',
                           suppress_warnings=True,
                           stepwise=False)
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=53.203, Time=0.01 sec
 ARIMA(0,1,0)(0,1,1)[12]             : AIC=inf, Time=0.04 sec
 ARIMA(0,1,0)(0,1,2)[12]             : AIC=inf, Time=0.13 sec
 ARIMA(0,1,0)(1,1,0)[12]             : AIC=53.361, Time=0.02 sec
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=inf, Time=0.09 sec
 ARIMA(0,1,0)(1,1,2)[12]             : AIC=inf, Time=0.30 sec
 ARIMA(0,1,0)(2,1,0)[12]             : AIC=42.788, Time=0.07 sec
 ARIMA(0,1,0)(2,1,1)[12]             : AIC=inf, Time=0.26 sec
 ARIMA(0,1,0)(2,1,2)[12]             : AIC=42.631, Time=0.20 sec
 ARIMA(0,1,1)(0,1,0)[12]             : AIC=49.652, Time=0.01 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=inf, Time=0.07 sec
 ARIMA(0,1,1)(0,1,2)[12]             : AIC=inf, Time=0.27 sec
 ARIMA(0,1,1)(1,1,0)[12]             : AIC=49.971, Time=0.03 sec
 ARIMA(0,1,1)(1,1,1)[12]             : AIC=inf, Time=0.11 sec
 ARIMA(0,1,1)(1,1,2)[12]             : AIC=inf, Time=0.26 sec
 ARIMA(0,1,1)(2,1,0)[12]             : AIC=41.387, Time=0.12 sec
 ARIMA(0,1,1)(2,1,1)[12]             : AIC=inf, Time=0.29 sec
 ARIMA(0,1,1)(2,1,2)[12]             : AIC=40.412, Time=0.43 sec
 ARIMA(0,1,2)(0,1,0)[12]             : AIC=50.687, Time=0.01 sec
 ARIMA(0,1,2)(0,1,1)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(0,1,2)(0,1,2)[12]             : AIC=inf, Time=0.18 sec
 ARIMA(0,1,2)(1,1,0)[12]             : AIC=50.265, Time=0.03 sec
 ARIMA(0,1,2)(1,1,1)[12]             : AIC=inf, Time=0.20 sec
 ARIMA(0,1,2)(1,1,2)[12]             : AIC=inf, Time=0.14 sec
 ARIMA(0,1,2)(2,1,0)[12]             : AIC=42.050, Time=0.14 sec
 ARIMA(0,1,2)(2,1,1)[12]             : AIC=inf, Time=0.41 sec
 ARIMA(0,1,3)(0,1,0)[12]             : AIC=50.819, Time=0.03 sec
 ARIMA(0,1,3)(0,1,1)[12]             : AIC=inf, Time=0.23 sec
 ARIMA(0,1,3)(0,1,2)[12]             : AIC=inf, Time=0.61 sec
 ARIMA(0,1,3)(1,1,0)[12]             : AIC=50.793, Time=0.05 sec
 ARIMA(0,1,3)(1,1,1)[12]             : AIC=inf, Time=0.17 sec
 ARIMA(0,1,3)(2,1,0)[12]             : AIC=43.785, Time=0.22 sec
 ARIMA(0,1,4)(0,1,0)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(0,1,4)(0,1,1)[12]             : AIC=inf, Time=0.26 sec
 ARIMA(0,1,4)(1,1,0)[12]             : AIC=inf, Time=0.23 sec
 ARIMA(0,1,5)(0,1,0)[12]             : AIC=inf, Time=0.11 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=50.840, Time=0.01 sec
 ARIMA(1,1,0)(0,1,1)[12]             : AIC=inf, Time=0.11 sec
 ARIMA(1,1,0)(0,1,2)[12]             : AIC=inf, Time=0.22 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=51.450, Time=0.02 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=inf, Time=0.13 sec
 ARIMA(1,1,0)(1,1,2)[12]             : AIC=inf, Time=0.37 sec
 ARIMA(1,1,0)(2,1,0)[12]             : AIC=42.415, Time=0.08 sec
 ARIMA(1,1,0)(2,1,1)[12]             : AIC=inf, Time=0.33 sec
 ARIMA(1,1,0)(2,1,2)[12]             : AIC=42.016, Time=0.20 sec
 ARIMA(1,1,1)(0,1,0)[12]             : AIC=inf, Time=0.05 sec
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=inf, Time=0.12 sec
 ARIMA(1,1,1)(0,1,2)[12]             : AIC=inf, Time=0.31 sec
 ARIMA(1,1,1)(1,1,0)[12]             : AIC=inf, Time=0.09 sec
 ARIMA(1,1,1)(1,1,1)[12]             : AIC=inf, Time=0.21 sec
 ARIMA(1,1,1)(1,1,2)[12]             : AIC=inf, Time=0.50 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=inf, Time=0.42 sec
 ARIMA(1,1,1)(2,1,1)[12]             : AIC=inf, Time=0.59 sec
 ARIMA(1,1,2)(0,1,0)[12]             : AIC=inf, Time=0.06 sec
 ARIMA(1,1,2)(0,1,1)[12]             : AIC=inf, Time=0.23 sec
 ARIMA(1,1,2)(0,1,2)[12]             : AIC=inf, Time=0.27 sec
 ARIMA(1,1,2)(1,1,0)[12]             : AIC=inf, Time=0.12 sec
 ARIMA(1,1,2)(1,1,1)[12]             : AIC=inf, Time=0.26 sec
 ARIMA(1,1,2)(2,1,0)[12]             : AIC=inf, Time=0.57 sec
 ARIMA(1,1,3)(0,1,0)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(1,1,3)(0,1,1)[12]             : AIC=inf, Time=0.19 sec
 ARIMA(1,1,3)(1,1,0)[12]             : AIC=inf, Time=0.21 sec
 ARIMA(1,1,4)(0,1,0)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(2,1,0)(0,1,0)[12]             : AIC=52.606, Time=0.01 sec
 ARIMA(2,1,0)(0,1,1)[12]             : AIC=inf, Time=0.10 sec
 ARIMA(2,1,0)(0,1,2)[12]             : AIC=inf, Time=0.25 sec
 ARIMA(2,1,0)(1,1,0)[12]             : AIC=53.020, Time=0.03 sec
 ARIMA(2,1,0)(1,1,1)[12]             : AIC=inf, Time=0.24 sec
 ARIMA(2,1,0)(1,1,2)[12]             : AIC=inf, Time=0.28 sec
 ARIMA(2,1,0)(2,1,0)[12]             : AIC=43.435, Time=0.13 sec
 ARIMA(2,1,0)(2,1,1)[12]             : AIC=inf, Time=0.36 sec
 ARIMA(2,1,1)(0,1,0)[12]             : AIC=inf, Time=0.05 sec
 ARIMA(2,1,1)(0,1,1)[12]             : AIC=inf, Time=0.16 sec
 ARIMA(2,1,1)(0,1,2)[12]             : AIC=inf, Time=0.37 sec
 ARIMA(2,1,1)(1,1,0)[12]             : AIC=inf, Time=0.20 sec
 ARIMA(2,1,1)(1,1,1)[12]             : AIC=inf, Time=0.22 sec
 ARIMA(2,1,1)(2,1,0)[12]             : AIC=inf, Time=0.44 sec
 ARIMA(2,1,2)(0,1,0)[12]             : AIC=inf, Time=0.09 sec
 ARIMA(2,1,2)(0,1,1)[12]             : AIC=inf, Time=0.25 sec
 ARIMA(2,1,2)(1,1,0)[12]             : AIC=inf, Time=0.24 sec
 ARIMA(2,1,3)(0,1,0)[12]             : AIC=inf, Time=0.17 sec
 ARIMA(3,1,0)(0,1,0)[12]             : AIC=53.024, Time=0.01 sec
 ARIMA(3,1,0)(0,1,1)[12]             : AIC=inf, Time=0.11 sec
 ARIMA(3,1,0)(0,1,2)[12]             : AIC=inf, Time=0.41 sec
 ARIMA(3,1,0)(1,1,0)[12]             : AIC=53.495, Time=0.03 sec
 ARIMA(3,1,0)(1,1,1)[12]             : AIC=inf, Time=0.18 sec
 ARIMA(3,1,0)(2,1,0)[12]             : AIC=44.831, Time=0.10 sec
 ARIMA(3,1,1)(0,1,0)[12]             : AIC=52.793, Time=0.03 sec
 ARIMA(3,1,1)(0,1,1)[12]             : AIC=inf, Time=0.20 sec
 ARIMA(3,1,1)(1,1,0)[12]             : AIC=52.657, Time=0.08 sec
 ARIMA(3,1,2)(0,1,0)[12]             : AIC=inf, Time=0.09 sec
 ARIMA(4,1,0)(0,1,0)[12]             : AIC=51.680, Time=0.01 sec
 ARIMA(4,1,0)(0,1,1)[12]             : AIC=inf, Time=0.25 sec
 ARIMA(4,1,0)(1,1,0)[12]             : AIC=51.126, Time=0.06 sec
 ARIMA(4,1,1)(0,1,0)[12]             : AIC=53.524, Time=0.03 sec
 ARIMA(5,1,0)(0,1,0)[12]             : AIC=53.380, Time=0.02 sec

Best model:  ARIMA(0,1,1)(2,1,2)[12]          
Total fit time: 17.021 seconds
In [116]:
print('order: ',sarima_log_auto.order)
print('seasonal order: ',sarima_log_auto.seasonal_order)
order:  (0, 1, 1)
seasonal order:  (2, 1, 2, 12)
In [117]:
horizon = 3
log_auto_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sarima_log_auto_model = sm.tsa.SARIMAX(
        np.log1p(train['Weight']),
        order=sarima_log_auto.order,
        seasonal_order=sarima_log_auto.seasonal_order,
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = np.expm1(sarima_log_auto_model.forecast(steps=horizon))
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        log_auto_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

sarima_log_auto_cv_df = pd.DataFrame(log_auto_results)
sarima_log_auto_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[117]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 18,132,359 2,959,971 20%
1 44 2 09-01-2023 11-01-2023 11,741,002 11,188,104 552,898 5%
2 44 3 09-01-2023 12-01-2023 17,305,596 15,839,031 1,466,565 8%
3 45 1 10-01-2023 11-01-2023 11,741,002 10,024,694 1,716,308 15%
4 45 2 10-01-2023 12-01-2023 17,305,596 14,694,612 2,610,984 15%
5 45 3 10-01-2023 01-01-2024 64,702,033 33,970,825 30,731,208 47%
6 46 1 11-01-2023 12-01-2023 17,305,596 16,423,567 882,029 5%
7 46 2 11-01-2023 01-01-2024 64,702,033 37,164,728 27,537,305 43%
8 46 3 11-01-2023 02-01-2024 65,375,575 32,163,379 33,212,196 51%
9 47 1 12-01-2023 01-01-2024 64,702,033 38,181,643 26,520,390 41%
10 47 2 12-01-2023 02-01-2024 65,375,575 32,519,650 32,855,925 50%
11 47 3 12-01-2023 03-01-2024 62,364,919 44,945,535 17,419,384 28%
12 48 1 01-01-2024 02-01-2024 65,375,575 36,089,098 29,286,477 45%
13 48 2 01-01-2024 03-01-2024 62,364,919 51,142,539 11,222,380 18%
14 48 3 01-01-2024 04-01-2024 34,933,233 42,249,976 7,316,743 21%
15 49 1 02-01-2024 03-01-2024 62,364,919 69,450,790 7,085,871 11%
16 49 2 02-01-2024 04-01-2024 34,933,233 46,086,651 11,153,418 32%
17 49 3 02-01-2024 05-01-2024 35,504,037 34,536,974 967,063 3%
18 50 1 03-01-2024 04-01-2024 34,933,233 42,693,625 7,760,392 22%
19 50 2 03-01-2024 05-01-2024 35,504,037 31,879,381 3,624,656 10%
20 50 3 03-01-2024 06-01-2024 26,935,635 27,200,542 264,907 1%
21 51 1 04-01-2024 05-01-2024 35,504,037 29,412,610 6,091,427 17%
22 51 2 04-01-2024 06-01-2024 26,935,635 26,180,885 754,750 3%
23 51 3 04-01-2024 07-01-2024 19,175,494 14,965,298 4,210,196 22%
24 52 1 05-01-2024 06-01-2024 26,935,635 27,846,940 911,305 3%
25 52 2 05-01-2024 07-01-2024 19,175,494 15,837,472 3,338,022 17%
26 52 3 05-01-2024 08-01-2024 21,449,146 13,438,783 8,010,363 37%
27 53 1 06-01-2024 07-01-2024 19,175,494 15,710,658 3,464,836 18%
28 53 2 06-01-2024 08-01-2024 21,449,146 13,269,210 8,179,936 38%
29 53 3 06-01-2024 09-01-2024 23,828,291 17,104,520 6,723,771 28%
30 54 1 07-01-2024 08-01-2024 21,449,146 13,818,446 7,630,700 36%
31 54 2 07-01-2024 09-01-2024 23,828,291 18,154,813 5,673,478 24%
32 54 3 07-01-2024 10-01-2024 20,121,692 9,713,765 10,407,927 52%
33 55 1 08-01-2024 09-01-2024 23,828,291 19,020,992 4,807,299 20%
34 55 2 08-01-2024 10-01-2024 20,121,692 10,760,367 9,361,325 47%
35 55 3 08-01-2024 11-01-2024 23,311,088 9,080,584 14,230,504 61%
36 56 1 09-01-2024 10-01-2024 20,121,692 12,840,900 7,280,792 36%
37 56 2 09-01-2024 11-01-2024 23,311,088 10,562,908 12,748,180 55%
38 56 3 09-01-2024 12-01-2024 31,759,202 17,556,889 14,202,313 45%
In [118]:
sarima_log_auto_cv = sarima_log_auto_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_log_auto_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[118]:
  MAE MAPE
horizon    
1 8,184,446 22%
2 9,970,251 27%
3 11,474,088 31%

Performance improves significantly, matching the 27% overall MAPE of the best-performing Triple Exponential Smoothing model from earlier:

In [119]:
sarima_log_auto_overall_MAE  = sarima_log_auto_cv_df['AE'].mean()
sarima_log_auto_overall_MAPE = sarima_log_auto_cv_df['APE'].mean()

print(f"MAE: {sarima_log_auto_overall_MAE:,.0f}")
print(f"MAPE: {sarima_log_auto_overall_MAPE:.0%}")
MAE: 9,876,261
MAPE: 27%

Fully flexible auto-ARIMA on the log data - nothing is fixed (d and D are left free):

In [120]:
sarima_log_full_flex = pm.auto_arima(np.log1p(data['Weight']),
                           start_p=0, start_q=0,
                           max_p=5, max_q=5, m=12,
                           start_P=0, seasonal=True,
                           trace=True,
                           error_action='ignore',
                           suppress_warnings=True,
                           stepwise=False)
 ARIMA(0,1,0)(0,0,0)[12] intercept   : AIC=74.720, Time=0.01 sec
 ARIMA(0,1,0)(0,0,1)[12] intercept   : AIC=58.069, Time=0.02 sec
 ARIMA(0,1,0)(0,0,2)[12] intercept   : AIC=60.069, Time=0.07 sec
 ARIMA(0,1,0)(1,0,0)[12] intercept   : AIC=59.645, Time=0.02 sec
 ARIMA(0,1,0)(1,0,1)[12] intercept   : AIC=inf, Time=0.13 sec
 ARIMA(0,1,0)(1,0,2)[12] intercept   : AIC=inf, Time=0.21 sec
 ARIMA(0,1,0)(2,0,0)[12] intercept   : AIC=61.620, Time=0.06 sec
 ARIMA(0,1,0)(2,0,1)[12] intercept   : AIC=inf, Time=0.11 sec
 ARIMA(0,1,0)(2,0,2)[12] intercept   : AIC=inf, Time=0.23 sec
 ARIMA(0,1,1)(0,0,0)[12] intercept   : AIC=76.596, Time=0.01 sec
 ARIMA(0,1,1)(0,0,1)[12] intercept   : AIC=59.177, Time=0.03 sec
 ARIMA(0,1,1)(0,0,2)[12] intercept   : AIC=61.165, Time=0.07 sec
 ARIMA(0,1,1)(1,0,0)[12] intercept   : AIC=59.840, Time=0.03 sec
 ARIMA(0,1,1)(1,0,1)[12] intercept   : AIC=inf, Time=0.15 sec
 ARIMA(0,1,1)(1,0,2)[12] intercept   : AIC=inf, Time=0.28 sec
 ARIMA(0,1,1)(2,0,0)[12] intercept   : AIC=61.839, Time=0.06 sec
 ARIMA(0,1,1)(2,0,1)[12] intercept   : AIC=60.055, Time=0.16 sec
 ARIMA(0,1,1)(2,0,2)[12] intercept   : AIC=inf, Time=0.30 sec
 ARIMA(0,1,2)(0,0,0)[12] intercept   : AIC=78.331, Time=0.01 sec
 ARIMA(0,1,2)(0,0,1)[12] intercept   : AIC=60.751, Time=0.04 sec
 ARIMA(0,1,2)(0,0,2)[12] intercept   : AIC=62.720, Time=0.09 sec
 ARIMA(0,1,2)(1,0,0)[12] intercept   : AIC=inf, Time=0.10 sec
 ARIMA(0,1,2)(1,0,1)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(0,1,2)(1,0,2)[12] intercept   : AIC=inf, Time=0.33 sec
 ARIMA(0,1,2)(2,0,0)[12] intercept   : AIC=inf, Time=0.24 sec
 ARIMA(0,1,2)(2,0,1)[12] intercept   : AIC=inf, Time=0.29 sec
 ARIMA(0,1,3)(0,0,0)[12] intercept   : AIC=inf, Time=0.07 sec
 ARIMA(0,1,3)(0,0,1)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(0,1,3)(0,0,2)[12] intercept   : AIC=inf, Time=0.23 sec
 ARIMA(0,1,3)(1,0,0)[12] intercept   : AIC=inf, Time=0.11 sec
 ARIMA(0,1,3)(1,0,1)[12] intercept   : AIC=inf, Time=0.16 sec
 ARIMA(0,1,3)(2,0,0)[12] intercept   : AIC=inf, Time=0.31 sec
 ARIMA(0,1,4)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(0,1,4)(0,0,1)[12] intercept   : AIC=inf, Time=0.13 sec
 ARIMA(0,1,4)(1,0,0)[12] intercept   : AIC=inf, Time=0.15 sec
 ARIMA(0,1,5)(0,0,0)[12] intercept   : AIC=inf, Time=0.06 sec
 ARIMA(1,1,0)(0,0,0)[12] intercept   : AIC=76.610, Time=0.01 sec
 ARIMA(1,1,0)(0,0,1)[12] intercept   : AIC=59.273, Time=0.03 sec
 ARIMA(1,1,0)(0,0,2)[12] intercept   : AIC=61.265, Time=0.06 sec
 ARIMA(1,1,0)(1,0,0)[12] intercept   : AIC=60.149, Time=0.02 sec
 ARIMA(1,1,0)(1,0,1)[12] intercept   : AIC=inf, Time=0.11 sec
 ARIMA(1,1,0)(1,0,2)[12] intercept   : AIC=inf, Time=0.24 sec
 ARIMA(1,1,0)(2,0,0)[12] intercept   : AIC=62.148, Time=0.07 sec
 ARIMA(1,1,0)(2,0,1)[12] intercept   : AIC=inf, Time=0.17 sec
 ARIMA(1,1,0)(2,0,2)[12] intercept   : AIC=inf, Time=0.29 sec
 ARIMA(1,1,1)(0,0,0)[12] intercept   : AIC=inf, Time=0.03 sec
 ARIMA(1,1,1)(0,0,1)[12] intercept   : AIC=inf, Time=0.09 sec
 ARIMA(1,1,1)(0,0,2)[12] intercept   : AIC=inf, Time=0.20 sec
 ARIMA(1,1,1)(1,0,0)[12] intercept   : AIC=inf, Time=0.09 sec
 ARIMA(1,1,1)(1,0,1)[12] intercept   : AIC=55.142, Time=0.11 sec
 ARIMA(1,1,1)(1,0,2)[12] intercept   : AIC=inf, Time=0.29 sec
 ARIMA(1,1,1)(2,0,0)[12] intercept   : AIC=inf, Time=0.24 sec
 ARIMA(1,1,1)(2,0,1)[12] intercept   : AIC=inf, Time=0.32 sec
 ARIMA(1,1,2)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(1,1,2)(0,0,1)[12] intercept   : AIC=inf, Time=0.10 sec
 ARIMA(1,1,2)(0,0,2)[12] intercept   : AIC=inf, Time=0.22 sec
 ARIMA(1,1,2)(1,0,0)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(1,1,2)(1,0,1)[12] intercept   : AIC=inf, Time=0.13 sec
 ARIMA(1,1,2)(2,0,0)[12] intercept   : AIC=55.573, Time=0.27 sec
 ARIMA(1,1,3)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(1,1,3)(0,0,1)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(1,1,3)(1,0,0)[12] intercept   : AIC=inf, Time=0.13 sec
 ARIMA(1,1,4)(0,0,0)[12] intercept   : AIC=inf, Time=0.05 sec
 ARIMA(2,1,0)(0,0,0)[12] intercept   : AIC=78.450, Time=0.01 sec
 ARIMA(2,1,0)(0,0,1)[12] intercept   : AIC=61.165, Time=0.04 sec
 ARIMA(2,1,0)(0,0,2)[12] intercept   : AIC=63.157, Time=0.08 sec
 ARIMA(2,1,0)(1,0,0)[12] intercept   : AIC=61.923, Time=0.04 sec
 ARIMA(2,1,0)(1,0,1)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(2,1,0)(1,0,2)[12] intercept   : AIC=inf, Time=0.32 sec
 ARIMA(2,1,0)(2,0,0)[12] intercept   : AIC=63.922, Time=0.08 sec
 ARIMA(2,1,0)(2,0,1)[12] intercept   : AIC=inf, Time=0.29 sec
 ARIMA(2,1,1)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(2,1,1)(0,0,1)[12] intercept   : AIC=inf, Time=0.08 sec
 ARIMA(2,1,1)(0,0,2)[12] intercept   : AIC=inf, Time=0.21 sec
 ARIMA(2,1,1)(1,0,0)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(2,1,1)(1,0,1)[12] intercept   : AIC=inf, Time=0.14 sec
 ARIMA(2,1,1)(2,0,0)[12] intercept   : AIC=inf, Time=0.29 sec
 ARIMA(2,1,2)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(2,1,2)(0,0,1)[12] intercept   : AIC=inf, Time=0.11 sec
 ARIMA(2,1,2)(1,0,0)[12] intercept   : AIC=inf, Time=0.10 sec
 ARIMA(2,1,3)(0,0,0)[12] intercept   : AIC=inf, Time=0.05 sec
 ARIMA(3,1,0)(0,0,0)[12] intercept   : AIC=80.360, Time=0.01 sec
 ARIMA(3,1,0)(0,0,1)[12] intercept   : AIC=62.262, Time=0.05 sec
 ARIMA(3,1,0)(0,0,2)[12] intercept   : AIC=64.229, Time=0.09 sec
 ARIMA(3,1,0)(1,0,0)[12] intercept   : AIC=62.851, Time=0.05 sec
 ARIMA(3,1,0)(1,0,1)[12] intercept   : AIC=inf, Time=0.16 sec
 ARIMA(3,1,0)(2,0,0)[12] intercept   : AIC=64.851, Time=0.12 sec
 ARIMA(3,1,1)(0,0,0)[12] intercept   : AIC=inf, Time=0.05 sec
 ARIMA(3,1,1)(0,0,1)[12] intercept   : AIC=inf, Time=0.17 sec
 ARIMA(3,1,1)(1,0,0)[12] intercept   : AIC=inf, Time=0.12 sec
 ARIMA(3,1,2)(0,0,0)[12] intercept   : AIC=inf, Time=0.04 sec
 ARIMA(4,1,0)(0,0,0)[12] intercept   : AIC=81.534, Time=0.01 sec
 ARIMA(4,1,0)(0,0,1)[12] intercept   : AIC=63.438, Time=0.05 sec
 ARIMA(4,1,0)(1,0,0)[12] intercept   : AIC=62.527, Time=0.12 sec
 ARIMA(4,1,1)(0,0,0)[12] intercept   : AIC=inf, Time=0.08 sec
 ARIMA(5,1,0)(0,0,0)[12] intercept   : AIC=82.538, Time=0.02 sec

Best model:  ARIMA(1,1,1)(1,0,1)[12] intercept
Total fit time: 11.417 seconds
In [121]:
print('order: ',sarima_log_full_flex.order)
print('seasonal order: ',sarima_log_full_flex.seasonal_order)
order:  (1, 1, 1)
seasonal order:  (1, 0, 1, 12)
In [122]:
horizon = 3
log_full_flex_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sarima_log_full_flex_model = sm.tsa.SARIMAX(
        np.log1p(train['Weight']),
        order=sarima_log_full_flex.order,
        seasonal_order=sarima_log_full_flex.seasonal_order,
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = np.expm1(sarima_log_full_flex_model.forecast(steps=horizon))
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        log_full_flex_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape
        })

sarima_log_full_flex_cv_df = pd.DataFrame(log_full_flex_results)
sarima_log_full_flex_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[122]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 19,022,105 3,849,717 25%
1 44 2 09-01-2023 11-01-2023 11,741,002 16,374,058 4,633,056 39%
2 44 3 09-01-2023 12-01-2023 17,305,596 13,644,569 3,661,027 21%
3 45 1 10-01-2023 11-01-2023 11,741,002 17,196,917 5,455,915 46%
4 45 2 10-01-2023 12-01-2023 17,305,596 21,634,850 4,329,254 25%
5 45 3 10-01-2023 01-01-2024 64,702,033 45,512,077 19,189,956 30%
6 46 1 11-01-2023 12-01-2023 17,305,596 17,042,136 263,460 2%
7 46 2 11-01-2023 01-01-2024 64,702,033 38,918,497 25,783,536 40%
8 46 3 11-01-2023 02-01-2024 65,375,575 39,278,741 26,096,834 40%
9 47 1 12-01-2023 01-01-2024 64,702,033 39,207,900 25,494,133 39%
10 47 2 12-01-2023 02-01-2024 65,375,575 39,392,927 25,982,648 40%
11 47 3 12-01-2023 03-01-2024 62,364,919 50,918,681 11,446,238 18%
12 48 1 01-01-2024 02-01-2024 65,375,575 53,263,101 12,112,474 19%
13 48 2 01-01-2024 03-01-2024 62,364,919 62,239,477 125,442 0%
14 48 3 01-01-2024 04-01-2024 34,933,233 60,362,747 25,429,514 73%
15 49 1 02-01-2024 03-01-2024 62,364,919 70,718,057 8,353,138 13%
16 49 2 02-01-2024 04-01-2024 34,933,233 65,343,812 30,410,579 87%
17 49 3 02-01-2024 05-01-2024 35,504,037 49,088,781 13,584,744 38%
18 50 1 03-01-2024 04-01-2024 34,933,233 61,667,036 26,733,803 77%
19 50 2 03-01-2024 05-01-2024 35,504,037 47,266,024 11,761,987 33%
20 50 3 03-01-2024 06-01-2024 26,935,635 41,055,774 14,120,139 52%
21 51 1 04-01-2024 05-01-2024 35,504,037 29,688,032 5,816,005 16%
22 51 2 04-01-2024 06-01-2024 26,935,635 22,949,698 3,985,937 15%
23 51 3 04-01-2024 07-01-2024 19,175,494 18,947,987 227,507 1%
24 52 1 05-01-2024 06-01-2024 26,935,635 24,509,724 2,425,911 9%
25 52 2 05-01-2024 07-01-2024 19,175,494 20,206,939 1,031,445 5%
26 52 3 05-01-2024 08-01-2024 21,449,146 24,359,621 2,910,475 14%
27 53 1 06-01-2024 07-01-2024 19,175,494 19,586,924 411,430 2%
28 53 2 06-01-2024 08-01-2024 21,449,146 21,227,726 221,420 1%
29 53 3 06-01-2024 09-01-2024 23,828,291 21,642,176 2,186,115 9%
30 54 1 07-01-2024 08-01-2024 21,449,146 20,717,126 732,020 3%
31 54 2 07-01-2024 09-01-2024 23,828,291 20,584,375 3,243,916 14%
32 54 3 07-01-2024 10-01-2024 20,121,692 16,798,862 3,322,830 17%
33 55 1 08-01-2024 09-01-2024 23,828,291 20,391,635 3,436,656 14%
34 55 2 08-01-2024 10-01-2024 20,121,692 16,740,644 3,381,048 17%
35 55 3 08-01-2024 11-01-2024 23,311,088 14,168,457 9,142,631 39%
36 56 1 09-01-2024 10-01-2024 20,121,692 18,667,555 1,454,137 7%
37 56 2 09-01-2024 11-01-2024 23,311,088 15,944,377 7,366,711 32%
38 56 3 09-01-2024 12-01-2024 31,759,202 18,232,002 13,527,200 43%
In [123]:
sarima_log_full_flex_cv = sarima_log_full_flex_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_log_full_flex_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[123]:
  MAE MAPE
horizon    
1 7,426,062 21%
2 9,404,383 27%
3 11,141,939 30%

By the overall metrics, the fully flexible grid search is actually marginally better here (26% overall MAPE vs. 27%):

In [124]:
sarima_log_full_flex_overall_MAE  = sarima_log_full_flex_cv_df['AE'].mean()
sarima_log_full_flex_overall_MAPE = sarima_log_full_flex_cv_df['APE'].mean()

print(f"MAE: {sarima_log_full_flex_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_log_full_flex_overall_MAPE:.0%}")
MAE: 9,324,128
MAPE (%): 26%

Switching to a manually specified model:

The manual (p,d,q)(P,D,Q,s) orders below are chosen for the log-transformed data, based on the earlier autocorrelation plots and decomposition.
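As a reminder of how those plots inform the choice, here is a minimal sketch (not part of the original run) of the kind of check involved; it assumes the same `data` DataFrame used above, and the lag count is an arbitrary choice:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Difference once at lag 1 (d=1) and once at the seasonal lag 12 (D=1, s=12)
log_weight = np.log1p(data['Weight'])
stationary = log_weight.diff().diff(12).dropna()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(stationary, lags=18, ax=axes[0])    # significant spikes here suggest MA (q / Q) terms
plot_pacf(stationary, lags=18, ax=axes[1])   # significant spikes here suggest AR (p / P) terms
plt.show()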

In [125]:
sarima_log_manual_order = (2,1,1)
sarima_log_manual_seasonal_order = (1,1,1,12)
In [126]:
horizon = 3
log_manual_results = []

for cutoff in range(44, 60):
    if cutoff + horizon >= len(data):
        break

    train = data.iloc[:cutoff+1]

    sarima_log_manual_model = sm.tsa.SARIMAX(
        np.log1p(train['Weight']),
        order=sarima_log_manual_order,
        seasonal_order=sarima_log_manual_seasonal_order,
        trend='c',
        use_exact_diffuse=True
    ).fit(disp=False)

    yhat = np.expm1(sarima_log_manual_model.forecast(steps=horizon))
    y_true = data['Weight'].iloc[cutoff+1: cutoff+1 + horizon]

    for h in range(horizon):
        ae = abs(yhat.iloc[h] - y_true.iloc[h])
        ape = ae / max(abs(y_true.iloc[h]), 1e-8)

        log_manual_results.append({
            'cutoff': cutoff,
            'horizon': h + 1,
            'cutoff_month': data['Import Date'].iloc[cutoff].to_period('M').to_timestamp(),
            'month': data['Import Date'].iloc[cutoff + h + 1].to_period('M').to_timestamp(),
            'y': y_true.iloc[h],
            'yhat': yhat.iloc[h],
            'AE': ae,
            'APE': ape

        })

sarima_log_manual_cv_df = pd.DataFrame(log_manual_results)
sarima_log_manual_cv_df.style.format({
    'cutoff_month': '{:%m-%d-%Y}',
    'month': '{:%m-%d-%Y}',
    'y': '{:,.0f}',
    'yhat': '{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[126]:
  cutoff horizon cutoff_month month y yhat AE APE
0 44 1 09-01-2023 10-01-2023 15,172,388 17,533,105 2,360,717 16%
1 44 2 09-01-2023 11-01-2023 11,741,002 14,875,708 3,134,706 27%
2 44 3 09-01-2023 12-01-2023 17,305,596 15,784,007 1,521,589 9%
3 45 1 10-01-2023 11-01-2023 11,741,002 14,100,880 2,359,878 20%
4 45 2 10-01-2023 12-01-2023 17,305,596 15,694,067 1,611,529 9%
5 45 3 10-01-2023 01-01-2024 64,702,033 38,782,641 25,919,392 40%
6 46 1 11-01-2023 12-01-2023 17,305,596 14,413,347 2,892,249 17%
7 46 2 11-01-2023 01-01-2024 64,702,033 37,130,369 27,571,664 43%
8 46 3 11-01-2023 02-01-2024 65,375,575 33,732,957 31,642,618 48%
9 47 1 12-01-2023 01-01-2024 64,702,033 39,630,134 25,071,899 39%
10 47 2 12-01-2023 02-01-2024 65,375,575 34,422,739 30,952,836 47%
11 47 3 12-01-2023 03-01-2024 62,364,919 42,204,836 20,160,083 32%
12 48 1 01-01-2024 02-01-2024 65,375,575 43,141,021 22,234,554 34%
13 48 2 01-01-2024 03-01-2024 62,364,919 46,246,949 16,117,970 26%
14 48 3 01-01-2024 04-01-2024 34,933,233 42,942,172 8,008,939 23%
15 49 1 02-01-2024 03-01-2024 62,364,919 57,197,381 5,167,538 8%
16 49 2 02-01-2024 04-01-2024 34,933,233 48,828,376 13,895,143 40%
17 49 3 02-01-2024 05-01-2024 35,504,037 33,518,250 1,985,787 6%
18 50 1 03-01-2024 04-01-2024 34,933,233 51,099,745 16,166,512 46%
19 50 2 03-01-2024 05-01-2024 35,504,037 34,455,289 1,048,748 3%
20 50 3 03-01-2024 06-01-2024 26,935,635 27,590,976 655,341 2%
21 51 1 04-01-2024 05-01-2024 35,504,037 28,543,211 6,960,826 20%
22 51 2 04-01-2024 06-01-2024 26,935,635 26,164,625 771,010 3%
23 51 3 04-01-2024 07-01-2024 19,175,494 14,860,194 4,315,300 23%
24 52 1 05-01-2024 06-01-2024 26,935,635 29,343,558 2,407,923 9%
25 52 2 05-01-2024 07-01-2024 19,175,494 15,632,127 3,543,367 18%
26 52 3 05-01-2024 08-01-2024 21,449,146 14,745,230 6,703,916 31%
27 53 1 06-01-2024 07-01-2024 19,175,494 15,005,852 4,169,642 22%
28 53 2 06-01-2024 08-01-2024 21,449,146 14,511,456 6,937,690 32%
29 53 3 06-01-2024 09-01-2024 23,828,291 14,480,263 9,348,028 39%
30 54 1 07-01-2024 08-01-2024 21,449,146 16,223,817 5,225,329 24%
31 54 2 07-01-2024 09-01-2024 23,828,291 15,324,391 8,503,900 36%
32 54 3 07-01-2024 10-01-2024 20,121,692 10,494,907 9,626,785 48%
33 55 1 08-01-2024 09-01-2024 23,828,291 17,382,719 6,445,572 27%
34 55 2 08-01-2024 10-01-2024 20,121,692 11,062,373 9,059,319 45%
35 55 3 08-01-2024 11-01-2024 23,311,088 8,674,143 14,636,945 63%
36 56 1 09-01-2024 10-01-2024 20,121,692 12,957,528 7,164,164 36%
37 56 2 09-01-2024 11-01-2024 23,311,088 9,499,250 13,811,838 59%
38 56 3 09-01-2024 12-01-2024 31,759,202 11,324,074 20,435,128 64%
In [127]:
sarima_log_manual_cv = sarima_log_manual_cv_df.groupby('horizon').agg(
    MAE=('AE', 'mean'),
    MAPE=('APE', 'mean'))

sarima_log_manual_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[127]:
  MAE MAPE
horizon    
1 8,355,908 24%
2 10,535,363 30%
3 11,919,988 33%

Performance is much better than the manual SARIMA on the raw data, although the 29% overall MAPE is still higher than the 27% from other models:

In [128]:
sarima_log_manual_overall_MAE  = sarima_log_manual_cv_df['AE'].mean()
sarima_log_manual_overall_MAPE = sarima_log_manual_cv_df['APE'].mean()

print(f"MAE: {sarima_log_manual_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_log_manual_overall_MAPE:.0%}")
MAE: 10,270,420
MAPE (%): 29%

Below I assign the best performing SARIMA configuration as the "final" SARIMA model, which was the short grid search ("auto" model) on the log-transformed data:

In [129]:
final_order = sarima_log_auto_model.model.order
final_seasonal_order = sarima_log_auto_model.model.seasonal_order
In [130]:
print(final_order)
print(final_seasonal_order)
(0, 1, 1)
(2, 1, 2, 12)
In [131]:
sarima_final_cv = sarima_log_auto_cv
In [132]:
sarima_final_cv.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}'})
Out[132]:
  MAE MAPE
horizon    
1 8,184,446 22%
2 9,970,251 27%
3 11,474,088 31%
In [133]:
sarima_final_overall_MAE = sarima_log_auto_overall_MAE
sarima_final_overall_MAPE = sarima_log_auto_overall_MAPE
In [134]:
print(f"MAE: {sarima_final_overall_MAE:,.0f}")
print(f"MAPE (%): {sarima_final_overall_MAPE:.0%}")
MAE: 9,876,261
MAPE (%): 27%

Final SARIMA plot:

(Note: the fitted values are clipped to the 1st-99th percentile range in log space before back-transforming, since a few extreme fitted values created by the log transformation skewed the view of the plot)

In [135]:
sar_final = sm.tsa.SARIMAX(
    np.log1p(data['Weight']),
    order=final_order,
    seasonal_order=final_seasonal_order,
    trend='c',
    use_exact_diffuse=True
).fit(disp=False)

fitted_sar_final_log = sar_final.fittedvalues

log_min, log_max = np.percentile(np.log1p(data['Weight']), [1, 99])
fitted_sar_final_log = np.clip(fitted_sar_final_log, log_min, log_max)
fitted_sar_final = np.expm1(fitted_sar_final_log)

fitted_dates = data['Import Date'].iloc[-len(fitted_sar_final):].reset_index(drop=True)

forecast_steps = 3
forecast_obj = sar_final.get_forecast(steps=forecast_steps)
forecast_mean_log = forecast_obj.predicted_mean
forecast_mean = np.expm1(forecast_mean_log)
conf_int_log = forecast_obj.conf_int(alpha=0.1)
lower = np.expm1(conf_int_log.iloc[:, 0])
upper = np.expm1(conf_int_log.iloc[:, 1])

last_date = data['Import Date'].iloc[-1]
future_dates = pd.date_range(start=last_date + pd.offsets.MonthBegin(1), periods=forecast_steps, freq='MS')

combined_preds_sar_final = np.concatenate([fitted_sar_final.values, forecast_mean.values])
combined_dates_sar_final = pd.concat([fitted_dates, pd.Series(future_dates)], ignore_index=True)

plt.figure(figsize=(12, 5))
plt.plot(data['Import Date'], data['Weight'], color=custom_palette[2], label="actual")
plt.plot(combined_dates_sar_final, combined_preds_sar_final, color=custom_palette[0], linestyle='--', label="model fit + forecast")
plt.fill_between(future_dates, lower, upper, color=custom_palette[0], alpha=0.2, label='80% Confidence Interval')
plt.axvspan(future_dates[0], future_dates[-1], color=custom_palette[0], alpha=0.1)
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1e6)}M'))
plt.legend(loc='upper left')
plt.title("SARIMA Model Final - Short Grid Search (Log Data)")
plt.xlabel("Date")
plt.ylabel("Weight")
plt.grid(alpha=0.3)
plt.show()
No description has been provided for this image

The model has good diagnostics now: the residuals look like white noise, the histogram looks more normal, the Q-Q plot is more in line, and there is no remaining autocorrelation:

In [136]:
sar_final.plot_diagnostics(figsize = (15,8),lags = 12);
No description has been provided for this image

Also checking lags further out (13-24), but I see no autocorrelation:

In [137]:
sar_final.plot_diagnostics(figsize=(15,8), lags=24)
Out[137]:
No description has been provided for this image
No description has been provided for this image
In [138]:
sar_final.summary()
Out[138]:
SARIMAX Results
Dep. Variable: Weight No. Observations: 60
Model: SARIMAX(0, 1, 1)x(2, 1, [1, 2], 12) Log Likelihood -26.152
Date: Tue, 13 Jan 2026 AIC 92.304
Time: 16:44:55 BIC 134.190
Sample: 0 HQIC 108.688
- 60
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0003 0.024 0.014 0.989 -0.046 0.047
ma.L1 -0.3927 0.174 -2.261 0.024 -0.733 -0.052
ar.S.L12 0.0559 0.491 0.114 0.909 -0.907 1.019
ar.S.L24 -0.6244 0.245 -2.547 0.011 -1.105 -0.144
ma.S.L12 -1.6591 0.782 -2.122 0.034 -3.191 -0.127
ma.S.L24 0.7801 1.083 0.720 0.471 -1.342 2.902
sigma2 0.0312 0.032 0.973 0.331 -0.032 0.094
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 1.33
Prob(Q): 0.86 Prob(JB): 0.52
Heteroskedasticity (H): 0.63 Skew: 0.03
Prob(H) (two-sided): 0.37 Kurtosis: 2.18


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.34e+16. Standard errors may be unstable.

Statistical Tests also confirm good diagnostics and model fit:

In [139]:
norm_val, norm_p, skew, kurtosis = sar_final.test_normality('jarquebera')[0]
lb_val, lb_p = sar_final.test_serial_correlation(method='ljungbox',)[0]
het_val, het_p = sar_final.test_heteroskedasticity('breakvar')[0]

# Ljung-Box: look at largest lag
lb_val = lb_val[-1]
lb_p = lb_p[-1]
durbin_watson = sm.stats.stattools.durbin_watson(
    sar_final.filter_results.standardized_forecasts_error[0, sar_final.loglikelihood_burn:])

print('Normality: val={:.3f}, p={:.3f}'.format(norm_val, norm_p));
print('Ljung-Box: val={:.3f}, p={:.3f}'.format(lb_val, lb_p));
print('Heteroskedasticity: val={:.3f}, p={:.3f}'.format(het_val, het_p));
print('Durbin-Watson: d={:.2f}'.format(durbin_watson))
Normality: val=1.326, p=0.515
Ljung-Box: val=6.658, p=0.673
Heteroskedasticity: val=0.635, p=0.373
Durbin-Watson: d=1.95

Normality (Jarque-Bera): p > 0.05 → normal residuals ✓
Ljung-Box (max lag): p > 0.05 → no autocorrelation ✓
Heteroskedasticity: p > 0.05 → homoskedastic ✓
Durbin-Watson: d ≈ 2.0 → no serial correlation ✓
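As an optional cross-check (not in the original run), the same Ljung-Box test can be computed directly on the standardized residuals with statsmodels' acorr_ljungbox; this sketch assumes a recent statsmodels version where the function returns a DataFrame:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Standardized residuals after the burn-in period, as used for Durbin-Watson above
resid = sar_final.filter_results.standardized_forecasts_error[0, sar_final.loglikelihood_burn:]

lb = acorr_ljungbox(resid, lags=12)
print(lb.tail(1))   # lb_pvalue > 0.05 at the largest lag → no evidence of autocorrelation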

Model 3: Prophet¶

I also fit the Prophet model to see if it improves performance:

In [140]:
df=monthly_df.copy()
df['ds'] = df['Import Date']
df['y'] = df['Weight']
In [141]:
df.drop(df.columns[0:6], axis=1, inplace=True)
In [142]:
df.head()
Out[142]:
ds y
0 2020-01-01 77863938
1 2020-02-01 118967413
2 2020-03-01 117594405
3 2020-04-01 72523979
4 2020-05-01 58180114

Fit the data with yearly seasonality (seasonality by month) as an initial baseline:

In [ ]:
m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
m.fit(df)

Add future 3 months for forecast:

In [144]:
future = m.make_future_dataframe(periods=3, freq='MS')
print(future.head())
print(future.tail())
          ds
0 2020-01-01
1 2020-02-01
2 2020-03-01
3 2020-04-01
4 2020-05-01
           ds
58 2024-11-01
59 2024-12-01
60 2025-01-01
61 2025-02-01
62 2025-03-01

Forecast the next 3 months (with a CI):

In [145]:
forecast = m.predict(future)
print(forecast.columns)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'yearly', 'yearly_lower', 'yearly_upper', 'multiplicative_terms',
       'multiplicative_terms_lower', 'multiplicative_terms_upper', 'yhat'],
      dtype='object')
Out[145]:
ds yhat yhat_lower yhat_upper
58 2024-11-01 2.387980e+06 -1.588334e+07 2.176104e+07
59 2024-12-01 1.790853e+07 -1.263356e+06 3.646359e+07
60 2025-01-01 6.511386e+07 4.651289e+07 8.381275e+07
61 2025-02-01 3.929141e+07 1.980159e+07 5.699398e+07
62 2025-03-01 7.277707e+07 5.357060e+07 9.143605e+07
In [146]:
plt.figure(figsize=(12, 6))

plt.plot(df["ds"], df["y"], label="Actual", color=custom_palette[2])

plt.plot(forecast["ds"], forecast["yhat"], label="Forecast", color=custom_palette[0], linestyle='--')

future_forecast = forecast[forecast['ds'] > df['ds'].max()]
plt.fill_between(
    future_forecast['ds'],
    future_forecast['yhat_lower'],
    future_forecast['yhat_upper'],
    color=custom_palette[0], alpha=0.2, label='80% Confidence Interval'
)

ax = plt.gca()
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('{x:,.0f}'))

plt.legend()
plt.ylabel("Cocoa Bean Imports (lbs)")
plt.xlabel("Date")
plt.title("Prophet")
plt.show()
No description has been provided for this image

Prophet's Decomposition:

In [147]:
fig = m.plot_components(forecast)

for ax in fig.axes:
    for line in ax.get_lines():
        line.set_color('black')
plt.show()
No description has been provided for this image

Cross validation for predicting rolling 1-3 months out from training data:

In [148]:
period = "1MS"  # Month start
cutoffs = pd.date_range(start="2023-09-01", end="2024-07-01", freq=period) # End is last starting date
horizon = "93 days"  # Forecast horizon for each cutoff
prophet_cv = cross_validation(m, cutoffs=cutoffs, horizon=horizon, disable_tqdm=True)

prophet_cv['AE'] = abs(prophet_cv['yhat'] - prophet_cv['y'])
prophet_cv['APE'] = abs(prophet_cv['yhat'] - prophet_cv['y']) / abs(prophet_cv['y'])

prophet_cv['ds'] = pd.to_datetime(prophet_cv['ds'])
prophet_cv['cutoff'] = pd.to_datetime(prophet_cv['cutoff'])
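# Horizon in months: month of the forecast date minus month of the cutoff date (as integer Period counts)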
prophet_cv['horizon'] = prophet_cv['ds'].dt.to_period('M').astype(int) - prophet_cv['cutoff'].dt.to_period('M').astype(int)

prophet_cv.style.format({'ds':'{:%m-%d-%Y}','cutoff':'{:%m-%d-%Y}','y':'{:,.0f}','yhat':'{:,.0f}','yhat_lower':'{:,.0f}','yhat_upper':'{:,.0f}','AE': '{:,.0f}', 'APE': '{:.2f}'})
Out[148]:
  ds yhat yhat_lower yhat_upper y cutoff AE APE horizon
0 10-01-2023 34,571,294 16,929,854 52,300,645 15,172,388 09-01-2023 19,398,906 1.28 1
1 11-01-2023 36,122,420 18,245,770 52,255,631 11,741,002 09-01-2023 24,381,418 2.08 2
2 12-01-2023 6,271,949 -9,598,850 23,445,941 17,305,596 09-01-2023 11,033,647 0.64 3
3 11-01-2023 39,374,964 23,385,461 56,874,108 11,741,002 10-01-2023 27,633,962 2.35 1
4 12-01-2023 8,371,796 -8,831,315 25,519,546 17,305,596 10-01-2023 8,933,800 0.52 2
5 01-01-2024 50,233,151 33,238,142 67,156,617 64,702,033 10-01-2023 14,468,882 0.22 3
6 12-01-2023 14,152,574 -3,117,462 31,062,282 17,305,596 11-01-2023 3,153,022 0.18 1
7 01-01-2024 49,940,755 32,061,018 68,962,455 64,702,033 11-01-2023 14,761,278 0.23 2
8 02-01-2024 78,604,042 60,051,070 95,263,367 65,375,575 11-01-2023 13,228,467 0.20 3
9 01-01-2024 49,872,672 32,245,314 67,248,874 64,702,033 12-01-2023 14,829,361 0.23 1
10 02-01-2024 78,598,462 61,289,389 97,393,705 65,375,575 12-01-2023 13,222,887 0.20 2
11 03-01-2024 106,659,231 89,317,085 123,659,483 62,364,919 12-01-2023 44,294,312 0.71 3
12 02-01-2024 79,579,136 62,051,163 96,549,506 65,375,575 01-01-2024 14,203,561 0.22 1
13 03-01-2024 108,929,837 91,319,256 126,149,664 62,364,919 01-01-2024 46,564,918 0.75 2
14 04-01-2024 64,384,252 48,206,371 81,351,503 34,933,233 01-01-2024 29,451,019 0.84 3
15 03-01-2024 106,524,646 88,859,720 123,044,772 62,364,919 02-01-2024 44,159,727 0.71 1
16 04-01-2024 62,391,175 44,772,206 79,368,691 34,933,233 02-01-2024 27,457,942 0.79 2
17 05-01-2024 52,155,620 34,999,897 69,397,706 35,504,037 02-01-2024 16,651,583 0.47 3
18 04-01-2024 62,532,257 44,824,399 82,380,783 34,933,233 03-01-2024 27,599,024 0.79 1
19 05-01-2024 51,432,949 32,080,191 68,912,081 35,504,037 03-01-2024 15,928,912 0.45 2
20 06-01-2024 51,258,120 32,798,947 70,240,545 26,935,635 03-01-2024 24,322,485 0.90 3
21 05-01-2024 51,418,592 33,825,254 69,345,604 35,504,037 04-01-2024 15,914,555 0.45 1
22 06-01-2024 50,860,837 31,374,903 69,865,398 26,935,635 04-01-2024 23,925,202 0.89 2
23 07-01-2024 20,165,735 1,083,522 38,248,827 19,175,494 04-01-2024 990,241 0.05 3
24 06-01-2024 50,956,609 32,704,522 70,184,129 26,935,635 05-01-2024 24,020,974 0.89 1
25 07-01-2024 20,052,688 1,345,047 38,428,082 19,175,494 05-01-2024 877,194 0.05 2
26 08-01-2024 11,858,470 -5,635,972 31,271,205 21,449,146 05-01-2024 9,590,676 0.45 3
27 07-01-2024 20,360,144 3,062,241 39,273,490 19,175,494 06-01-2024 1,184,650 0.06 1
28 08-01-2024 11,755,945 -5,622,807 30,095,651 21,449,146 06-01-2024 9,693,201 0.45 2
29 09-01-2024 21,479,723 3,124,096 39,337,768 23,828,291 06-01-2024 2,348,568 0.10 3
30 08-01-2024 11,759,345 -5,355,922 31,470,017 21,449,146 07-01-2024 9,689,801 0.45 1
31 09-01-2024 21,470,890 3,739,985 39,615,374 23,828,291 07-01-2024 2,357,401 0.10 2
32 10-01-2024 -2,962,042 -20,434,241 15,681,012 20,121,692 07-01-2024 23,083,734 1.15 3
In [149]:
MAE_by_horizon = prophet_cv.groupby(['horizon'])['AE'].mean()
MAE_by_horizon = pd.DataFrame(MAE_by_horizon).rename(columns={'AE': 'MAE'})

MAPE_by_horizon = prophet_cv.groupby(['horizon'])['APE'].mean()
MAPE_by_horizon = pd.DataFrame(MAPE_by_horizon).rename(columns={'APE': 'MAPE'})

prophet_horizon_table = MAE_by_horizon.merge(MAPE_by_horizon, on='horizon')
prophet_horizon_table.style.format({'MAE': '{:,.0f}', 'MAPE': '{:.0%}'})
Out[149]:
  MAE MAPE
horizon    
1 18,344,322 69%
2 17,100,378 59%
3 17,223,965 52%
In [150]:
prophet_p = performance_metrics(prophet_cv, rolling_window=1)
prophet_p.style.format({
    'mse': '{:,.0f}',
    'rmse': '{:,.0f}',
    'mae': '{:,.0f}',
    'mape': '{:.0%}',
    'mdape': '{:.0%}',
    'smape': '{:.0%}',
    'coverage': '{:.0%}'
})
Out[150]:
  horizon mse rmse mae mape mdape smape coverage
0 92 days 00:00:00 453,339,460,451,128 21,291,770 17,556,222 60% 45% 50% 61%
In [151]:
prophet_overall_MAE  = prophet_cv['AE'].mean()
prophet_overall_MAPE = prophet_cv['APE'].mean()

print(f"MAE:{prophet_overall_MAE:,.0f}")
print(f"MAPE (%): {prophet_overall_MAPE:.0%}")
MAE:17,556,222
MAPE (%): 60%

Tuning the model:

Grid Search with hyperparameter options:

In [ ]:
param_grid = {
    'changepoint_prior_scale': [0.001, 0.01, 0.02],
    'changepoint_range': [0.8, 0.95],
    'seasonality_prior_scale': [0.1, 1.0, 5.0],
    'seasonality_mode': ['additive', 'multiplicative'],
    'growth': ['linear', 'logistic']
}

all_params = list(itertools.product(*param_grid.values()))
rmses = []

for params in all_params:
    param_dict = dict(zip(param_grid.keys(), params))
    temp_df = df.copy()

    if param_dict['growth'] == 'logistic':
        temp_df['floor'] = 0
        temp_df['cap'] = temp_df['y'].max() * 1.2

    model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
        **param_dict
    )
    model.fit(temp_df)

    df_cv = cross_validation(model, cutoffs=cutoffs, horizon='93 days', disable_tqdm=True)
    df_p = performance_metrics(df_cv, rolling_window=1)
    rmses.append(df_p['rmse'].mean())

tuning_results = pd.DataFrame([dict(zip(param_grid.keys(), p)) for p in all_params])
tuning_results['rmse'] = rmses
print(tuning_results.nsmallest(1, 'rmse'))

Best parameters:

In [153]:
best_params_prep = all_params[np.argmin(rmses)]
best_params = dict(zip(param_grid.keys(), best_params_prep))

print(best_params)
{'changepoint_prior_scale': 0.01, 'changepoint_range': 0.8, 'seasonality_prior_scale': 0.1, 'seasonality_mode': 'multiplicative', 'growth': 'linear'}

Plot the model fit and forecast. I include a floor and cap since the grid search chose multiplicative seasonality, which benefits from guardrails:

In [154]:
cap_value = df['y'].max() * 1.2

df = df.copy()
df['floor'] = 0
df['cap'] = cap_value

m = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
    **best_params
)

m.fit(df)

future = m.make_future_dataframe(periods=3, freq='MS')
future['floor'] = 0
future['cap'] = cap_value

forecast = m.predict(future)

print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])

plt.figure(figsize=(12, 6))

plt.plot(df["ds"], df["y"], label="Actual", color=custom_palette[2])

plt.plot(forecast["ds"], forecast["yhat"], label="Forecast", color=custom_palette[0], linestyle='--')

future_forecast = forecast[forecast['ds'] > df['ds'].max()]
plt.fill_between(
    future_forecast['ds'],
    future_forecast['yhat_lower'],
    future_forecast['yhat_upper'],
    color=custom_palette[0], alpha=0.2, label='80% Confidence Interval'
)

ax = plt.gca()
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('{x:,.0f}'))

plt.legend()
plt.ylabel("Cocoa Bean Imports (lbs)")
plt.xlabel("Date")
plt.title("Prophet (Tuned)")
plt.show()
           ds          yhat    yhat_lower    yhat_upper
0  2020-01-01  9.962309e+07  7.728316e+07  1.192664e+08
1  2020-02-01  9.822106e+07  7.851117e+07  1.183878e+08
2  2020-03-01  1.212633e+08  1.003751e+08  1.409421e+08
3  2020-04-01  1.050092e+08  8.537475e+07  1.263587e+08
4  2020-05-01  8.638201e+07  6.760903e+07  1.061323e+08
..        ...           ...           ...           ...
58 2024-11-01  1.970418e+07  4.289487e+05  4.089143e+07
59 2024-12-01  2.307426e+07  4.615000e+06  4.479443e+07
60 2025-01-01  4.070195e+07  1.941385e+07  5.989525e+07
61 2025-02-01  3.952993e+07  1.908779e+07  5.936743e+07
62 2025-03-01  4.800308e+07  2.820561e+07  6.871246e+07

[63 rows x 4 columns]
No description has been provided for this image
In [155]:
fig = m.plot_components(forecast)

for ax in fig.axes:
    for line in ax.get_lines():
        line.set_color('black')
plt.show()
No description has been provided for this image

Cross validation for predicting rolling 1-3 months out from training data:

In [156]:
period = "1MS"  # Month start
cutoffs = pd.date_range(start="2023-09-01", end="2024-08-01", freq=period) # End is last starting date
horizon = "92 days"  # Forecast horizon for each cutoff
prophet_tuned_cv = cross_validation(m, cutoffs=cutoffs, horizon=horizon, disable_tqdm=True)

prophet_tuned_cv['AE'] = abs(prophet_tuned_cv['yhat'] - prophet_tuned_cv['y'])
prophet_tuned_cv['APE'] = (abs(prophet_tuned_cv['yhat'] - prophet_tuned_cv['y']) / abs(prophet_tuned_cv['y']))

prophet_tuned_cv['ds'] = pd.to_datetime(prophet_tuned_cv['ds'])
prophet_tuned_cv['cutoff'] = pd.to_datetime(prophet_tuned_cv['cutoff'])
prophet_tuned_cv['horizon'] = prophet_tuned_cv['ds'].dt.to_period('M').astype(int) - prophet_tuned_cv['cutoff'].dt.to_period('M').astype(int)
prophet_tuned_cv.style.format({'ds':'{:%m-%d-%Y}','cutoff':'{:%m-%d-%Y}','y':'{:,.0f}','yhat':'{:,.0f}','yhat_lower':'{:,.0f}','yhat_upper':'{:,.0f}','AE': '{:,.0f}', 'APE': '{:.0%}'})
Out[156]:
  ds yhat yhat_lower yhat_upper y cutoff AE APE horizon
0 10-01-2023 33,229,787 11,664,038 55,107,299 15,172,388 09-01-2023 18,057,399 119% 1
1 11-01-2023 29,595,579 7,474,233 52,542,841 11,741,002 09-01-2023 17,854,577 152% 2
2 12-01-2023 33,518,306 11,316,066 56,809,617 17,305,596 09-01-2023 16,212,710 94% 3
3 11-01-2023 29,070,280 7,444,246 51,398,724 11,741,002 10-01-2023 17,329,278 148% 1
4 12-01-2023 32,972,679 11,906,542 55,297,136 17,305,596 10-01-2023 15,667,083 91% 2
5 01-01-2024 55,687,186 33,451,409 79,134,164 64,702,033 10-01-2023 9,014,847 14% 3
6 12-01-2023 32,629,136 10,178,688 54,321,667 17,305,596 11-01-2023 15,323,540 89% 1
7 01-01-2024 54,859,769 32,539,603 77,908,837 64,702,033 11-01-2023 9,842,264 15% 2
8 02-01-2024 53,190,108 30,502,895 74,613,935 65,375,575 11-01-2023 12,185,467 19% 3
9 01-01-2024 53,960,554 31,453,828 76,575,435 64,702,033 12-01-2023 10,741,479 17% 1
10 02-01-2024 52,331,759 27,977,161 72,979,562 65,375,575 12-01-2023 13,043,816 20% 2
11 03-01-2024 65,484,230 44,422,431 87,358,738 62,364,919 12-01-2023 3,119,311 5% 3
12 02-01-2024 53,740,013 31,782,154 75,018,236 65,375,575 01-01-2024 11,635,562 18% 1
13 03-01-2024 67,386,820 44,595,487 89,681,636 62,364,919 01-01-2024 5,021,901 8% 2
14 04-01-2024 59,994,313 39,978,217 81,240,379 34,933,233 01-01-2024 25,061,080 72% 3
15 03-01-2024 69,350,348 49,191,804 89,845,352 62,364,919 02-01-2024 6,985,429 11% 1
16 04-01-2024 61,888,080 40,804,267 85,041,513 34,933,233 02-01-2024 26,954,847 77% 2
17 05-01-2024 49,866,421 28,641,993 71,141,075 35,504,037 02-01-2024 14,362,384 40% 3
18 04-01-2024 60,571,527 38,448,887 82,950,568 34,933,233 03-01-2024 25,638,294 73% 1
19 05-01-2024 48,773,749 26,712,346 69,985,555 35,504,037 03-01-2024 13,269,712 37% 2
20 06-01-2024 45,989,469 24,985,167 66,965,842 26,935,635 03-01-2024 19,053,834 71% 3
21 05-01-2024 46,132,527 23,910,277 68,482,260 35,504,037 04-01-2024 10,628,490 30% 1
22 06-01-2024 43,490,712 21,942,703 65,468,718 26,935,635 04-01-2024 16,555,077 61% 2
23 07-01-2024 28,107,791 4,560,907 48,738,353 19,175,494 04-01-2024 8,932,297 47% 3
24 06-01-2024 42,796,734 20,858,750 63,711,689 26,935,635 05-01-2024 15,861,099 59% 1
25 07-01-2024 27,625,300 5,703,997 49,364,851 19,175,494 05-01-2024 8,449,806 44% 2
26 08-01-2024 26,912,706 6,032,490 46,924,072 21,449,146 05-01-2024 5,463,560 25% 3
27 07-01-2024 27,330,458 6,202,446 49,208,356 19,175,494 06-01-2024 8,154,964 43% 1
28 08-01-2024 26,610,724 5,113,688 47,524,449 21,449,146 06-01-2024 5,161,578 24% 2
29 09-01-2024 30,252,317 8,513,849 52,273,577 23,828,291 06-01-2024 6,424,026 27% 3
30 08-01-2024 25,616,407 4,133,047 46,448,137 21,449,146 07-01-2024 4,167,261 19% 1
31 09-01-2024 28,990,145 7,503,903 50,644,769 23,828,291 07-01-2024 5,161,854 22% 2
32 10-01-2024 22,165,792 2,011,392 43,446,943 20,121,692 07-01-2024 2,044,100 10% 3
33 09-01-2024 28,701,339 10,205,185 48,470,442 23,828,291 08-01-2024 4,873,048 20% 1
34 10-01-2024 21,676,952 1,194,331 40,322,400 20,121,692 08-01-2024 1,555,260 8% 2
35 11-01-2024 18,732,463 -853,543 40,308,048 23,311,088 08-01-2024 4,578,625 20% 3
In [157]:
MAE_by_horizon = prophet_tuned_cv.groupby(['horizon'])['AE'].mean()
MAE_by_horizon = pd.DataFrame(MAE_by_horizon).rename(columns={'AE': 'MAE'})

MAPE_by_horizon = prophet_tuned_cv.groupby(['horizon'])['APE'].mean()
MAPE_by_horizon = pd.DataFrame(MAPE_by_horizon).rename(columns={'APE': 'MAPE'})

prophet_tuned_horizon_table = MAE_by_horizon.merge(MAPE_by_horizon,on='horizon')
prophet_tuned_horizon_table.style.format({'MAE': '{:,.0f}', 'MAPE': '{:.0%}'})
Out[157]:
  MAE MAPE
horizon    
1 12,449,653 54%
2 11,544,815 47%
3 10,537,687 37%
In [158]:
prophet_tuned_p = performance_metrics(prophet_tuned_cv, rolling_window=1)
prophet_tuned_p.style.format({
    'mse': '{:,.0f}',
    'rmse': '{:,.0f}',
    'mae': '{:,.0f}',
    'mape': '{:.0%}',
    'mdape': '{:.0%}',
    'smape': '{:.0%}',
    'coverage': '{:.0%}'
})
Out[158]:
  horizon mse rmse mae mape mdape smape coverage
0 92 days 00:00:00 175,730,234,379,226 13,256,328 11,510,718 46% 28% 34% 92%

Performance improves with tuning, but is still not great:

In [159]:
prophet_tuned_overall_MAE  = prophet_tuned_cv['AE'].mean()
prophet_tuned_overall_MAPE = prophet_tuned_cv['APE'].mean()

print(f"MAE: {prophet_tuned_overall_MAE:,.0f}")
print(f"MAPE (%): {prophet_tuned_overall_MAPE:.0%}")
MAE: 11,510,718
MAPE (%): 46%

Prophet with log transformation:

In [160]:
df_log = df.copy()

df_log['y'] = np.log1p(df_log['y'])
df_log['floor'] = np.log1p(df_log['floor'])
df_log['cap'] = np.log1p(df_log['cap'])

df_log.head()
Out[160]:
ds y floor cap
0 2020-01-01 18.170473 0.0 19.037795
1 2020-02-01 18.594360 0.0 19.037795
2 2020-03-01 18.582752 0.0 19.037795
3 2020-04-01 18.099428 0.0 19.037795
4 2020-05-01 17.879054 0.0 19.037795
In [ ]:
param_grid_log = {
    'changepoint_prior_scale': [0.001, 0.01, 0.02],
    'changepoint_range': [0.8, 0.95],
    'seasonality_prior_scale': [0.1, 1.0, 5.0],
    'seasonality_mode': ['additive', 'multiplicative'],
    'growth': ['linear', 'logistic']
}

all_params_log = list(itertools.product(*param_grid_log.values()))
rmses_log = []

for params in all_params_log:
    param_dict_log = dict(zip(param_grid_log.keys(), params))

    if param_dict_log['growth'] == 'logistic':
        df_log['floor'] = 0
        df_log['cap'] = df_log['y'].max() * 1.2

    model_log = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
        **param_dict_log
    )
    model_log.fit(df_log)

    cv_log = cross_validation(model_log, cutoffs=cutoffs, horizon='93 days', disable_tqdm=True)

    cv_orig = cv_log.copy()
    cv_orig['y_orig'] = np.expm1(cv_log['y'])
    cv_orig['yhat_orig'] = np.expm1(cv_log['yhat'])
    df_p_orig = performance_metrics(cv_orig, rolling_window=1)
    rmses_log.append(df_p_orig['rmse'].mean())

tuning_results_log = pd.DataFrame([dict(zip(param_grid_log.keys(), p)) for p in all_params_log])
tuning_results_log['rmse'] = rmses_log
print(tuning_results_log.nsmallest(1, 'rmse'))
In [163]:
best_params_prep_log = all_params_log[np.argmin(rmses_log)]
best_params_log = dict(zip(param_grid_log.keys(), best_params_prep_log))

print(best_params_log)
{'changepoint_prior_scale': 0.02, 'changepoint_range': 0.95, 'seasonality_prior_scale': 0.1, 'seasonality_mode': 'multiplicative', 'growth': 'logistic'}
In [164]:
cv_orig['AE'] = abs(cv_orig['yhat_orig'] - cv_orig['y_orig'])
cv_orig['APE'] = abs(cv_orig['yhat_orig'] - cv_orig['y_orig']) / abs(cv_orig['y_orig'])
In [165]:
cv_orig.style.format({
    'ds':'{:%m-%d-%Y}',
    'y':'{:,.2f}',
    'yhat':'{:,.2f}',
    'yhat_lower':'{:,.2f}',
    'yhat_upper':'{:,.2f}',
    'y_orig':'{:,.0f}',
    'cutoff':'{:%m-%d-%Y}',
    'yhat_orig':'{:,.0f}',
    'yhat_lower_orig':'{:,.0f}',
    'yhat_upper_orig':'{:,.0f}',
    'AE': '{:,.0f}',
    'APE': '{:.0%}'
})
Out[165]:
  ds yhat yhat_lower yhat_upper y cutoff y_orig yhat_orig AE APE
0 10-01-2023 17.47 17.20 17.71 16.53 09-01-2023 15,172,388 38,590,845 23,418,457 154%
1 11-01-2023 17.49 17.24 17.73 16.28 09-01-2023 11,741,002 39,427,067 27,686,065 236%
2 12-01-2023 16.42 16.16 16.68 16.67 09-01-2023 17,305,596 13,487,256 3,818,340 22%
3 11-01-2023 17.64 17.35 17.92 16.28 10-01-2023 11,741,002 45,870,173 34,129,171 291%
4 12-01-2023 16.51 16.23 16.78 16.67 10-01-2023 17,305,596 14,823,024 2,482,572 14%
5 01-01-2024 17.57 17.26 17.82 17.99 10-01-2023 64,702,033 42,506,011 22,196,022 34%
6 12-01-2023 16.81 16.51 17.15 16.67 11-01-2023 17,305,596 19,974,901 2,669,305 15%
7 01-01-2024 17.57 17.24 17.89 17.99 11-01-2023 64,702,033 42,553,675 22,148,358 34%
8 02-01-2024 17.75 17.43 18.06 18.00 11-01-2023 65,375,575 51,167,470 14,208,105 22%
9 01-01-2024 17.58 17.25 17.90 17.99 12-01-2023 64,702,033 43,184,329 21,517,704 33%
10 02-01-2024 17.76 17.45 18.07 18.00 12-01-2023 65,375,575 51,784,211 13,591,364 21%
11 03-01-2024 18.06 17.74 18.35 17.95 12-01-2023 62,364,919 69,550,697 7,185,778 12%
12 02-01-2024 17.82 17.49 18.14 18.00 01-01-2024 65,375,575 54,631,649 10,743,926 16%
13 03-01-2024 18.14 17.82 18.43 17.95 01-01-2024 62,364,919 75,588,776 13,223,857 21%
14 04-01-2024 17.74 17.43 18.05 17.37 01-01-2024 34,933,233 50,672,081 15,738,848 45%
15 03-01-2024 18.16 17.84 18.49 17.95 02-01-2024 62,364,919 77,281,349 14,916,430 24%
16 04-01-2024 17.76 17.44 18.07 17.37 02-01-2024 34,933,233 51,536,485 16,603,252 48%
17 05-01-2024 17.61 17.30 17.91 17.39 02-01-2024 35,504,037 44,366,014 8,861,977 25%
18 04-01-2024 17.76 17.44 18.08 17.37 03-01-2024 34,933,233 51,504,978 16,571,745 47%
19 05-01-2024 17.60 17.27 17.90 17.39 03-01-2024 35,504,037 44,167,722 8,663,685 24%
20 06-01-2024 17.60 17.28 17.95 17.11 03-01-2024 26,935,635 44,130,722 17,195,087 64%
21 05-01-2024 17.59 17.26 17.89 17.39 04-01-2024 35,504,037 43,407,845 7,903,808 22%
22 06-01-2024 17.58 17.26 17.90 17.11 04-01-2024 26,935,635 43,166,578 16,230,943 60%
23 07-01-2024 17.07 16.74 17.37 16.77 04-01-2024 19,175,494 25,913,220 6,737,726 35%
24 06-01-2024 17.58 17.27 17.89 17.11 05-01-2024 26,935,635 43,273,502 16,337,867 61%
25 07-01-2024 17.07 16.77 17.39 16.77 05-01-2024 19,175,494 25,927,086 6,751,592 35%
26 08-01-2024 16.84 16.51 17.13 16.88 05-01-2024 21,449,146 20,524,610 924,536 4%
27 07-01-2024 17.07 16.75 17.39 16.77 06-01-2024 19,175,494 25,962,683 6,787,189 35%
28 08-01-2024 16.83 16.51 17.14 16.88 06-01-2024 21,449,146 20,406,110 1,043,036 5%
29 09-01-2024 17.04 16.72 17.37 16.99 06-01-2024 23,828,291 25,145,438 1,317,147 6%
30 08-01-2024 16.83 16.51 17.13 16.88 07-01-2024 21,449,146 20,314,748 1,134,398 5%
31 09-01-2024 17.03 16.69 17.35 16.99 07-01-2024 23,828,291 24,945,848 1,117,557 5%
32 10-01-2024 16.48 16.18 16.81 16.82 07-01-2024 20,121,692 14,391,758 5,729,934 28%
33 09-01-2024 17.04 16.72 17.36 16.99 08-01-2024 23,828,291 25,143,218 1,314,927 6%
34 10-01-2024 16.49 16.17 16.81 16.82 08-01-2024 20,121,692 14,497,193 5,624,499 28%
35 11-01-2024 16.29 15.96 16.61 16.96 08-01-2024 23,311,088 11,830,985 11,480,103 49%
In [166]:
m_log = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
    **best_params_log
)

m_log.fit(df_log)

future_log = m_log.make_future_dataframe(periods=3, freq='MS')
future_log['floor'] = np.log1p(0)
future_log['cap'] = np.log1p(cap_value)

forecast_log = m_log.predict(future_log)

forecast_log['yhat_orig'] = np.expm1(forecast_log['yhat'])
forecast_log['yhat_lower_orig'] = np.expm1(forecast_log['yhat_lower'])
forecast_log['yhat_upper_orig'] = np.expm1(forecast_log['yhat_upper'])

plt.figure(figsize=(12, 6))

plt.plot(df['ds'], df['y'], label="Actual", color=custom_palette[2])

plt.plot(forecast_log['ds'], forecast_log['yhat_orig'], label="Forecast", color=custom_palette[0], linestyle='--')

future_forecast_log = forecast_log[forecast_log['ds'] > df_log['ds'].max()]
plt.fill_between(
    future_forecast_log['ds'],
    future_forecast_log['yhat_lower_orig'],
    future_forecast_log['yhat_upper_orig'],
    color=custom_palette[0],
    alpha=0.2,
    label='80% Confidence Interval'
)

ax = plt.gca()
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('{x:,.0f}'))

plt.legend()
plt.ylabel("Cocoa Bean Imports (lbs)")
plt.xlabel("Date")
plt.title("Prophet (Tuned - Log Data)")
plt.show()
No description has been provided for this image
In [167]:
fig = m_log.plot_components(forecast_log)

for ax in fig.axes:
    for line in ax.get_lines():
        line.set_color('black')
plt.show()
No description has been provided for this image
In [168]:
prophet_tuned_log_cv= cv_orig

prophet_tuned_log_cv['AE'] = abs(prophet_tuned_log_cv['yhat_orig'] - prophet_tuned_log_cv['y_orig'])
prophet_tuned_log_cv['APE'] = (abs(prophet_tuned_log_cv['yhat_orig'] - prophet_tuned_log_cv['y_orig']) / abs(prophet_tuned_log_cv['y_orig']))

prophet_tuned_log_cv['ds'] = pd.to_datetime(prophet_tuned_log_cv['ds'])
prophet_tuned_log_cv['cutoff'] = pd.to_datetime(prophet_tuned_log_cv['cutoff'])
prophet_tuned_log_cv['horizon'] = prophet_tuned_log_cv['ds'].dt.to_period('M').astype(int) - prophet_tuned_log_cv['cutoff'].dt.to_period('M').astype(int)
prophet_tuned_log_cv.style.format({'ds':'{:%m-%d-%Y}','cutoff':'{:%m-%d-%Y}','y':'{:,.2f}','y_orig':'{:,.0f}','yhat':'{:,.2f}','yhat_lower':'{:,.2f}',
                                   'yhat_upper':'{:,.2f}','yhat_orig':'{:,.0f}','yhat_lower_orig':'{:,.0f}',
                                   'yhat_upper_orig':'{:,.0f}','AE': '{:,.0f}', 'APE': '{:.0%}'})
Out[168]:
  ds yhat yhat_lower yhat_upper y cutoff y_orig yhat_orig AE APE horizon
0 10-01-2023 17.47 17.20 17.71 16.53 09-01-2023 15,172,388 38,590,845 23,418,457 154% 1
1 11-01-2023 17.49 17.24 17.73 16.28 09-01-2023 11,741,002 39,427,067 27,686,065 236% 2
2 12-01-2023 16.42 16.16 16.68 16.67 09-01-2023 17,305,596 13,487,256 3,818,340 22% 3
3 11-01-2023 17.64 17.35 17.92 16.28 10-01-2023 11,741,002 45,870,173 34,129,171 291% 1
4 12-01-2023 16.51 16.23 16.78 16.67 10-01-2023 17,305,596 14,823,024 2,482,572 14% 2
5 01-01-2024 17.57 17.26 17.82 17.99 10-01-2023 64,702,033 42,506,011 22,196,022 34% 3
6 12-01-2023 16.81 16.51 17.15 16.67 11-01-2023 17,305,596 19,974,901 2,669,305 15% 1
7 01-01-2024 17.57 17.24 17.89 17.99 11-01-2023 64,702,033 42,553,675 22,148,358 34% 2
8 02-01-2024 17.75 17.43 18.06 18.00 11-01-2023 65,375,575 51,167,470 14,208,105 22% 3
9 01-01-2024 17.58 17.25 17.90 17.99 12-01-2023 64,702,033 43,184,329 21,517,704 33% 1
10 02-01-2024 17.76 17.45 18.07 18.00 12-01-2023 65,375,575 51,784,211 13,591,364 21% 2
11 03-01-2024 18.06 17.74 18.35 17.95 12-01-2023 62,364,919 69,550,697 7,185,778 12% 3
12 02-01-2024 17.82 17.49 18.14 18.00 01-01-2024 65,375,575 54,631,649 10,743,926 16% 1
13 03-01-2024 18.14 17.82 18.43 17.95 01-01-2024 62,364,919 75,588,776 13,223,857 21% 2
14 04-01-2024 17.74 17.43 18.05 17.37 01-01-2024 34,933,233 50,672,081 15,738,848 45% 3
15 03-01-2024 18.16 17.84 18.49 17.95 02-01-2024 62,364,919 77,281,349 14,916,430 24% 1
16 04-01-2024 17.76 17.44 18.07 17.37 02-01-2024 34,933,233 51,536,485 16,603,252 48% 2
17 05-01-2024 17.61 17.30 17.91 17.39 02-01-2024 35,504,037 44,366,014 8,861,977 25% 3
18 04-01-2024 17.76 17.44 18.08 17.37 03-01-2024 34,933,233 51,504,978 16,571,745 47% 1
19 05-01-2024 17.60 17.27 17.90 17.39 03-01-2024 35,504,037 44,167,722 8,663,685 24% 2
20 06-01-2024 17.60 17.28 17.95 17.11 03-01-2024 26,935,635 44,130,722 17,195,087 64% 3
21 05-01-2024 17.59 17.26 17.89 17.39 04-01-2024 35,504,037 43,407,845 7,903,808 22% 1
22 06-01-2024 17.58 17.26 17.90 17.11 04-01-2024 26,935,635 43,166,578 16,230,943 60% 2
23 07-01-2024 17.07 16.74 17.37 16.77 04-01-2024 19,175,494 25,913,220 6,737,726 35% 3
24 06-01-2024 17.58 17.27 17.89 17.11 05-01-2024 26,935,635 43,273,502 16,337,867 61% 1
25 07-01-2024 17.07 16.77 17.39 16.77 05-01-2024 19,175,494 25,927,086 6,751,592 35% 2
26 08-01-2024 16.84 16.51 17.13 16.88 05-01-2024 21,449,146 20,524,610 924,536 4% 3
27 07-01-2024 17.07 16.75 17.39 16.77 06-01-2024 19,175,494 25,962,683 6,787,189 35% 1
28 08-01-2024 16.83 16.51 17.14 16.88 06-01-2024 21,449,146 20,406,110 1,043,036 5% 2
29 09-01-2024 17.04 16.72 17.37 16.99 06-01-2024 23,828,291 25,145,438 1,317,147 6% 3
30 08-01-2024 16.83 16.51 17.13 16.88 07-01-2024 21,449,146 20,314,748 1,134,398 5% 1
31 09-01-2024 17.03 16.69 17.35 16.99 07-01-2024 23,828,291 24,945,848 1,117,557 5% 2
32 10-01-2024 16.48 16.18 16.81 16.82 07-01-2024 20,121,692 14,391,758 5,729,934 28% 3
33 09-01-2024 17.04 16.72 17.36 16.99 08-01-2024 23,828,291 25,143,218 1,314,927 6% 1
34 10-01-2024 16.49 16.17 16.81 16.82 08-01-2024 20,121,692 14,497,193 5,624,499 28% 2
35 11-01-2024 16.29 15.96 16.61 16.96 08-01-2024 23,311,088 11,830,985 11,480,103 49% 3
In [169]:
MAE_by_horizon_log = prophet_tuned_log_cv.groupby(['horizon'])['AE'].mean()
MAE_by_horizon_log = pd.DataFrame(MAE_by_horizon_log).rename(columns={'AE': 'MAE'})

MAPE_by_horizon_log = prophet_tuned_log_cv.groupby(['horizon'])['APE'].mean()
MAPE_by_horizon_log = pd.DataFrame(MAPE_by_horizon_log).rename(columns={'APE': 'MAPE'})

prophet_tuned_log_horizon_table = MAE_by_horizon_log.merge(MAPE_by_horizon_log,on='horizon')
prophet_tuned_log_horizon_table.style.format({'MAE': '{:,.0f}', 'MAPE': '{:.0%}'})
Out[169]:
  MAE MAPE
horizon    
1 13,120,411 59%
2 11,263,898 44%
3 9,616,134 29%

The log transformation did not meaningfully improve the performance:

In [170]:
prophet_tuned_log_overall_MAE  = prophet_tuned_log_cv['AE'].mean()
prophet_tuned_log_overall_MAPE = prophet_tuned_log_cv['APE'].mean()

print(f"MAE: {prophet_tuned_log_overall_MAE:,.0f}")
print(f"MAPE (%): {prophet_tuned_log_overall_MAPE:.0%}")
MAE: 11,333,481
MAPE (%): 44%

Alt text

Overall Model Metrics:

Note: this only includes the tuned models from above:

In [171]:
final_table = pd.DataFrame({
    'Model': [
        'ExponentialSmoothing','ExponentialSmoothing',
        'Sarima','Sarima','Sarima','Sarima','Sarima','Sarima',
        'Prophet','Prophet'
    ],
    'Log Data': ['N','Y','N','N','N','Y','Y','Y','N','Y'],
    'Tuning': [
        'Manual','Manual',
        'Short Grid Search','Grid Search','Manual',
        'Short Grid Search','Grid Search','Manual',
        'Grid Search','Grid Search'
    ],

    'MAE': [
        exp_ts_tuned_overall_MAE,
        exp_ts_tuned_log_overall_MAE,
        sarima_auto_overall_MAE,
        sarima_flex_full_overall_MAE,
        sarima_manual_overall_MAE,
        sarima_log_auto_overall_MAE,
        sarima_log_full_flex_overall_MAE,
        sarima_log_manual_overall_MAE,
        prophet_tuned_overall_MAE,
        prophet_tuned_log_overall_MAE
    ],

    'MAPE': [
        exp_ts_tuned_overall_MAPE,
        exp_ts_tuned_log_overall_MAPE,
        sarima_auto_overall_MAPE,
        sarima_flex_full_overall_MAPE,
        sarima_manual_overall_MAPE,
        sarima_log_auto_overall_MAPE,
        sarima_log_full_flex_overall_MAPE,
        sarima_log_manual_overall_MAPE,
        prophet_tuned_overall_MAPE,
        prophet_tuned_log_overall_MAPE
    ],

    'MAE_H1': [
        exp_ts_tuned_cv['MAE'][1],
        exp_ts_tuned_log_cv['MAE'][1],
        sarima_auto_cv['MAE'][1],
        sarima_flex_full_cv['MAE'][1],
        sarima_manual_cv['MAE'][1],
        sarima_log_auto_cv['MAE'][1],
        sarima_log_full_flex_cv['MAE'][1],
        sarima_log_manual_cv['MAE'][1],
        prophet_tuned_horizon_table['MAE'][1],
        prophet_tuned_log_horizon_table['MAE'][1]
    ],

    'MAE_H2': [
        exp_ts_tuned_cv['MAE'][2],
        exp_ts_tuned_log_cv['MAE'][2],
        sarima_auto_cv['MAE'][2],
        sarima_flex_full_cv['MAE'][2],
        sarima_manual_cv['MAE'][2],
        sarima_log_auto_cv['MAE'][2],
        sarima_log_full_flex_cv['MAE'][2],
        sarima_log_manual_cv['MAE'][2],
        prophet_tuned_horizon_table['MAE'][2],
        prophet_tuned_log_horizon_table['MAE'][2]
    ],

    'MAE_H3': [
        exp_ts_tuned_cv['MAE'][3],
        exp_ts_tuned_log_cv['MAE'][3],
        sarima_auto_cv['MAE'][3],
        sarima_flex_full_cv['MAE'][3],
        sarima_manual_cv['MAE'][3],
        sarima_log_auto_cv['MAE'][3],
        sarima_log_full_flex_cv['MAE'][3],
        sarima_log_manual_cv['MAE'][3],
        prophet_tuned_horizon_table['MAE'][3],
        prophet_tuned_log_horizon_table['MAE'][3]
    ],

    'MAPE_H1': [
        exp_ts_tuned_cv['MAPE'][1],
        exp_ts_tuned_log_cv['MAPE'][1],
        sarima_auto_cv['MAPE'][1],
        sarima_flex_full_cv['MAPE'][1],
        sarima_manual_cv['MAPE'][1],
        sarima_log_auto_cv['MAPE'][1],
        sarima_log_full_flex_cv['MAPE'][1],
        sarima_log_manual_cv['MAPE'][1],
        prophet_tuned_horizon_table['MAPE'][1],
        prophet_tuned_log_horizon_table['MAPE'][1]
    ],

    'MAPE_H2': [
        exp_ts_tuned_cv['MAPE'][2],
        exp_ts_tuned_log_cv['MAPE'][2],
        sarima_auto_cv['MAPE'][2],
        sarima_flex_full_cv['MAPE'][2],
        sarima_manual_cv['MAPE'][2],
        sarima_log_auto_cv['MAPE'][2],
        sarima_log_full_flex_cv['MAPE'][2],
        sarima_log_manual_cv['MAPE'][2],
        prophet_tuned_horizon_table['MAPE'][2],
        prophet_tuned_log_horizon_table['MAPE'][2]
    ],

    'MAPE_H3': [
        exp_ts_tuned_cv['MAPE'][3],
        exp_ts_tuned_log_cv['MAPE'][3],
        sarima_auto_cv['MAPE'][3],
        sarima_flex_full_cv['MAPE'][3],
        sarima_manual_cv['MAPE'][3],
        sarima_log_auto_cv['MAPE'][3],
        sarima_log_full_flex_cv['MAPE'][3],
        sarima_log_manual_cv['MAPE'][3],
        prophet_tuned_horizon_table['MAPE'][3],
        prophet_tuned_log_horizon_table['MAPE'][3]
    ]
})
In [172]:
final_table.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}',
    'MAE_H1': '{:,.0f}',
    'MAE_H2': '{:,.0f}',
    'MAE_H3': '{:,.0f}',
    'MAPE_H1': '{:.0%}',
    'MAPE_H2': '{:.0%}',
    'MAPE_H3': '{:.0%}'
})
Out[172]:
  Model Log Data Tuning MAE MAPE MAE_H1 MAE_H2 MAE_H3 MAPE_H1 MAPE_H2 MAPE_H3
0 ExponentialSmoothing N Manual 9,559,131 27% 8,232,452 9,662,806 10,782,135 24% 29% 30%
1 ExponentialSmoothing Y Manual 10,529,595 29% 8,454,615 10,832,569 12,301,600 24% 30% 33%
2 Sarima N Short Grid Search 22,830,464 92% 17,133,558 23,203,094 28,154,741 69% 97% 110%
3 Sarima N Grid Search 9,843,491 33% 8,015,458 10,640,699 10,874,315 29% 37% 33%
4 Sarima N Manual 22,289,579 92% 16,481,605 23,139,574 27,247,558 72% 99% 105%
5 Sarima Y Short Grid Search 9,876,261 27% 8,184,446 9,970,251 11,474,088 22% 27% 31%
6 Sarima Y Grid Search 9,324,128 26% 7,426,062 9,404,383 11,141,939 21% 27% 30%
7 Sarima Y Manual 10,270,420 29% 8,355,908 10,535,363 11,919,988 24% 30% 33%
8 Prophet N Grid Search 11,510,718 46% 12,449,653 11,544,815 10,537,687 54% 47% 37%
9 Prophet Y Grid Search 11,333,481 44% 13,120,411 11,263,898 9,616,134 59% 44% 29%

Results¶

Alt text

The tuned Triple Exponential Smoothing model on the non-log-transformed data performed the best, with a 27% overall MAPE and consistently low MAE across horizons 1, 2, and 3 (per the table above, which only includes the tuned models). I looked at MAE and MAPE together when evaluating. Prophet surprisingly performed poorly. SARIMA's best configuration was quite comparable, but SARIMA results varied widely depending on the adjustments.

In [173]:
final_model=final_table.iloc[[0]]
final_model.style.format({
    'MAE': '{:,.0f}',
    'MAPE': '{:.0%}',
    'MAE_H1': '{:,.0f}',
    'MAE_H2': '{:,.0f}',
    'MAE_H3': '{:,.0f}',
    'MAPE_H1': '{:.0%}',
    'MAPE_H2': '{:.0%}',
    'MAPE_H3': '{:.0%}'
})
Out[173]:
  Model Log Data Tuning MAE MAPE MAE_H1 MAE_H2 MAE_H3 MAPE_H1 MAPE_H2 MAPE_H3
0 ExponentialSmoothing N Manual 9,559,131 27% 8,232,452 9,662,806 10,782,135 24% 29% 30%
In [174]:
display(Image(url="https://raw.githubusercontent.com/lindsayalexandra14/ds_portfolio/main/1_projects/machine_learning/time_series_and_survival/cocoa_imports/cocoa_imports_final_model_plot.png",
    width=800))
No description has been provided for this image

The best predictions of US Cocoa Bean Imports for the 1st 3 months of 2025 are:

Date Forecast Lower_CI Upper_CI
2025-01-01 59,915,879 31,650,287 67,112,957
2025-02-01 59,477,191 30,411,927 65,506,481
2025-03-01 75,993,848 38,341,516 84,996,332
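
For reference, a forecast table like this can be produced from an ETS fit with statsmodels. The sketch below is illustrative only: the smoothing configuration shown is a placeholder rather than the tuned configuration from the Exponential Smoothing section, and the 80% interval width is an assumption made to match the plots.

from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# Placeholder ETS configuration for illustration; the tuned model above used its own settings
ets_sketch = ETSModel(
    data['Weight'],
    error='add',
    trend='add',
    seasonal='add',
    seasonal_periods=12
).fit(disp=False)

# 3-month out-of-sample forecast with an 80% prediction interval
pred = ets_sketch.get_prediction(start=len(data), end=len(data) + 2)
print(pred.summary_frame(alpha=0.2)[['mean', 'pi_lower', 'pi_upper']])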

Future Improvements:

Additions to this project could include extra regressors, such as the global market price or supply of cocoa beans (as sketched below), and/or forecasting by country or continent. Removing outliers by country first may also improve performance; however, recent anomalies driven by supply or price may persist going forward, so removing them could be detrimental.
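
As one concrete direction, Prophet supports extra regressors via add_regressor. The sketch below is purely hypothetical: the cocoa_price series is made-up placeholder data, and the example only illustrates the mechanics of supplying a regressor for both the history and the future months.

import numpy as np
import pandas as pd
from prophet import Prophet

# Hypothetical price series aligned to the monthly history plus 3 future months (values are made up)
rng = np.random.default_rng(0)
price = pd.Series(2500 + rng.normal(0, 200, len(df) + 3))

df_reg = df[['ds', 'y']].copy()
df_reg['cocoa_price'] = price.iloc[:len(df)].values

m_reg = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
m_reg.add_regressor('cocoa_price')
m_reg.fit(df_reg)

future_reg = m_reg.make_future_dataframe(periods=3, freq='MS')
future_reg['cocoa_price'] = price.values   # the regressor must also be supplied for the forecast months
forecast_reg = m_reg.predict(future_reg)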