Model Development Strategies for Time Series Forecasting
In this presentation, Jun Kim talks about leading a team of data scientists at American Express to create time series models that predict future financial trends. He focuses more on the practical application of model development than on model theory.
Time Series Background
A time series is a sequence of data points spaced at equal intervals and ordered chronologically. Just as with other machine learning models, the original dataset is broken into a training dataset and a testing dataset so that the margin of error in a model's predictions can be measured. One distinct difference is that the split is not randomized: the train-test split occurs at a specific date, with earlier observations used for training and later ones held out for testing.
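As a minimal sketch of a date-based split, the example below uses only the Python standard library. The function names, the toy daily series, and the cutoff date are hypothetical and chosen for illustration; they are not taken from the presentation. A naive "repeat the last training value" forecast stands in for a real model so that the error measurement can be shown end to end.

```python
from datetime import date, timedelta


def chronological_split(series, cutoff):
    """Split (date, value) pairs at a cutoff date instead of at random.

    Points strictly before the cutoff go to train; the rest go to test.
    """
    ordered = sorted(series, key=lambda point: point[0])
    train = [p for p in ordered if p[0] < cutoff]
    test = [p for p in ordered if p[0] >= cutoff]
    return train, test


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily metric: 10 consecutive days starting 2023-01-01.
start = date(2023, 1, 1)
series = [(start + timedelta(days=i), 100 + 2 * i) for i in range(10)]

train, test = chronological_split(series, cutoff=date(2023, 1, 8))

# Naive baseline (an assumption for illustration): repeat the last
# training value for every test date, then measure the error.
last_train_value = train[-1][1]
test_values = [value for _, value in test]
mae = mean_absolute_error(test_values, [last_train_value] * len(test))
```

Because every training date precedes every test date, the error measured on the test set reflects how the model would have performed on data it could not have seen at training time.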
“Time series allow you to communicate complex scenarios to senior leadership who do not know anything about machine learning or the statistics behind a model.”
Jun speaks specifically about how to implement time series in the financial realm as it relates to client services. As he describes, this can combine metrics that are reported at the daily level by different products and different clients. Normally, this would quickly generate an overwhelming amount of data in which it is difficult to isolate trends for analysis and prediction. Time series help organize the data so that forecasting models can accurately predict future trends, and they make it possible to create visualizations that are easily understood.
Time series can be thought of as operating on a typical X and Y axis, with the X axis representing the independent variable (“the cause”) and the Y axis representing the dependent variable (“the effect”). The complexity behind time series stems from the fact that you are trying to predict “the effects” when you do not necessarily know how “the causes” are defined. Working with future data in this way can quickly lead to large margins of error if constraints are not correctly applied to the model features and variables.
One example of how time series predictions can differ is the contrast between a Multiple Forecasting model (figure A below) and a Lagged Variable model (figure B below). Proper time series use a lagged variable approach, where a regression equation predicts the dependent variable from historical and current values. This allows variables to capture more recent information, yearly seasonality, and averages over different time periods. Multiple Forecasting, on the other hand, is an aggregation of predictions based on other predictions. This can quickly spiral into a large margin of error, because each deviation from the truth compounds with every iterative prediction. This is a good, simple example of how aggregating all variables to generate a single prediction is more logical and accurate than chaining a series of smaller predictions.
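The lagged variable idea can be sketched as a small feature-building step. The helper below, `make_lagged_rows`, is a hypothetical name, and the toy values, lag choices, and window size are assumptions for illustration rather than the team's actual features. Each row pairs a target with features drawn only from earlier observations: specific past values (lags) and a trailing average. For daily data, a yearly seasonal lag would be 365; a lag of 7 is used here only to keep the example short.

```python
def make_lagged_rows(values, lags, window):
    """Build (features, target) rows using only observations before time t.

    lags   -- how many steps back to look for each lag feature
    window -- length of the trailing window used for the moving average
    """
    rows = []
    # Skip the first points, which lack enough history for every feature.
    first_usable = max(max(lags), window)
    for t in range(first_usable, len(values)):
        features = {f"lag_{k}": values[t - k] for k in lags}
        features[f"mean_{window}"] = sum(values[t - window:t]) / window
        rows.append((features, values[t]))
    return rows


# Hypothetical observed series (actual values, not predictions).
values = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]

rows = make_lagged_rows(values, lags=(1, 7), window=3)
first_features, first_target = rows[0]
```

Every feature here is an observed value, so a regression fitted on these rows never consumes its own earlier predictions, which is the property that separates this approach from chaining forecasts together.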
Jun speaks to two different time series models that he and his team tested. The first, the SARIMAX model, was used for its ability to capture trends, seasonality, and exogenous variables. Its benefits are the flexibility to operate on data at different levels of granularity (daily, weekly, monthly) and the ability to feature engineer the data more meticulously prior to model runs. The second is Prophet, a time series model developed by Facebook and open source since 2017. This model offers daily forecasting that can easily take holiday effects and seasonality into account while remaining very fast and accurate. However, if there are too many segments to process, or if there is no clear trend or seasonality, this model may not perform well.
Perhaps the biggest benefit of time series, and one that Jun heavily emphasizes, is the ability to visualize the results easily. Both Power BI and Tableau are powerful tools that can take the results of time series predictions and visualize them in a manner that the layman can understand. This allows complicated scenarios to be shown to upper management without the need for an in-depth, behind-the-scenes explanation. Instead, the focus remains on the results and on what actions should be taken from them. Additionally, the tools mentioned are interactive. Questions can, and will, arise when presenting results, and because the visualizations are dynamic, the user can target specific sections or features to draw out hidden truths within the results.
While the tools mentioned are incredibly powerful, it is the structure and build of a time series that makes such manipulation possible. Succinct visualizations and the ability to condense complex topics allow Jun and his team at American Express to continue to use time series to their full potential.