Chapter 10
Preparing Historical Market Data
Imagine you’re offered a time machine — one that lets you revisit the entire history of the stock market, day by day. You can watch every price swing, every recovery, every crash. There’s just one catch: you can’t change the past. All you can do is observe, take notes, and try to learn something that might help you in the future.
That’s exactly what historical market data is.
It’s not a prediction engine, nor a crystal ball. But it’s the next best thing — a complete record of human behavior under greed, fear, euphoria, and panic. If you want to build a trading strategy grounded in reality, this is where you start.
In this chapter, we’ll talk about how we gather that data, what it looks like, how to prepare it for backtesting, and why small details like “close” versus “adjusted close” matter more than they seem. Once we have clean, usable data, we’ll be ready to simulate our first strategy: Buy and Hold.
10.1 Getting the Data
The historical data used throughout this book comes from a public financial dataset of daily trading activity for SPY, from January 29, 1993 through today. The precise origin of the dataset is not essential, but the data should be:
- Broadly available or reproducible
- Accurate and free from major gaps
- Cleaned and aligned to trading days
In practical applications, historical data often comes from financial APIs, downloadable CSV files, or public data repositories. In this book, we’ll assume you already have access to a clean, well-structured dataset — and focus instead on how to interpret it and get it ready for backtesting.
10.2 Understanding the Dataset Structure
Each row in the dataset represents a single trading day and includes key market data fields:
- date — The trading date in YYYY-MM-DD format
- open — The first traded price of the day
- high — The highest price reached during the trading session
- low — The lowest price reached during the session
- close — The final traded price of the day
- adjclose — The closing price adjusted for dividends and stock splits
- volume — The total number of shares traded that day
Here’s an example of what a data row might look like:
date       | open   | high   | low    | close  | adjclose | volume
2025-04-25 | 546.65 | 549.20 | 543.70 | 549.04 | 549.04   | 25,381,939
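A row like the one above could be loaded with pandas roughly as follows. This is a minimal sketch rather than the loader used for the book: the inline CSV text, the comma-delimited layout, and the load_prices helper are all assumptions made for illustration, and volume is written without thousands separators.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file.
# Only the single row below mirrors the example row in the text.
CSV_TEXT = """date,open,high,low,close,adjclose,volume
2025-04-25,546.65,549.20,543.70,549.04,549.04,25381939
"""


def load_prices(source) -> pd.DataFrame:
    """Read daily bars into a DataFrame indexed by trading date (hypothetical helper)."""
    df = pd.read_csv(source, parse_dates=["date"])
    # Index by date and sort so that later rolling calculations run in time order.
    return df.set_index("date").sort_index()


prices = load_prices(io.StringIO(CSV_TEXT))
print(prices[["close", "adjclose", "volume"]])
```

In practice, source would be a file path instead of an in-memory string; pd.read_csv accepts either.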
Not every field will be used in every strategy. In this book, most of our analysis and strategy logic will rely on just three columns:
- date
- close
- adjclose
The choice between close and adjclose depends on what you’re trying to simulate, as discussed below.
10.3 When to Use Close vs. Adjusted Close
The close price represents the actual market closing price on a given trading day. The adjclose (adjusted close) price reflects adjustments for dividends and stock splits, and is intended to preserve continuity in long-term price series.
This distinction is critical in backtesting:
- Use adjusted close when testing strategies that involve buying and holding shares of SPY. Adjusted close folds dividends and stock splits into the price series, so returns computed from it approximate what a shareholder who reinvested dividends would have earned.
- Use close when testing strategies involving call options or other derivatives. Options do not pay dividends, and their prices already incorporate dividend expectations. For this reason, all options-based strategies in this book use the raw close price.
- Use close for calculating indicators like moving averages (SMA/EMA). Indicators reflect price momentum and patterns observed by market participants — and nearly all technical traders use unadjusted prices for this purpose.
Even if you are simulating a dividend-reinvesting stock strategy, your moving averages and other technical indicators should be based on close, not adjclose. This aligns with real market behavior: traders and institutions make decisions based on the price that actually traded, not its adjusted equivalent.
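The difference between the two columns can be sketched with a few rows of data. The prices below are made up for illustration and are not real SPY quotes, the total_return helper is hypothetical, and the 3-day window is used only so the tiny sample produces a value.

```python
import pandas as pd

# Made-up prices for illustration only (not real SPY data).
# adjclose sits below close on earlier dates, as it would after a dividend.
prices = pd.DataFrame(
    {
        "close": [100.0, 102.0, 101.0, 104.0],
        "adjclose": [99.5, 101.5, 100.5, 104.0],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)


def total_return(series: pd.Series) -> float:
    """Fractional change from the first to the last value of a price series."""
    return series.iloc[-1] / series.iloc[0] - 1.0


# Shareholding strategies: measure performance on adjclose (dividends included).
holding_return = total_return(prices["adjclose"])

# Price-only change, as seen on a chart of raw closing prices.
price_return = total_return(prices["close"])

# Indicators: computed on close, following the rule stated above.
sma = prices["close"].rolling(window=3).mean()

print(f"price return:   {price_return:.4f}")
print(f"holding return: {holding_return:.4f}")
```

With these sample numbers the adjclose return comes out slightly higher than the close return, which is the dividend showing up in the result.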
10.4 Why Daily Data?
You may wonder: why use daily data rather than weekly or minute-by-minute data?
The answer lies in scope and relevance. Most long-term trading strategies, especially those involving moving averages or investment-based decisions, do not depend on short-term price noise. Daily data:
- Balances detail and clarity
- Keeps simulations fast and interpretable
- Matches the frequency of most long-term strategy signals
If your goal is to capture large trends and avoid major drawdowns — rather than scalp minute-to-minute moves — daily data is the right tool.
10.5 Preprocessing the Data
Before we can run any strategy, we need to prepare the dataset. This includes two key steps:
1. Data Validation and Cleaning
- Ensure that rows are in ascending date order, with no duplicate dates and no missing trading days
- Verify column names and data formats are consistent (e.g., date formats, numeric fields)
- Handle missing or anomalous values (e.g., zero volume, NaNs in price fields)
SPY is a highly liquid and mature asset, so these issues are rare — but it’s still good practice to validate.
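The checks listed above might be sketched as follows. The validate_prices helper, its messages, and the max_gap_days threshold are assumptions made for illustration. The gap test is only a heuristic, since confirming that no trading day is missing would require an exchange calendar.

```python
import pandas as pd

PRICE_COLUMNS = ["open", "high", "low", "close", "adjclose"]


def validate_prices(df: pd.DataFrame, max_gap_days: int = 5) -> list:
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper. Expects a DataFrame indexed by trading date.
    """
    problems = []

    missing = [c for c in PRICE_COLUMNS + ["volume"] if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    if not df.index.is_monotonic_increasing:
        problems.append("dates are not in ascending order")
    if df.index.has_duplicates:
        problems.append("duplicate dates")

    # Heuristic: flag calendar gaps longer than max_gap_days between rows.
    gap_days = df.index.to_series().diff().dt.days
    if (gap_days > max_gap_days).any():
        problems.append("suspiciously long gap between rows")

    if df[PRICE_COLUMNS].isna().any().any():
        problems.append("NaN in price fields")
    if (df[PRICE_COLUMNS] <= 0).any().any():
        problems.append("non-positive price")
    if (df["volume"] == 0).any():
        problems.append("zero volume")
    if (df["high"] < df["low"]).any():
        problems.append("high below low")

    return problems


# Small made-up sample containing two deliberate defects.
sample = pd.DataFrame(
    {
        "open": [10.0, 10.5, 10.2],
        "high": [10.8, 10.9, 10.6],
        "low": [9.9, 10.1, 10.0],
        "close": [10.5, float("nan"), 10.4],  # NaN planted on purpose
        "adjclose": [10.4, 10.3, 10.4],
        "volume": [1000, 1200, 0],  # zero volume planted on purpose
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

print(validate_prices(sample))
```

Returning a list of messages instead of raising on the first failure makes it easy to see every problem in a file at once.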
2. Feature Generation
Most strategies require additional computed columns. For example:
- sma_200: the 200-day simple moving average of the close price
- ema_50: the 50-day exponential moving average of the close price
- price_above_sma: a boolean column indicating whether close is above sma_200
These columns are calculated ahead of time and added to the dataset. Once this is done, the strategy logic becomes a matter of simple rule evaluation — no need to recompute indicators every time the strategy runs.
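One way these columns could be computed with pandas is sketched below. The add_features helper and the synthetic price ramp are assumptions for illustration. The EMA uses span=50 with adjust=False, which is one common convention; the text does not specify which EMA variant is intended.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with indicator columns added (hypothetical helper)."""
    out = df.copy()
    # 200-day simple moving average of close; NaN until 200 rows are available.
    out["sma_200"] = out["close"].rolling(window=200).mean()
    # 50-day exponential moving average of close (recursive form, an assumed convention).
    out["ema_50"] = out["close"].ewm(span=50, adjust=False).mean()
    # Comparisons against NaN evaluate to False, so the warm-up rows are False.
    out["price_above_sma"] = out["close"] > out["sma_200"]
    return out


# Synthetic steadily rising prices, used only to exercise the function.
dates = pd.bdate_range("2020-01-01", periods=250)
prices = pd.DataFrame({"close": 100.0 + np.arange(250)}, index=dates)

features = add_features(prices)
print(features.iloc[[0, 199, 249]])
```

Note that sma_200 is undefined for the first 199 rows, so price_above_sma is False there by construction rather than because of any market signal. A backtest may want to skip that warm-up period.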
10.6 What Comes Next
At this point, we have a clean, structured, and feature-rich dataset — the foundation of every strategy we’ll build.
In the next chapter, we’ll simulate the simplest strategy of all: Buy and Hold. It requires no indicators or signals — just an initial investment and the patience to wait. While simple, it sets the baseline against which all other strategies will be measured.
Let’s get started.