Chapter 10 Preparing Historical Market Data

"Without data, you’re just another person with an opinion."

— W. Edwards Deming

Imagine you’re offered a time machine — one that lets you revisit the entire history of the stock market, day by day. You can watch every price swing, every recovery, every crash. There’s just one catch: you can’t change the past. All you can do is observe, take notes, and try to learn something that might help you in the future.

That’s exactly what historical market data is.

It’s not a prediction engine, nor a crystal ball. But it’s the next best thing — a complete record of human behavior under greed, fear, euphoria, and panic. If you want to build a trading strategy grounded in reality, this is where you start.

In this chapter, we’ll talk about how we gather that data, what it looks like, how to prepare it for backtesting, and why small details like “close” versus “adjusted close” matter more than they seem. Once we have clean, usable data, we’ll be ready to simulate our first strategy: Buy and Hold.

10.1 Getting the Data

The historical data used throughout this book comes from a public financial dataset of daily trading activity for SPY, covering January 29, 1993 through the present. While the precise origin of the dataset is not essential, it’s important that the data be:

Broadly available or reproducible
Accurate and free from major gaps
Cleaned and aligned to trading days

In practical applications, historical data often comes from financial APIs, downloadable CSV files, or public data repositories. In this book, we’ll assume you already have access to a clean, well-structured dataset — and focus instead on how to interpret it and get it ready for backtesting.

10.2 Understanding the Dataset Structure

Each row in the dataset represents a single trading day and includes key market data fields:

date — The trading date in YYYY-MM-DD format
open — The first traded price of the day
high — The highest price reached during the trading session
low — The lowest price reached during the session
close — The final traded price of the day
adjclose — The closing price adjusted for dividends and stock splits
volume — The total number of shares traded that day

Here’s an example of what a data row might look like:


date	open	high	low	close	adjclose	volume
2025-04-25	546.65	549.20	543.70	549.04	549.04	25,381,939

Not every field will be used in every strategy. In this book, most of our analysis and strategy logic will rely on just three columns:

date
close
adjclose

The choice between close and adjclose depends on what you’re trying to simulate, as discussed below.
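As a minimal sketch of reading this layout, the snippet below parses the sample row from CSV text using only the Python standard library. The comma-separated format, the SAMPLE_CSV text, and the helper names parse_row and load_rows are assumptions made for illustration; your own file may use different delimiters or number formatting.

```python
import csv
import io
from datetime import date

# Hypothetical sample in the column layout described above. A real file
# would be opened with open(path, newline="") instead of io.StringIO.
SAMPLE_CSV = """date,open,high,low,close,adjclose,volume
2025-04-25,546.65,549.20,543.70,549.04,549.04,"25,381,939"
"""

PRICE_FIELDS = ("open", "high", "low", "close", "adjclose")


def parse_row(raw):
    """Convert one CSV row of strings into typed values."""
    row = {"date": date.fromisoformat(raw["date"])}
    for field in PRICE_FIELDS:
        row[field] = float(raw[field])
    # Volume may carry thousands separators, as in the sample row.
    row["volume"] = int(raw["volume"].replace(",", ""))
    return row


def load_rows(handle):
    """Read all rows from a file-like object, sorted by date."""
    rows = [parse_row(raw) for raw in csv.DictReader(handle)]
    rows.sort(key=lambda r: r["date"])
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Keep only the three columns most of our strategy logic relies on.
core = [(r["date"], r["close"], r["adjclose"]) for r in rows]
print(core[0])
```

Converting the date string to a real date object up front makes later steps, such as sorting and gap checks, simple comparisons rather than string handling.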

10.3 When to Use Close vs. Adjusted Close

The close price represents the actual market closing price on a given trading day. The adjclose (adjusted close) price reflects adjustments for dividends and stock splits, and is intended to preserve continuity in long-term price series.

This distinction is critical in backtesting:

Use adjusted close when testing strategies that involve buying and holding shares of SPY. Adjusted close ensures your return calculations correctly reflect dividend reinvestments and split adjustments over time.
Use close when testing strategies involving call options or other derivatives. Option holders do not receive dividends, and option prices already incorporate dividend expectations. Strikes and payoffs are also defined against the price that actually traded, not an adjusted one. For this reason, all options-based strategies in this book use the raw close price.
Use close for calculating indicators like moving averages (SMA/EMA). Indicators reflect price momentum and patterns as market participants observe them, and technical traders typically chart the price that actually traded rather than a dividend-adjusted series.

Even if you are simulating a dividend-reinvesting stock strategy, your moving averages and other technical indicators should be based on close, not adjclose. This matches real market behavior: traders and institutions make decisions based on the current traded price, not its adjusted equivalent.
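To make the difference concrete, here is a small sketch that computes the return over the same period from each column. The numbers are made up for illustration and are not real SPY quotes, and total_return is a hypothetical helper name.

```python
def total_return(prices):
    """Fractional return from the first to the last price in the series."""
    return prices[-1] / prices[0] - 1.0


# Made-up illustrative values, not real SPY quotes. The adjusted series
# starts lower because earlier prices are scaled down to account for
# dividends paid later in the period.
close = [100.0, 102.0, 101.0, 104.0]
adjclose = [98.0, 100.2, 99.6, 104.0]

price_return = total_return(close)        # price change only
holding_return = total_return(adjclose)   # price change plus dividends

print(f"close:    {price_return:.2%}")
print(f"adjclose: {holding_return:.2%}")
```

The gap between the two figures is the contribution of dividends, which is why a buy-and-hold simulation based on close alone would understate the investor's actual result.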

10.4 Why Daily Data?

You may wonder: why use daily data rather than weekly or minute-by-minute data?

The answer lies in scope and relevance. Most long-term trading strategies, especially those involving moving averages or investment-based decisions, do not depend on short-term price noise. Daily data:

Balances detail and clarity
Keeps simulations fast and interpretable
Matches the frequency of most long-term strategy signals

If your goal is to capture large trends and avoid major drawdowns — rather than scalp minute-to-minute moves — daily data is the right tool.

10.5 Preprocessing the Data

Before we can run any strategy, we need to prepare the dataset. This includes two key steps:

1. Data Validation and Cleaning

Ensure that dates are in chronological order, with no duplicates and no missing trading days
Verify column names and data formats are consistent (e.g., date formats, numeric fields)
Handle missing or anomalous values (e.g., zero volume, NaNs in price fields)

SPY is a highly liquid and mature asset, so these issues are rare — but it’s still good practice to validate.
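The checks above can be sketched as a single pass over the rows. This is one possible approach under stated assumptions, not a complete validator: the validate and make_row names are hypothetical, rows are assumed to be dictionaries keyed by the column names from section 10.2, and the four-day gap threshold is a rough heuristic that tolerates weekends and ordinary long weekends without consulting an exchange calendar.

```python
import math
from datetime import date

PRICE_FIELDS = ("open", "high", "low", "close", "adjclose")


def validate(rows, max_gap_days=4):
    """Return a list of human-readable problems found in daily rows.

    max_gap_days is an assumed heuristic: longer calendar gaps are
    flagged for manual review, since they may be unusual market closures
    rather than missing data.
    """
    problems = []
    prev_date = None
    for row in rows:
        day = row["date"]
        if prev_date is not None:
            gap = (day - prev_date).days
            if gap <= 0:
                problems.append(f"{day}: duplicate or out-of-order date")
            elif gap > max_gap_days:
                problems.append(f"{day}: {gap}-day gap after {prev_date}")
        for field in PRICE_FIELDS:
            value = row[field]
            if math.isnan(value) or value <= 0:
                problems.append(f"{day}: invalid {field} ({value})")
        if row["high"] < row["low"]:
            problems.append(f"{day}: high is below low")
        if row["volume"] <= 0:
            problems.append(f"{day}: zero or negative volume")
        prev_date = day
    return problems


def make_row(day, price, volume):
    """Build a flat illustrative row where every price field equals price."""
    row = {field: price for field in PRICE_FIELDS}
    row.update(date=day, volume=volume)
    return row


# Made-up rows: the third one has a long gap, a NaN close, and zero volume.
rows = [
    make_row(date(2025, 4, 24), 546.0, 1000),
    make_row(date(2025, 4, 25), 549.0, 1200),
    make_row(date(2025, 5, 5), 550.0, 0),
]
rows[2]["close"] = float("nan")

problems = validate(rows)
for problem in problems:
    print(problem)
```

Collecting problems into a list, rather than stopping at the first one, lets you review everything that looks suspicious before deciding whether to repair, drop, or keep each row.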

2. Feature Generation

Most strategies require additional computed columns. For example:

sma_200 — the 200-day simple moving average of the close price
ema_50 — the 50-day exponential moving average
price_above_sma — a boolean column indicating whether the price is above the SMA

These columns are calculated ahead of time and added to the dataset. Once this is done, the strategy logic becomes a matter of simple rule evaluation — no need to recompute indicators every time the strategy runs.
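A minimal sketch of this precomputation follows, using plain Python lists. The function names are hypothetical, and two details are assumptions rather than anything fixed by the text: the EMA is computed on close with smoothing factor 2 / (span + 1) and seeded with the first value (one common convention among several), and rows without a full SMA window get None instead of a number.

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def ema(values, span):
    """Exponential moving average seeded with the first value (assumed)."""
    alpha = 2.0 / (span + 1)
    out = []
    for value in values:
        out.append(value if not out else alpha * value + (1 - alpha) * out[-1])
    return out


def add_features(rows, sma_window=200, ema_span=50):
    """Attach sma_<n>, ema_<n>, and price_above_sma to each row in place."""
    closes = [row["close"] for row in rows]
    smas = sma(closes, sma_window)
    emas = ema(closes, ema_span)
    for row, s, e in zip(rows, smas, emas):
        row[f"sma_{sma_window}"] = s
        row[f"ema_{ema_span}"] = e
        row["price_above_sma"] = None if s is None else row["close"] > s
    return rows


# Tiny made-up series with short windows so the result is easy to check
# by hand; the defaults produce the sma_200 and ema_50 columns.
rows = [{"close": c} for c in (10.0, 11.0, 12.0, 13.0)]
add_features(rows, sma_window=3, ema_span=3)
for row in rows:
    print(row)
```

Note that both indicators are built from close, in line with section 10.3, even when the strategy's returns are later measured with adjclose.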

10.6 What Comes Next

At this point, we have a clean, structured, and feature-rich dataset — the foundation of every strategy we’ll build.

In the next chapter, we’ll simulate the simplest strategy of all: Buy and Hold. It requires no indicators or signals — just an initial investment and the patience to wait. While simple, it sets the baseline against which all other strategies will be measured.

Let’s get started.
