Working with high-frequency data
Two categories of market data cover the thousands of companies listed on U.S. exchanges that are traded under Reg NMS: the consolidated feed combines trade and quote data from each trading venue, whereas each individual exchange offers proprietary products with additional activity information for that particular venue.
In this section, we will first present proprietary order flow data provided by Nasdaq that represents the actual stream of orders, trades, and resulting prices as they occur on a tick-by-tick basis. Then, we will demonstrate how to regularize this continuous stream of data that arrives at irregular intervals into bars of a fixed duration. Finally, we will introduce AlgoSeek's equity minute bar data, which contains consolidated trade and quote information. In each case, we will illustrate how to work with the data using Python so that you can leverage these sources for your trading strategy.
How to work with Nasdaq order book data
The primary source of market data is the order book, which updates in real time throughout the day to reflect all trading activity. Exchanges typically offer this data as a real-time service for a fee; however, they may provide some historical data for free.
In the United States, stock markets provide quotes in three tiers, namely Level 1 (L1), Level 2 (L2), and Level 3 (L3), that offer increasingly granular information and capabilities:
- Level 1 (L1): Real-time bid- and ask-price information, as available from numerous online sources.
- Level 2 (L2): Adds information about bid and ask prices by specific market makers as well as the size and time of recent transactions for better insights into the liquidity of a given equity.
- Level 3 (L3): Adds the ability to enter or change quotes, execute orders, and confirm trades and is available only to market makers and exchange member firms. Access to Level 3 quotes permits registered brokers to meet best execution requirements.
The trading activity is reflected in numerous messages about orders sent by market participants. These messages typically conform to the electronic Financial Information eXchange (FIX) communications protocol for the real-time exchange of securities transactions and market data or a native exchange protocol.
Communicating trades with the FIX protocol
Just as SWIFT is the message protocol for back-office messaging (for example, in trade settlement), the FIX protocol is the de facto messaging standard for communication before and during trade executions between exchanges, banks, brokers, clearing firms, and other market participants. Fidelity Investments and Salomon Brothers introduced FIX in 1992 to facilitate the electronic communication between broker-dealers and institutional clients who, until then, exchanged information over the phone.
It became popular in global equity markets before expanding into foreign exchange, fixed income and derivatives markets, and further into post-trade to support straight-through processing. Exchanges provide access to FIX messages as a real-time data feed that is parsed by algorithmic traders to track market activity and, for example, identify the footprint of market participants and anticipate their next move.
The sequence of messages allows for the reconstruction of the order book. The scale of transactions across numerous exchanges creates a large amount (~10 TB) of unstructured data that is challenging to process and, hence, can be a source of competitive advantage.
The FIX protocol, currently at version 5.0, is a free and open standard with a large community of affiliated industry professionals. It is self-describing, like the more recent XML, and a FIX session is supported by the underlying Transmission Control Protocol (TCP) layer. The community continually adds new functionality.
The protocol supports pipe-separated key-value pairs, as well as a tag-based FIXML syntax. A sample message that requests a server login would look as follows:
8=FIX.5.0|9=127|35=A|59=theBroker.123456|56=CSERVER|34=1|32=20180117-08:03:04|57=TRADE|50=any_string|98=2|108=34|141=Y|553=12345|554=passw0rd!|10=131|
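To make the tag=value structure concrete, the following minimal sketch splits the sample message into its fields using only the standard library. The helper name parse_fix is a hypothetical choice for illustration and is not part of any FIX library. The pipe is used here for readability; on the wire, FIX fields are delimited by the SOH character (\x01).

```python
def parse_fix(message: str, sep: str = '|') -> dict:
    """Split a tag=value FIX message into a {tag: value} dict (illustrative only)."""
    fields = {}
    for pair in message.strip(sep).split(sep):
        tag, _, value = pair.partition('=')
        fields[int(tag)] = value
    return fields


sample = ('8=FIX.5.0|9=127|35=A|59=theBroker.123456|56=CSERVER|34=1|'
          '32=20180117-08:03:04|57=TRADE|50=any_string|98=2|108=34|'
          '141=Y|553=12345|554=passw0rd!|10=131|')

login = parse_fix(sample)
# Tag 8 = BeginString, 35 = MsgType ('A' is a logon), 108 = heartbeat interval
print(login[8], login[35], login[108])
```

A plain dictionary is sufficient for a flat message like this one; messages with repeating groups reuse the same tag several times and would need an ordered list of pairs instead.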
There are a few open source FIX implementations in Python that can be used to formulate and parse FIX messages. The service provider Interactive Brokers offers a FIX-based computer-to-computer interface (CTCI) for automated trading (refer to the resources section for this chapter in the GitHub repository).
The Nasdaq TotalView-ITCH data feed
While FIX has a dominant market share, exchanges also offer native protocols. Nasdaq offers a TotalView-ITCH direct data-feed protocol, which allows subscribers to track individual orders for equity instruments from placement to execution or cancellation.
Historical records of this data flow permit the reconstruction of the order book that keeps track of the active limit orders for a specific security. The order book reveals the market depth throughout the day by listing the number of shares being bid or offered at each price point. It may also identify the market participant responsible for specific buy and sell orders unless they are placed anonymously. Market depth is a key indicator of liquidity and the potential price impact of sizable market orders.
In addition to matching market and limit orders, Nasdaq also operates auctions or crosses that execute a large number of trades at market opening and closing. Crosses are becoming more important as passive investing continues to grow and traders look for opportunities to execute larger blocks of stock. TotalView also disseminates the Net Order Imbalance Indicator (NOII) for Nasdaq opening and closing crosses and Nasdaq IPO/Halt Cross.
How to parse binary order messages
The ITCH v5.0 specification declares over 20 message types related to system events, stock characteristics, the placement and modification of limit orders, and trade execution. It also contains information about the net order imbalance before the opening and closing crosses.
Nasdaq offers samples of daily binary files for several months. The GitHub repository for this chapter contains a notebook, parse_itch_order_flow_messages.ipynb, that illustrates how to download and parse a sample file of ITCH messages. The notebook rebuild_nasdaq_order_book.ipynb then goes on to reconstruct both the executed trades and the order book for any given ticker.
The following table shows the frequency of the most common message types for the sample file dated October 30, 2019:
Message type | Order book impact | Number of messages |
A | New unattributed limit order | 127,214,649 |
D | Order canceled | 123,296,742 |
U | Order canceled and replaced | 25,513,651 |
E | Full or partial execution; possibly multiple messages for the same original order | 7,316,703 |
X | Modified after partial cancellation | 3,568,735 |
F | Add attributed order | 1,423,908 |
P | Trade message (non-cross) | 1,525,363 |
C | Executed in whole or in part at a price different from the initial display price | 129,729 |
Q | Cross trade message | 17,775 |
For each message, the specification lays out the components and their respective length and data types:
Name | Offset | Length | Value | Notes |
Message type | 0 | 1 | S | System event message. |
Stock locate | 1 | 2 | Integer | Always 0. |
Tracking number | 3 | 2 | Integer | Nasdaq internal tracking number. |
Timestamp | 5 | 6 | Integer | The number of nanoseconds since midnight. |
Order reference number | 11 | 8 | Integer | The unique reference number assigned to the new order at the time of receipt. |
Buy/sell indicator | 19 | 1 | Alpha | The type of order being added: B = Buy Order, and S = Sell Order. |
Shares | 20 | 4 | Integer | The total number of shares associated with the order being added to the book. |
Stock | 24 | 8 | Alpha | Stock symbol, right-padded with spaces. |
Price | 32 | 4 | Price (4) | The display price of the new order. Refer to Data Types in the specification for field processing notes. |
Attribution | 36 | 4 | Alpha | The Nasdaq market participant identifier associated with the entered order. |
Python provides the struct module to parse binary data using format strings that identify the message elements by indicating the length and type of the various components of the byte string as laid out in the specification.
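As a rough illustration of how a format string maps onto the field layout in the table, the sketch below packs and then unpacks a single made-up order record. The AddOrder tuple, the sample values, and the choice to treat the record as everything after the one-byte message type are assumptions for this example; the notebook derives the format strings from the specification file instead.

```python
import struct
from collections import namedtuple

# Fields that follow the 1-byte message type, in the order given by the table
AddOrder = namedtuple('AddOrder', ['stock_locate', 'tracking_number', 'timestamp',
                                   'order_reference_number', 'buy_sell_indicator',
                                   'shares', 'stock', 'price', 'attribution'])

# '>' = big-endian without padding; H = 2-byte int, I = 4-byte int,
# Q = 8-byte int, Ns = byte string of length N
FORMAT = '>HH6sQsI8sI4s'

# Made-up record for illustration only (not real market data)
ts_ns = 34_200_000_000_000  # 09:30:00 expressed in nanoseconds since midnight
raw = struct.pack(FORMAT, 13, 0, ts_ns.to_bytes(6, 'big'), 1001, b'B',
                  100, b'AAPL    ', 2_432_500, b'NSDQ')

msg = AddOrder._make(struct.unpack(FORMAT, raw))
timestamp = int.from_bytes(msg.timestamp, 'big')  # struct has no 6-byte integer code
stock = msg.stock.decode('ascii').strip()         # remove the right padding
price = msg.price / 10_000                        # Price (4): four implied decimals
print(stock, msg.shares, price, timestamp)
```

The six-byte timestamp is read as a byte string and converted afterward because struct offers no six-byte integer code, which matches the '6s' entry in the formats dictionary used by the parser.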
Let's walk through the critical steps required to parse the trading messages and reconstruct the order book:
- The ITCH parser relies on the message specifications provided in the file message_types.xlsx (refer to the notebook parse_itch_order_flow_messages.ipynb for details). It assembles format strings according to the formats dictionary:
  formats = {
      ('integer', 2): 'H',   # int of length 2 => format string 'H'
      ('integer', 4): 'I',
      ('integer', 6): '6s',  # int of length 6 => parse as string, convert later
      ('integer', 8): 'Q',
      ('alpha', 1)  : 's',
      ('alpha', 2)  : '2s',
      ('alpha', 4)  : '4s',
      ('alpha', 8)  : '8s',
      ('price_4', 4): 'I',
      ('price_8', 8): 'Q',
  }
- The parser translates the message specs into format strings and named tuples that capture the message content:
  from collections import namedtuple
  import pandas as pd
  # Get ITCH specs and create formatting (type, length) tuples
  specs = pd.read_csv('message_types.csv')
  specs['formats'] = specs[['value', 'length']].apply(tuple, axis=1).map(formats)
  # Formatting for alpha fields
  alpha_fields = specs[specs.value == 'alpha'].set_index('name')
  alpha_msgs = alpha_fields.groupby('message_type')
  alpha_formats = {k: v.to_dict() for k, v in alpha_msgs.formats}
  alpha_length = {k: v.add(5).to_dict() for k, v in alpha_msgs.length}
  # Generate message classes as named tuples and format strings
  message_fields, fstring = {}, {}
  for t, message in specs.groupby('message_type'):
      message_fields[t] = namedtuple(typename=t, field_names=message.name.tolist())
      fstring[t] = '>' + ''.join(message.formats.tolist())
- Fields of the alpha type require postprocessing, as defined in the format_alpha function:
  def format_alpha(mtype, data):
      """Process byte strings of type alpha"""
      for col in alpha_formats.get(mtype).keys():
          if mtype != 'R' and col == 'stock':
              data = data.drop(col, axis=1)
              continue
          data.loc[:, col] = (data.loc[:, col]
                              .str.decode("utf-8")
                              .str.strip())
          # encoding maps coded column values to labels (defined in the notebook)
          if encoding.get(col):
              data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
      return data
The binary file for a single day contains over 300,000,000 messages that amount to over 9 GB. The script appends the parsed result iteratively to a file in the fast HDF5 format to avoid memory constraints. (Refer to the Efficient data storage with pandas section later in this chapter for more information on the HDF5 format.)
The following (simplified) code processes the binary file and produces the parsed orders stored by message type:
from struct import unpack

with (data_path / file_name).open('rb') as data:
    while True:
        # each message is preceded by its size as a 2-byte big-endian integer
        message_size = int.from_bytes(data.read(2), byteorder='big',
                                      signed=False)
        message_type = data.read(1).decode('ascii')
        message_type_counter.update([message_type])
        record = data.read(message_size - 1)
        message = message_fields[message_type]._make(
            unpack(fstring[message_type], record))
        messages[message_type].append(message)
        # deal with system events like market open/close
        if message_type == 'S':
            timestamp = int.from_bytes(message.timestamp,
                                       byteorder='big')
            if message.event_code.decode('ascii') == 'C':  # close
                store_messages(messages)
                break
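The framing logic in this loop can be tried without the 9 GB sample file. The following sketch builds a tiny in-memory stream of length-prefixed messages and reads it back. The frame helper and the payloads are invented for illustration, and the explicit end-of-stream check is an addition that the simplified loop above does not need because it stops at the closing system event.

```python
import io
import struct
from collections import Counter


def frame(msg_type: bytes, payload: bytes) -> bytes:
    """Prefix a message with its 2-byte big-endian size (type byte + payload)."""
    return struct.pack('>H', len(payload) + 1) + msg_type + payload


# Three made-up messages standing in for the binary file
stream = io.BytesIO(frame(b'S', b'open') +
                    frame(b'A', b'order-1') +
                    frame(b'S', b'close'))

counts, records = Counter(), []
while True:
    header = stream.read(2)
    if len(header) < 2:  # fewer than two bytes left: end of stream
        break
    size = int.from_bytes(header, byteorder='big', signed=False)
    msg_type = stream.read(1).decode('ascii')
    payload = stream.read(size - 1)  # size includes the type byte already read
    counts[msg_type] += 1
    records.append((msg_type, payload))

print(counts)
```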
Summarizing the trading activity for all 8,500 stocks
As expected, a small number of the 8,500-plus securities traded on this day account for most trades:
from matplotlib.ticker import FuncFormatter

with pd.HDFStore(itch_store) as store:
    stocks = store['R'].loc[:, ['stock_locate', 'stock']]
    # DataFrame.append() was removed in pandas 2.0, so we use pd.concat()
    trades = (pd.concat([store['P'],
                         store['Q'].rename(columns={'cross_price': 'price'})],
                        sort=False)
              .merge(stocks))
trades['value'] = trades.shares.mul(trades.price)
trades['value_share'] = trades.value.div(trades.value.sum())
trade_summary = (trades.groupby('stock').value_share
                 .sum().sort_values(ascending=False))
trade_summary.iloc[:50].plot.bar(figsize=(14, 6),
                                 color='darkblue',
                                 title='Share of Traded Value')
f = lambda y, _: '{:.0%}'.format(y)
plt.gca().yaxis.set_major_formatter(FuncFormatter(f))
Figure 2.1 shows the resulting plot:
Figure 2.1: The share of traded value of the 50 most traded securities
How to reconstruct all trades and the order book
The parsed messages allow us to rebuild the order flow for the given day. The 'R' message type contains a listing of all stocks traded during a given day, including information about initial public offerings (IPOs) and trading restrictions.
Throughout the day, new orders are added, and orders that are executed and canceled are removed from the order book. The proper accounting for messages that reference orders placed on a prior date would require tracking the order book over multiple days.
The get_messages() function illustrates how to collect the messages for a single stock that affect trading. (Refer to the ITCH specification for details about each message.) The code is slightly simplified; refer to the notebook rebuild_nasdaq_order_book.ipynb for further details:
def get_messages(date, stock=stock):
    """Collect trading messages for given stock"""
    with pd.HDFStore(itch_store) as store:
        stock_locate = store.select('R', where='stock = stock').stock_locate.iloc[0]
        target = 'stock_locate = stock_locate'
        data = {}
        # relevant message types
        messages = ['A', 'F', 'E', 'C', 'X', 'D', 'U', 'P', 'Q']
        for m in messages:
            data[m] = (store.select(m, where=target)
                       .drop('stock_locate', axis=1).assign(type=m))
    order_cols = ['order_reference_number', 'buy_sell_indicator',
                  'shares', 'price']
    orders = pd.concat([data['A'], data['F']], sort=False,
                       ignore_index=True).loc[:, order_cols]
    # add order details to the execution and cancellation messages E, C, X, D
    for m in messages[2: -3]:
        data[m] = data[m].merge(orders, how='left')
    data['U'] = data['U'].merge(orders, how='left',
                                right_on='order_reference_number',
                                left_on='original_order_reference_number',
                                suffixes=['', '_replaced'])
    data['Q'].rename(columns={'cross_price': 'price'}, inplace=True)
    data['X']['shares'] = data['X']['cancelled_shares']
    data['X'] = data['X'].dropna(subset=['price'])
    data = pd.concat([data[m] for m in messages], ignore_index=True,
                     sort=False)
    return data
Reconstructing successful trades (that is, orders that were executed as opposed to those that were canceled) from the trade-related message types C, E, P, and Q is relatively straightforward:
def get_trades(m):
    """Combine C, E, P and Q messages into trading records"""
    trade_dict = {'executed_shares': 'shares', 'execution_price': 'price'}
    cols = ['timestamp', 'executed_shares']
    trades = pd.concat([m.loc[m.type == 'E',
                              cols + ['price']].rename(columns=trade_dict),
                        m.loc[m.type == 'C',
                              cols + ['execution_price']]
                        .rename(columns=trade_dict),
                        m.loc[m.type == 'P', ['timestamp', 'price',
                                              'shares']],
                        m.loc[m.type == 'Q',
                              ['timestamp', 'price', 'shares']]
                        .assign(cross=1), ],
                       sort=False).dropna(subset=['price']).fillna(0)
    return trades.set_index('timestamp').sort_index().astype(int)
The order book keeps track of limit orders, and the various price levels for buy and sell orders constitute the depth of the order book. Reconstructing the order book for a given level of depth requires the following steps:
The add_orders() function accumulates sell orders in ascending order and buy orders in descending order for a given timestamp up to the desired level of depth:
def add_orders(orders, buysell, nlevels):
    new_order = []
    items = sorted(orders.copy().items())
    if buysell == 1:
        items = reversed(items)
    for i, (p, s) in enumerate(items, 1):
        new_order.append((p, s))
        if i == nlevels:
            break
    return orders, new_order
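To see the effect of this sorting on a small example, the sketch below keeps each side of the book as a price-to-shares mapping and extracts the best levels. It assumes the book is stored in a collections.Counter, and the top_levels helper as well as all prices and sizes are hypothetical; it mirrors the idea of add_orders() rather than reproducing it.

```python
from collections import Counter


def top_levels(orders, is_buy, nlevels):
    """Return up to nlevels (price, shares) pairs, best price first."""
    # Best bid is the highest price; best ask is the lowest price
    return sorted(orders.items(), reverse=is_buy)[:nlevels]


# Hypothetical resting orders: price in units of 1/10,000 dollars -> total shares
bids = Counter({2_432_400: 300, 2_432_300: 500, 2_432_100: 200})
asks = Counter({2_432_600: 100, 2_432_700: 400, 2_432_900: 250})

# An execution removes 300 shares at the best bid; Counter.update() adds counts
bids.update({2_432_400: -300})
bids = +bids  # unary plus drops price levels with non-positive share counts

print(top_levels(bids, True, 2))
print(top_levels(asks, False, 2))
```

Storing share counts per price level means that additions and removals are simple signed updates, which is also how the message loop that follows treats executions and cancellations.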
We iterate over all ITCH messages and process orders and their replacements as required by the specification:
for message in messages.itertuples():
    i = message[0]
    if np.isnan(message.buy_sell_indicator):
        continue
    message_counter.update(message.type)
    buysell = message.buy_sell_indicator
    price, shares = None, None
    # new and replacement orders add shares at their price level
    if message.type in ['A', 'F', 'U']:
        price, shares = int(message.price), int(message.shares)
        current_orders[buysell].update({price: shares})
        current_orders[buysell], new_order = add_orders(
            current_orders[buysell], buysell, nlevels)
        order_book[buysell][message.timestamp] = new_order
    # executions, cancellations, and replaced orders remove shares
    if message.type in ['E', 'C', 'X', 'D', 'U']:
        if message.type == 'U':
            if not np.isnan(message.shares_replaced):
                price = int(message.price_replaced)
                shares = -int(message.shares_replaced)
        else:
            if not np.isnan(message.price):
                price = int(message.price)
                shares = -int(message.shares)
        if price is not None:
            current_orders[buysell].update({price: shares})
            if current_orders[buysell][price] <= 0:
                current_orders[buysell].pop(price)
            current_orders[buysell], new_order = add_orders(
                current_orders[buysell], buysell, nlevels)
            order_book[buysell][message.timestamp] = new_order
Figure 2.2 highlights the depth of liquidity at any given point in time using different intensities that visualize the number of orders at different price levels. The left panel shows how the distribution of limit order prices was weighted toward buy orders at higher prices.
The right panel plots the evolution of limit orders and prices throughout the trading day: the dark line tracks the prices for executed trades during market hours, whereas the red and blue dots indicate individual limit orders on a per-minute basis (refer to the notebook for details):
Figure 2.2: AAPL market liquidity according to the order book
From ticks to bars – how to regularize market data
The trade data is indexed by nanoseconds, arrives at irregular intervals, and is very noisy. The bid-ask bounce, for instance, causes the price to oscillate between the bid and ask prices when trade initiation alternates between buy and sell market orders. To improve the signal-to-noise ratio and the statistical properties of the price series, we need to resample and regularize the tick data by aggregating the trading activity.
We typically collect the open (first), high, low, and closing (last) price and volume (jointly abbreviated as OHLCV) for the aggregated period, alongside the volume-weighted average price (VWAP) and the timestamp associated with the data.
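As a small worked example of these statistics, the following sketch computes the OHLCV values and the VWAP for the trades of a single bar. The prices and share counts are made up for illustration.

```python
# Made-up trades that fall into one bar, in time order: (price, shares)
trades = [(100.0, 200), (100.5, 100), (99.5, 300), (100.2, 400)]

prices = [price for price, _ in trades]
volume = sum(shares for _, shares in trades)

bar = {
    'open': prices[0],    # first trade price
    'high': max(prices),
    'low': min(prices),
    'close': prices[-1],  # last trade price
    'volume': volume,
    # VWAP = sum(price * shares) / sum(shares)
    'vwap': sum(price * shares for price, shares in trades) / volume,
}
print(bar)
```

Because the VWAP weights each price by the number of shares traded, the large trade at 99.5 pulls it below the closing price of 100.2.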
Refer to the normalize_tick_data.ipynb notebook in the folder for this chapter on GitHub for additional details.
The raw material – tick bars
The following code generates a plot of the raw tick price and volume data for AAPL:
stock, date = 'AAPL', '20191030'
title = '{} | {}'.format(stock, pd.to_datetime(date).date())
with pd.HDFStore(itch_store) as store:
    sys_events = store['S'].set_index('event_code')  # system events
    sys_events.timestamp = sys_events.timestamp.add(pd.to_datetime(date)).dt.time
    market_open = sys_events.loc['Q', 'timestamp']
    market_close = sys_events.loc['M', 'timestamp']
with pd.HDFStore(stock_store) as store:
    trades = store['{}/trades'.format(stock)].reset_index()
trades = trades[trades.cross == 0]  # exclude trades from opening/closing crosses
trades.price = trades.price.mul(1e-4)  # format price
# between_time() requires a DatetimeIndex, so we index by timestamp first
trades = trades.set_index('timestamp').between_time(market_open, market_close)  # market hours only
tick_bars = trades.copy()
tick_bars.index = tick_bars.index.time
tick_bars.price.plot(figsize=(10, 5), title=title, lw=1)
Figure 2.3 displays the resulting plot:
Figure 2.3: Tick bars
The tick returns are far from normally distributed, as evidenced by the low p-value of scipy.stats.normaltest:
from scipy.stats import normaltest
normaltest(tick_bars.price.pct_change().dropna())
NormaltestResult(statistic=62408.76562431228, pvalue=0.0)
Plain-vanilla denoising – time bars
Time bars involve trade aggregation by period. The following code gets the data for the time bars:
def get_bar_stats(agg_trades):
    vwap = agg_trades.apply(lambda x: np.average(x.price,
                                                 weights=x.shares)).to_frame('vwap')
    ohlc = agg_trades.price.ohlc()
    vol = agg_trades.shares.sum().to_frame('vol')
    txn = agg_trades.shares.size().to_frame('txn')
    return pd.concat([ohlc, vwap, vol, txn], axis=1)
resampled = trades.groupby(pd.Grouper(freq='1Min'))
time_bars = get_bar_stats(resampled)
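For intuition about what the grouping by minute does, here is a dependency-free sketch that assigns made-up ticks to one-minute buckets and summarizes each bucket. The tick values are invented, and the plain-Python bucketing stands in for pd.Grouper purely for illustration; for real data, the pandas version above is the practical choice.

```python
from collections import defaultdict

# Made-up ticks: (nanoseconds since midnight, price, shares)
ticks = [
    (34_200_100_000_000, 100.0, 100),  # 09:30:00.1
    (34_230_000_000_000, 100.4, 300),  # 09:30:30
    (34_262_000_000_000, 100.2, 200),  # 09:31:02
    (34_319_000_000_000, 100.6, 100),  # 09:31:59
]
NS_PER_MIN = 60 * 1_000_000_000

# Integer division maps every timestamp to the minute in which it falls
buckets = defaultdict(list)
for ts, price, shares in ticks:
    buckets[ts // NS_PER_MIN].append((price, shares))

time_bars = {}
for minute, group in sorted(buckets.items()):
    prices = [p for p, _ in group]
    label = f'{minute // 60:02d}:{minute % 60:02d}'  # bar start as HH:MM
    time_bars[label] = {'open': prices[0], 'high': max(prices),
                        'low': min(prices), 'close': prices[-1],
                        'vol': sum(s for _, s in group), 'txn': len(group)}

for label, stats in time_bars.items():
    print(label, stats)
```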
We can display the result as a price-volume chart:
def price_volume(df, price='vwap', vol='vol', suptitle=title, fname=None):
    fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(15, 8))
    axes[0].plot(df.index, df[price])
    axes[1].bar(df.index, df[vol], width=1 / (len(df.index)),
                color='r')
    xfmt = mpl.dates.DateFormatter('%H:%M')
    axes[1].xaxis.set_major_locator(mpl.dates.HourLocator(interval=3))
    axes[1].xaxis.set_major_formatter(xfmt)
    axes[1].get_xaxis().set_tick_params(which='major', pad=25)
    axes[0].set_title('Price', fontsize=14)
    axes[1].set_title('Volume', fontsize=14)
    fig.autofmt_xdate()
    fig.suptitle(suptitle)
    fig.tight_layout()
    plt.subplots_adjust(top=0.9)
price_volume(time_bars)
The preceding code produces Figure 2.4:
Figure 2.4: Time bars
Alternatively, we can represent the data as a candlestick chart using the Bokeh plotting library:
from math import pi
from bokeh.plotting import figure, show

resampled = trades.groupby(pd.Grouper(freq='5Min'))  # 5 Min bars for better print
df = get_bar_stats(resampled)
increase = df.close > df.open
decrease = df.open > df.close
w = 2.5 * 60 * 1000  # 2.5 min in ms
WIDGETS = "pan, wheel_zoom, box_zoom, reset, save"
p = figure(x_axis_type='datetime', tools=WIDGETS,
           width=1500,  # named plot_width in Bokeh versions before 3.0
           title="AAPL Candlestick")
p.xaxis.major_label_orientation = pi/4
p.grid.grid_line_alpha = 0.4
p.segment(df.index, df.high, df.index, df.low, color="black")
p.vbar(df.index[increase], w, df.open[increase], df.close[increase],
       fill_color="#D5E1DD", line_color="black")
p.vbar(df.index[decrease], w, df.open[decrease], df.close[decrease],
       fill_color="#F2583E", line_color="black")
show(p)
This produces the plot in Figure 2.5:
Figure 2.5: Bokeh candlestick plot
Accounting for order fragmentation – volume bars
Time bars smooth some of the noise contained in the raw tick data but may fail to account for the fragmentation of orders. Execution-focused algorithmic trading may aim to match the volume-weighted average price (VWAP) over a given period. This will divide a single order into multiple trades and place orders according to historical patterns. Time bars would treat the same order differently, even though no new information has arrived in the market.
Volume bars offer an alternative by aggregating trade data according to volume. We can accomplish this as follows:
min_per_trading_day = 60 * 7.5
trades_per_min = trades.shares.sum() / min_per_trading_day
trades['cumul_vol'] = trades.shares.cumsum()
df = trades.reset_index()
by_vol = df.groupby(df.cumul_vol
                    .div(trades_per_min)
                    .round().astype(int))
vol_bars = pd.concat([by_vol.timestamp.last().to_frame('timestamp'),
                      get_bar_stats(by_vol)], axis=1)
price_volume(vol_bars.set_index('timestamp'))
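The bucketing step is easier to follow on a handful of trades. The sketch below applies the same idea, dividing cumulative volume by a target bar size and rounding, to made-up data. The target of 500 shares per bar is an arbitrary assumption for this example.

```python
# Made-up trades in time order: (price, shares)
trades = [(100.0, 300), (100.1, 200), (100.3, 500), (100.2, 400), (100.4, 600)]
shares_per_bar = 500  # hypothetical target volume per bar

bars, cum_vol = {}, 0
for price, shares in trades:
    cum_vol += shares
    # Same bucketing as cumul_vol.div(...).round(): nearest multiple of the target
    bar_id = round(cum_vol / shares_per_bar)
    bars.setdefault(bar_id, []).append((price, shares))

bar_volumes = [sum(shares for _, shares in group) for group in bars.values()]
print(bar_volumes)
```

Whole trades are assigned to a bar rather than split across bars, so the volume per bar only approximates the target. A busy period therefore produces many bars, whereas a quiet one produces few.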
We get the plot in Figure 2.6 for the preceding code:
Figure 2.6: Volume bars
Accounting for price changes – dollar bars
When asset prices change significantly, or after stock splits, the value of a given number of shares changes. Volume bars do not correctly reflect this and can hamper the comparison of trading behavior for different periods that reflect such changes. In these cases, the volume bar method should be adjusted to utilize the product of shares and prices to produce dollar bars.
The following code shows the computation for dollar bars:
value_per_min = trades.shares.mul(trades.price).sum() / (60 * 7.5)  # min per trading day
trades['cumul_val'] = trades.shares.mul(trades.price).cumsum()
df = trades.reset_index()
by_value = df.groupby(df.cumul_val.div(value_per_min).round().astype(int))
dollar_bars = pd.concat([by_value.timestamp.last().to_frame('timestamp'),
                         get_bar_stats(by_value)], axis=1)
price_volume(dollar_bars.set_index('timestamp'),
             suptitle=f'Dollar Bars | {stock} | {pd.to_datetime(date).date()}')
The plot looks quite similar to the volume bar since the price has been fairly stable throughout the day:
Figure 2.7: Dollar bars
AlgoSeek minute bars – equity quote and trade data
AlgoSeek provides historical intraday data of the quality previously available only to institutional investors. The AlgoSeek Equity bars provide very detailed intraday quote and trade data in a user-friendly format, which is aimed at making it easy to design and backtest intraday ML-driven strategies. As we will see, the data includes not only OHLCV information but also information on the bid-ask spread and the number of ticks with up and down price moves, among others.
AlgoSeek has been so kind as to provide samples of minute bar data for the Nasdaq 100 stocks from 2013-2017 for demonstration purposes and will make a subset of this data available to readers of this book.
In this section, we will present the available trade and quote information and show how to process the raw data. In later chapters, we will demonstrate how you can use this data for ML-driven intraday strategies.
From the consolidated feed to minute bars
AlgoSeek minute bars are based on data provided by the Securities Information Processor (SIP), which manages the consolidated feed mentioned at the beginning of this section. You can view the documentation at https://www.algoseek.com/samples/.
The SIP aggregates the best bid and offer quotes from each exchange, as well as the resulting trades and prices. Exchanges are prohibited by law from sending their quotes and trades to direct feeds before sending them to the SIP. Given the fragmented nature of U.S. equity trading, the consolidated feed provides a convenient snapshot of the current state of the market.
More importantly, the SIP acts as the benchmark used by regulators to determine the National Best Bid and Offer (NBBO) according to Reg NMS. The OHLC bar quote prices are based on the NBBO, and each bid or ask quote price refers to an NBBO price.
Every exchange publishes its top-of-book price and the number of shares available at that price. The NBBO changes when a published quote improves the NBBO. Bid/ask quotes persist until there is a change due to trade, price improvement, or the cancelation of the latest bid or ask. While historical OHLC bars are often based on trades during the bar period, NBBO bid/ask quotes may be carried forward from the previous bar until there is a new NBBO event.
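The aggregation of top-of-book quotes into an NBBO can be illustrated with a few lines of code. The quotes below are invented, and summing the displayed size across venues that share the best price is a simplifying assumption for this sketch rather than a description of how the SIP reports size.

```python
# Hypothetical top-of-book quotes per venue: (bid, bid size, ask, ask size)
quotes = {
    'NASDAQ': (100.01, 300, 100.04, 200),
    'NYSE':   (100.02, 100, 100.05, 500),
    'ARCA':   (100.02, 400, 100.03, 100),
}

# National best bid is the highest bid; national best offer is the lowest ask
best_bid = max(q[0] for q in quotes.values())
best_ask = min(q[2] for q in quotes.values())

# Shares displayed at the best prices, added up across venues (assumption)
bid_size = sum(q[1] for q in quotes.values() if q[0] == best_bid)
ask_size = sum(q[3] for q in quotes.values() if q[2] == best_ask)

print(f'NBBO: {best_bid} x {best_ask} ({bid_size} x {ask_size} shares)')
```

A new quote changes this result only if it beats the current best price or alters the size available at it, which is why the NBBO can remain unchanged across several bars.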
AlgoSeek bars cover the whole trading day, from the opening of the first exchange until the closing of the last exchange. Bars outside regular market hours normally exhibit limited activity. Trading hours, in Eastern Time, are:
- Premarket: Approximately 04:00:00 (this varies by exchange) to 09:29:59
- Market: 09:30:00 to 16:00:00
- Extended hours: 16:00:01 to 20:00:00
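When filtering bars, it can help to label each one with its session. The function below is a minimal sketch based on the approximate boundaries listed above; the name trading_session and the 'closed' label are assumptions for this example, and the timestamps are taken to be in Eastern Time already.

```python
from datetime import time


def trading_session(t: time) -> str:
    """Label an Eastern Time timestamp with its trading session (sketch)."""
    if time(4, 0) <= t < time(9, 30):
        return 'premarket'
    if time(9, 30) <= t <= time(16, 0):
        return 'market'
    if time(16, 0) < t <= time(20, 0):
        return 'extended'
    return 'closed'


for t in [time(9, 29, 59), time(9, 30), time(16, 0), time(16, 0, 1), time(3, 0)]:
    print(t, trading_session(t))
```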
Quote and trade data fields
The minute bar data contains up to 54 fields. There are eight fields for each of the open, high, low, and close elements of the bar, namely:
- The timestamp for the bar and the corresponding trade
- The price and the size for the prevailing bid-ask quote and the relevant trade
There are also 14 data points with volume information for the bar period:
- The number of shares and corresponding trades
- The trade volumes at or below the bid, between the bid quote and the midpoint, at the midpoint, between the midpoint and the ask quote, and at or above the ask, as well as for crosses
- The number of shares traded with upticks or downticks, that is, when the price rose or fell, as well as when the price did not change, differentiated by the previous direction of price movement
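The uptick and downtick distinction in the last item can be sketched as a simple tick classification. The labels and the classify_ticks helper are assumptions chosen for illustration; the AlgoSeek documentation remains the reference for the exact field definitions.

```python
def classify_ticks(prices):
    """Label each trade after the first relative to the preceding trade price."""
    labels, last_direction = [], None
    for prev, curr in zip(prices, prices[1:]):
        if curr > prev:
            last_direction = 'up'
            labels.append('uptick')
        elif curr < prev:
            last_direction = 'down'
            labels.append('downtick')
        elif last_direction is None:
            labels.append('zero')  # unchanged price with no earlier move
        else:
            # Unchanged price, qualified by the direction of the last change
            labels.append(f'zero-{last_direction}tick')
    return labels


prices = [10.00, 10.01, 10.01, 10.00, 10.00, 10.02]  # made-up trade prices
print(classify_ticks(prices))
```

Summing the shares of the trades under each label would then yield volume figures of the kind described in the list.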
The AlgoSeek data also contains the number of shares reported to FINRA and processed internally at broker-dealers, by dark pools, or OTC. These trades represent volume that is hidden or not publicly available until after the fact.
Finally, the data includes the volume-weighted average price (VWAP) and minimum and maximum bid-ask spread for the bar period.
How to process AlgoSeek intraday data
In this section, we'll process the AlgoSeek sample data. The data directory on GitHub contains instructions on how to download that data from AlgoSeek.
The minute bar data comes in four versions: with or without quote information, and with or without FINRA's reported volume. There is one zipped folder per day, containing one CSV file per ticker.
The following code example extracts the trade-only minute bar data into daily .parquet files:
from pathlib import Path
from zipfile import ZipFile
import pandas as pd

directory = Path('1min_trades')
target = directory / 'parquet'
target.mkdir(exist_ok=True)  # make sure the output folder exists
for zipped_file in directory.glob('*/**/*.zip'):
    fname = zipped_file.stem
    print('\t', fname)
    zf = ZipFile(zipped_file)
    files = zf.namelist()
    data = (pd.concat([pd.read_csv(zf.open(f),
                                   parse_dates=[['Date',
                                                 'TimeBarStart']])
                       for f in files],
                      ignore_index=True)
            .rename(columns=lambda x: x.lower())
            .rename(columns={'date_timebarstart': 'date_time'})
            .set_index(['ticker', 'date_time']))
    data.to_parquet(target / (fname + '.parquet'))
We can combine the parquet files into a single HDF5 store as follows, yielding 53.8 million records that consume 3.2 GB of memory and cover 5 years and 100 stocks:
path = Path('1min_trades/parquet')
df = (pd.concat([pd.read_parquet(f) for f in path.glob('*.parquet')])
      .dropna(how='all', axis=1))
df.columns = ['open', 'high', 'low', 'close', 'trades', 'volume', 'vwap']
df.to_hdf('data.h5', key='1min_trades')
df.info(show_counts=True)  # null_counts was renamed to show_counts in recent pandas
MultiIndex: 53864194 entries, (AAL, 2014-12-22 07:05:00) to (YHOO, 2017-06-16 19:59:00)
Data columns (total 7 columns):
open 53864194 non-null float64
high 53864194 non-null float64
low 53864194 non-null float64
close 53864194 non-null float64
trades 53864194 non-null int64
volume 53864194 non-null int64
vwap 53852029 non-null float64
We can use plotly to quickly create an interactive candlestick plot for one day of AAPL data to view in a browser:
import plotly.graph_objects as go

idx = pd.IndexSlice
with pd.HDFStore('data.h5') as store:
    print(store.info())
    df = (store['1min_trades']
          .loc[idx['AAPL', '2017-12-29'], :]
          .reset_index())
fig = go.Figure(data=go.Ohlc(x=df.date_time,
                             open=df.open,
                             high=df.high,
                             low=df.low,
                             close=df.close))
fig.show()
Figure 2.8 shows the resulting static image:
Figure 2.8: Plotly candlestick plot
AlgoSeek also provides adjustment factors to correct pricing and volumes for stock splits, dividends, and other corporate actions.