Machine Learning for Algorithmic Trading

Market and Fundamental Data – Sources and Techniques

Data has always been an essential driver of trading, and traders have long made efforts to gain an advantage from access to superior information. These efforts date back at least to the rumors that the House of Rothschild benefited handsomely from bond purchases upon advance news about the British victory at Waterloo, which was carried by pigeons across the channel.

Today, investments in faster data access take the shape of the Go West consortium of leading high-frequency trading (HFT) firms that connects the Chicago Mercantile Exchange (CME) with Tokyo. The round-trip latency between the CME and the BATS (Better Alternative Trading System) exchanges in New York has dropped to close to the theoretical limit of eight milliseconds as traders compete to exploit arbitrage opportunities. At the same time, regulators and exchanges have started to introduce speed bumps that slow down trading to limit the adverse effects on competition of uneven access to information.

Traditionally, investors mostly relied on publicly available market and fundamental data. Efforts to create or acquire private datasets, for example, through proprietary surveys, were limited. Conventional strategies focus on equity fundamentals and build financial models on reported financials, possibly combined with industry or macro data to project earnings per share and stock prices. Alternatively, they leverage technical analysis to extract signals from market data using indicators computed from price and volume information.

Machine learning (ML) algorithms promise to exploit market and fundamental data more efficiently than human-defined rules and heuristics, particularly when combined with alternative data, which is the topic of the next chapter. We will illustrate how to apply ML algorithms ranging from linear models to recurrent neural networks (RNNs) to market and fundamental data and generate tradeable signals.

This chapter introduces market and fundamental data sources and explains how they reflect the environment in which they are created. The details of the trading environment matter not only for the proper interpretation of market data but also for the design and execution of your strategy and the implementation of realistic backtesting simulations.

We also illustrate how to access and work with trading and financial statement data from various sources using Python.

In particular, this chapter will cover the following topics:

How market data reflects the structure of the trading environment
Working with trade and quote data at minute frequency
Reconstructing an order book from tick data using Nasdaq ITCH
Summarizing tick data using various types of bars
Working with eXtensible Business Reporting Language (XBRL)-encoded electronic filings
Parsing and combining market and fundamental data to create a price-to-earnings (P/E) series
How to access various market and fundamental data sources using Python

You can find the code samples for this chapter and links to additional resources in the corresponding directory of the GitHub repository. The notebooks include color versions of the images.

Market data reflects its environment

Market data is the product of how traders place orders for a financial instrument directly or through intermediaries on one of the numerous marketplaces, how they are processed, and how prices are set by matching demand and supply. As a result, the data reflects the institutional environment of trading venues, including the rules and regulations that govern orders, trade execution, and price formation. See Harris (2003) for a global overview and Jones (2018) for details on the U.S. market.

Algorithmic traders use algorithms, including ML, to analyze the flow of buy and sell orders and the resulting volume and price statistics to extract trade signals that capture insights into, for example, demand-supply dynamics or the behavior of certain market participants.

We will first review institutional features that impact the simulation of a trading strategy during a backtest before we start working with actual tick data created by one such environment, namely Nasdaq.

Market microstructure – the nuts and bolts

Market microstructure studies how the institutional environment affects the trading process and shapes outcomes like price discovery, bid-ask spreads and quotes, intraday trading behavior, and transaction costs (Madhavan 2000; 2002). It is one of the fastest-growing fields of financial research, propelled by the rapid development of algorithmic and electronic trading.

Today, hedge funds sponsor in-house analysts to track the rapidly evolving, complex details and ensure execution at the best possible market prices and design strategies that exploit market frictions. We will provide only a brief overview of these key concepts before we dive into the data generated by trading. The references contain several sources that treat this subject in great detail.

How to trade – different types of orders

Traders can place various types of buy or sell orders. Some orders guarantee immediate execution, while others may state a price threshold or other conditions that trigger execution. Orders are typically valid for the same trading day unless specified otherwise.

A market order is intended for immediate execution of the order upon arrival at the trading venue, at the price that prevails at that moment. In contrast, a limit order only executes if the market price is higher than the limit for a sell limit order, or lower than the limit for a buy limit order. A stop order, in turn, only becomes active when the market price rises above a specified price for a buy stop order, or falls below a specified price for a sell order. A buy stop order can be used to limit the losses of short sales. Stop orders may also have limits.

Numerous other conditions can be attached to orders. For example, all or none orders prevent partial execution; they are filled only if a specified number of shares is available and can be valid for a day or longer. They require special handling and are not visible to market participants. Fill or kill orders also prevent partial execution but cancel if not executed immediately. Immediate or cancel orders immediately buy or sell the number of shares that are available and cancel the remainder. Not-held orders allow the broker to decide on the time and price of execution. Finally, the market on open/close orders executes on or near the opening or closing of the market. Partial executions are allowed.

Where to trade – from exchanges to dark pools

Securities trade in highly organized and regulated exchanges or with varying degrees of formality in over-the-counter (OTC) markets. An exchange is a central marketplace where buyers and sellers compete for the lowest ask and highest bid, respectively. Exchange regulations typically impose listing and reporting requirements to create transparency and attract more traders and liquidity. OTC markets, such as the Best Market (OTCQX) or the Venture Market (OTCQB), often have lower regulatory barriers. As a result, they are suitable for a far broader range of securities, including bonds or American Depositary Receipts (ADRs; equity listed on a foreign exchange, for example, for Nestlé, S.A.).

Exchanges may rely on bilateral trading or centralized order-driven systems that match all buy and sell orders according to certain rules. Many exchanges use intermediaries that provide liquidity by making markets in certain securities. These intermediaries include dealers that act as principals on their own behalf and brokers that trade as agents on behalf of others. Price formation may occur through auctions, such as in the New York Stock Exchange (NYSE), where the highest bid and lowest offer are matched, or through dealers who buy from sellers and sell to buyers.

Back in the day, companies either registered and traded mostly on the NYSE, or they traded on OTC markets like Nasdaq. On the NYSE, a sole specialist intermediated trades of a given security. The specialist received buy and sell orders via a broker and tracked limit orders in a central order book. Limit orders were executed with a priority based on price and time. Buy market orders routed to the specialist transacted with the lowest ask (and sell market orders routed to the specialist transacted with the highest bid) in the limit order book, prioritizing earlier limit orders in the case of ties. Access to all orders in the central order book allowed the specialist to publish the best bid, ask prices, and set market prices based on the overall buy-sell imbalance.

On Nasdaq, multiple market makers facilitated stock trades. Each dealer provided their best bid and ask price to a central quotation system and stood ready to transact the specified number of shares at the specified prices. Traders would route their orders to the market maker with the best quote via their broker. The competition for orders made execution at fair prices very likely. Market makers ensured a fair and orderly market, provided liquidity, and disseminated prices like specialists but only had access to the orders routed to them as opposed to market-wide supply and demand. This fragmentation could create difficulties in identifying fair value market prices.

Today, trading has fragmented; instead of two principal venues in the US, there are more than thirteen displayed trading venues, including exchanges and (unregulated) alternative trading systems (ATSs) such as electronic communication networks (ECNs). Each reports trades to the consolidated tape, but at different latencies. To make matters more difficult, the rules of engagement for each venue differ with several different pricing and queuing models.

The following table lists some of the larger global exchanges and the trading volumes for the 12 months ending 03/2018 in various asset classes, including derivatives. Typically, a minority of financial instruments account for most trading:

Exchange	Stocks
Exchange	Market cap (USD mn)	# Listed companies	Volume / day (USD mn)	# Shares / day ('000)	# Options / day ('000)
NYSE	23,138,626	2,294	78,410	6,122	1,546
Nasdaq — US	10,375,718	2,968	65,026	7,131	2,609
Japan Exchange Group Inc.	6,287,739	3,618	28,397	3,361	1
Shanghai Stock Exchange	5,022,691	1,421	34,736	9,801
Euronext	4,649,073	1,240	9,410	836	304
Hong Kong Exchanges and Clearing	4,443,082	2,186	12,031	1,174	516
LSE Group	3,986,413	2,622	10,398	1,011
Shenzhen Stock Exchange	3,547,312	2,110	40,244	14,443
Deutsche Boerse AG	2,339,092	506	7,825	475
BSE India Limited	2,298,179	5,439	602	1,105
National Stock Exchange of India Limited	2,273,286	1,952	5,092	10,355
BATS Global Markets - US					1,243
Chicago Board Options Exchange					1,811
International Securities Exchange					1,204

The ATSs mentioned previously include dozens of dark pools that allow traders to execute anonymously. They are estimated to account for 40 percent of all U.S. stock trades in 2017, compared with an estimated 16 percent in 2010. Dark pools emerged in the 1980s when the SEC allowed brokers to match buyers and sellers of big blocks of shares. The rise of high-frequency electronic trading and the 2007 SEC Order Protection rule that intended to spur competition and cut transaction costs through transparency as part of Regulation National Market System (Reg NMS) drove the growth of dark pools, as traders aimed to avoid the visibility of large trades (Mamudi 2017). Reg NMS also established the National Best Bid and Offer (NBBO) mandate for brokers to route orders to venues that offer the best price.

Some ATSs are called dark pools because they do not broadcast pre-trade data, including the presence, price, and amount of buy and sell orders as traditional exchanges are required to do. However, dark pools report information about trades to the Financial Industry Regulatory Authority (FINRA) after they occur. As a result, dark pools do not contribute to the process of price discovery until after trade execution but provide protection against various HFT strategies outlined in the first chapter.

In the next section, we will see how market data captures trading activity and reflect the institutional infrastructure in U.S. markets.

Working with high-frequency data

Two categories of market data cover the thousands of companies listed on U.S. exchanges that are traded under Reg NMS: the consolidated feed combines trade and quote data from each trading venue, whereas each individual exchange offers proprietary products with additional activity information for that particular venue.

In this section, we will first present proprietary order flow data provided by Nasdaq that represents the actual stream of orders, trades, and resulting prices as they occur on a tick-by-tick basis. Then, we will demonstrate how to regularize this continuous stream of data that arrives at irregular intervals into bars of a fixed duration. Finally, we will introduce AlgoSeek's equity minute bar data, which contains consolidated trade and quote information. In each case, we will illustrate how to work with the data using Python so that you can leverage these sources for your trading strategy.

How to work with Nasdaq order book data

The primary source of market data is the order book, which updates in real time throughout the day to reflect all trading activity. Exchanges typically offer this data as a real-time service for a fee; however, they may provide some historical data for free.

In the United States, stock markets provide quotes in three tiers, namely Level L1, L2, and L3, that offer increasingly granular information and capabilities:

Level 1 (L1): Real-time bid- and ask-price information, as available from numerous online sources.
Level 2 (L2): Adds information about bid and ask prices by specific market makers as well as the size and time of recent transactions for better insights into the liquidity of a given equity.
Level 3 (L3): Adds the ability to enter or change quotes, execute orders, and confirm trades and is available only to market makers and exchange member firms. Access to Level 3 quotes permits registered brokers to meet best execution requirements.

The trading activity is reflected in numerous messages about orders sent by market participants. These messages typically conform to the electronic Financial Information eXchange (FIX) communications protocol for the real-time exchange of securities transactions and market data or a native exchange protocol.

Communicating trades with the FIX protocol

Just like SWIFT is the message protocol for back-office (for example, in trade-settlement) messaging, the FIX protocol is the de facto messaging standard for communication before and during trade executions between exchanges, banks, brokers, clearing firms, and other market participants. Fidelity Investments and Salomon Brothers introduced FIX in 1992 to facilitate the electronic communication between broker-dealers and institutional clients who, until then, exchanged information over the phone.

It became popular in global equity markets before expanding into foreign exchange, fixed income and derivatives markets, and further into post-trade to support straight-through processing. Exchanges provide access to FIX messages as a real-time data feed that is parsed by algorithmic traders to track market activity and, for example, identify the footprint of market participants and anticipate their next move.

The sequence of messages allows for the reconstruction of the order book. The scale of transactions across numerous exchanges creates a large amount (~10 TB) of unstructured data that is challenging to process and, hence, can be a source of competitive advantage.

The FIX protocol, currently at version 5.0, is a free and open standard with a large community of affiliated industry professionals. It is self-describing, like the more recent XML, and a FIX session is supported by the underlying Transmission Control Protocol (TCP) layer. The community continually adds new functionality.

The protocol supports pipe-separated key-value pairs, as well as a tag-based FIXML syntax. A sample message that requests a server login would look as follows:

8=FIX.5.0|9=127|35=A|59=theBroker.123456|56=CSERVER|34=1|32=20180117- 08:03:04|57=TRADE|50=any_string|98=2|108=34|141=Y|553=12345|554=passw0rd!|10=131|

There are a few open source FIX implementations in Python that can be used to formulate and parse FIX messages. The service provider Interactive Brokers offers a FIX-based computer-to-computer interface (CTCI) for automated trading (refer to the resources section for this chapter in the GitHub repository).

The Nasdaq TotalView-ITCH data feed

While FIX has a dominant market share, exchanges also offer native protocols. Nasdaq offers a TotalView-ITCH direct data-feed protocol, which allows subscribers to track individual orders for equity instruments from placement to execution or cancellation.

Historical records of this data flow permit the reconstruction of the order book that keeps track of the active limit orders for a specific security. The order book reveals the market depth throughout the day by listing the number of shares being bid or offered at each price point. It may also identify the market participant responsible for specific buy and sell orders unless they are placed anonymously. Market depth is a key indicator of liquidity and the potential price impact of sizable market orders.

In addition to matching market and limit orders, Nasdaq also operates auctions or crosses that execute a large number of trades at market opening and closing. Crosses are becoming more important as passive investing continues to grow and traders look for opportunities to execute larger blocks of stock. TotalView also disseminates the Net Order Imbalance Indicator (NOII) for Nasdaq opening and closing crosses and Nasdaq IPO/Halt Cross.

How to parse binary order messages

The ITCH v5.0 specification declares over 20 message types related to system events, stock characteristics, the placement and modification of limit orders, and trade execution. It also contains information about the net order imbalance before the open and closing cross.

Nasdaq offers samples of daily binary files for several months. The GitHub repository for this chapter contains a notebook, parse_itch_order_flow_messages.ipynb, that illustrates how to download and parse a sample file of ITCH messages. The notebook rebuild_nasdaq_order_book.ipynb then goes on to reconstruct both the executed trades and the order book for any given ticker.

The following table shows the frequency of the most common message types for the sample file date October 30, 2019:

Message type	Order book impact	Number of messages
A	New unattributed limit order	127,214,649
D	Order canceled	123,296,742
U	Order canceled and replaced	25,513,651
E	Full or partial execution; possibly multiple messages for the same original order	7,316,703
X	Modified after partial cancellation	3,568,735
F	Add attributed order	1,423,908
P	Trade message (non-cross)	1,525,363
C	Executed in whole or in part at a price different from the initial display price	129,729
Q	Cross trade message	17,775

For each message, the specification lays out the components and their respective length and data types:

Name	Offset	Length	Value	Notes
Message type	0	1	S	System event message.
Stock locate	1	2	Integer	Always 0.
Tracking number	3	2	Integer	Nasdaq internal tracking number.
Timestamp	5	6	Integer	The number of nanoseconds since midnight.
Order reference number	11	8	Integer	The unique reference number assigned to the new order at the time of receipt.
Buy/sell indicator	19	1	Alpha	The type of order being added: B = Buy Order, and S = Sell Order.
Shares	20	4	Integer	The total number of shares associated with the order being added to the book.
Stock	24	8	Alpha	Stock symbol, right - padded with spaces.
Price	32	4	Price (4)	The display price of the new order. Refer to Data Types in the specification for field processing notes.
Attribution	36	4	Alpha	The Nasdaq market participant identifier associated with the entered order.

Python provides the struct module to parse binary data using format strings that identify the message elements by indicating the length and type of the various components of the byte string as laid out in the specification.

Let's walk through the critical steps required to parse the trading messages and reconstruct the order book:

The ITCH parser relies on the message specifications provided in the file message_types.xlsx (refer to the notebook parse_itch_order_flow_messages.ipynb for details). It assembles format strings according to the formats dictionary:

formats = {
    ('integer', 2): 'H',  # int of length 2 => format string 'H'
    ('integer', 4): 'I',
    ('integer', 6): '6s', # int of length 6 => parse as string, 
      convert later
    ('integer', 8): 'Q',
    ('alpha', 1)  : 's',
    ('alpha', 2)  : '2s',
    ('alpha', 4)  : '4s',
    ('alpha', 8)  : '8s',
    ('price_4', 4): 'I',
    ('price_8', 8): 'Q',
}

The parser translates the message specs into format strings and named tuples that capture the message content:

# Get ITCH specs and create formatting (type, length) tuples
specs = pd.read_csv('message_types.csv')
specs['formats'] = specs[['value', 'length']].apply(tuple, 
                           axis=1).map(formats)
# Formatting for alpha fields
alpha_fields = specs[specs.value == 'alpha'].set_index('name')
alpha_msgs = alpha_fields.groupby('message_type')
alpha_formats = {k: v.to_dict() for k, v in alpha_msgs.formats}
alpha_length = {k: v.add(5).to_dict() for k, v in alpha_msgs.length}
# Generate message classes as named tuples and format strings
message_fields, fstring = {}, {}
for t, message in specs.groupby('message_type'):
    message_fields[t] = namedtuple(typename=t,
                                  field_names=message.name.tolist())
    fstring[t] = '>' + ''.join(message.formats.tolist())

Fields of the alpha type require postprocessing, as defined in the format_alpha function:

def format_alpha(mtype, data):
    """Process byte strings of type alpha"""
    for col in alpha_formats.get(mtype).keys():
        if mtype != 'R' and col == 'stock':
            data = data.drop(col, axis=1)
            continue
        data.loc[:, col] = (data.loc[:, col]
                            .str.decode("utf-8")
                            .str.strip())
        if encoding.get(col):
            data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
    return data

The binary file for a single day contains over 300,000,000 messages that are worth over 9 GB. The script appends the parsed result iteratively to a file in the fast HDF5 format to avoid memory constraints. (Refer to the Efficient data storage with pandas section later in this chapter for more information on the HDF5 format.)

The following (simplified) code processes the binary file and produces the parsed orders stored by message type:

with (data_path / file_name).open('rb') as data:
    while True:
        message_size = int.from_bytes(data.read(2), byteorder='big', 
                       signed=False)
        message_type = data.read(1).decode('ascii')
        message_type_counter.update([message_type])
        record = data.read(message_size - 1)
        message = message_fields[message_type]._make(
            unpack(fstring[message_type], record))
        messages[message_type].append(message)
            
        # deal with system events like market open/close
        if message_type == 'S':
            timestamp = int.from_bytes(message.timestamp, 
                                       byteorder='big')
            if message.event_code.decode('ascii') == 'C': # close
                store_messages(messages)
                break

Summarizing the trading activity for all 8,500 stocks

As expected, a small number of the 8,500-plus securities traded on this day account for most trades:

with pd.HDFStore(itch_store) as store:
    stocks = store['R'].loc[:, ['stock_locate', 'stock']]
    trades = (store['P'].append(
            store['Q'].rename(columns={'cross_price': 'price'}),
            sort=False).merge(stocks))
trades['value'] = trades.shares.mul(trades.price)
trades['value_share'] = trades.value.div(trades.value.sum())
trade_summary = (trades.groupby('stock').value_share
                 .sum().sort_values(ascending=False))
trade_summary.iloc[:50].plot.bar(figsize=(14, 6),
                                 color='darkblue',
                                 title='Share of Traded Value')
f = lambda y, _: '{:.0%}'.format(y)
plt.gca().yaxis.set_major_formatter(FuncFormatter(f))

Figure 2.1 shows the resulting plot:

Figure 2.1: The share of traded value of the 50 most traded securities

How to reconstruct all trades and the order book

The parsed messages allow us to rebuild the order flow for the given day. The 'R' message type contains a listing of all stocks traded during a given day, including information about initial public offerings (IPOs) and trading restrictions.

Throughout the day, new orders are added, and orders that are executed and canceled are removed from the order book. The proper accounting for messages that reference orders placed on a prior date would require tracking the order book over multiple days.

The get_messages() function illustrates how to collect the orders for a single stock that affects trading. (Refer to the ITCH specification for details about each message.) The code is slightly simplified; refer to the notebook rebuild_nasdaq_order_book.ipynb for further details:

def get_messages(date, stock=stock):
    """Collect trading messages for given stock"""
    with pd.HDFStore(itch_store) as store:
        stock_locate = store.select('R', where='stock = 
                                     stock').stock_locate.iloc[0]
        target = 'stock_locate = stock_locate'
        data = {}
        # relevant message types
        messages = ['A', 'F', 'E', 'C', 'X', 'D', 'U', 'P', 'Q']
        for m in messages:
            data[m] = store.select(m,  
              where=target).drop('stock_locate', axis=1).assign(type=m)
    order_cols = ['order_reference_number', 'buy_sell_indicator', 
                  'shares', 'price']
    orders = pd.concat([data['A'], data['F']], sort=False,  
                        ignore_index=True).loc[:, order_cols]
    for m in messages[2: -3]:
        data[m] = data[m].merge(orders, how='left')
    data['U'] = data['U'].merge(orders, how='left',
                                right_on='order_reference_number',
                                left_on='original_order_reference_number',
                                suffixes=['', '_replaced'])
    data['Q'].rename(columns={'cross_price': 'price'}, inplace=True)
    data['X']['shares'] = data['X']['cancelled_shares']
    data['X'] = data['X'].dropna(subset=['price'])
    data = pd.concat([data[m] for m in messages], ignore_index=True, 
                      sort=False)

Reconstructing successful trades—that is, orders that were executed as opposed to those that were canceled from trade-related message types C, E, P, and Q—is relatively straightforward:

def get_trades(m):
    """Combine C, E, P and Q messages into trading records"""
    trade_dict = {'executed_shares': 'shares', 'execution_price': 'price'}
    cols = ['timestamp', 'executed_shares']
    trades = pd.concat([m.loc[m.type == 'E',
                              cols + ['price']].rename(columns=trade_dict),
                        m.loc[m.type == 'C',
                              cols + ['execution_price']]
                        .rename(columns=trade_dict),
                        m.loc[m.type == 'P', ['timestamp', 'price',
                                              'shares']],
                        m.loc[m.type == 'Q',
                              ['timestamp', 'price', 'shares']]
                        .assign(cross=1), ],
                       sort=False).dropna(subset=['price']).fillna(0)
    return trades.set_index('timestamp').sort_index().astype(int)

The order book keeps track of limit orders, and the various price levels for buy and sell orders constitute the depth of the order book. Reconstructing the order book for a given level of depth requires the following steps:

The add_orders() function accumulates sell orders in ascending order and buy orders in descending order for a given timestamp up to the desired level of depth:

def add_orders(orders, buysell, nlevels):
    new_order = []
    items = sorted(orders.copy().items())
    if buysell == 1:
        items = reversed(items)  
    for i, (p, s) in enumerate(items, 1):
        new_order.append((p, s))
        if i == nlevels:
            break
    return orders, new_order

We iterate over all ITCH messages and process orders and their replacements as required by the specification:

for message in messages.itertuples():
    i = message[0]
    if np.isnan(message.buy_sell_indicator):
        continue
    message_counter.update(message.type)
    buysell = message.buy_sell_indicator
    price, shares = None, None
    if message.type in ['A', 'F', 'U']:
        price, shares = int(message.price), int(message.shares)
        current_orders[buysell].update({price: shares})
        current_orders[buysell], new_order = 
          add_orders(current_orders[buysell], buysell, nlevels)
        order_book[buysell][message.timestamp] = new_order
    if message.type in ['E', 'C', 'X', 'D', 'U']:
        if message.type == 'U':
            if not np.isnan(message.shares_replaced):
                price = int(message.price_replaced)
                shares = -int(message.shares_replaced)
        else:
            if not np.isnan(message.price):
                price = int(message.price)
                shares = -int(message.shares)
        if price is not None:
            current_orders[buysell].update({price: shares})
            if current_orders[buysell][price] <= 0:
                current_orders[buysell].pop(price)
            current_orders[buysell], new_order = 
              add_orders(current_orders[buysell], buysell, nlevels)
            order_book[buysell][message.timestamp] = new_order

Figure 2.2 highlights the depth of liquidity at any given point in time using different intensities that visualize the number of orders at different price levels. The left panel shows how the distribution of limit order prices was weighted toward buy orders at higher prices.

The right panel plots the evolution of limit orders and prices throughout the trading day: the dark line tracks the prices for executed trades during market hours, whereas the red and blue dots indicate individual limit orders on a per-minute basis (refer to the notebook for details):

Figure 2.2: AAPL market liquidity according to the order book

From ticks to bars – how to regularize market data

The trade data is indexed by nanoseconds, arrives at irregular intervals, and is very noisy. The bid-ask bounce, for instance, causes the price to oscillate between the bid and ask prices when trade initiation alternates between buy and sell market orders. To improve the noise-signal ratio and the statistical properties of the price series, we need to resample and regularize the tick data by aggregating the trading activity.

We typically collect the open (first), high, low, and closing (last) price and volume (jointly abbreviated as OHLCV) for the aggregated period, alongside the volume-weighted average price (VWAP) and the timestamp associated with the data.

Refer to the normalize_tick_data.ipynb notebook in the folder for this chapter on GitHub for additional details.

The raw material – tick bars

The following code generates a plot of the raw tick price and volume data for AAPL:

stock, date = 'AAPL', '20191030'
title = '{} | {}'.format(stock, pd.to_datetime(date).date()
with pd.HDFStore(itch_store) as store:
    sys_events = store['S'].set_index('event_code') # system events
    sys_events.timestamp = sys_events.timestamp.add(pd.to_datetime(date)).dt.time
    market_open = sys_events.loc['Q', 'timestamp'] 
    market_close = sys_events.loc['M', 'timestamp']
with pd.HDFStore(stock_store) as store:
    trades = store['{}/trades'.format(stock)].reset_index()
trades = trades[trades.cross == 0] # excluding data from open/close crossings
trades.price = trades.price.mul(1e-4) # format price
trades = trades[trades.cross == 0]    # exclude crossing trades
trades = trades.between_time(market_open, market_close) # market hours only
tick_bars = trades.set_index('timestamp')
tick_bars.index = tick_bars.index.time
tick_bars.price.plot(figsize=(10, 5), title=title), lw=1)

Figure 2.3 displays the resulting plot:

Figure 2.3: Tick bars

The tick returns are far from normally distributed, as evidenced by the low p-value of scipy.stats.normaltest:

from scipy.stats import normaltest
normaltest(tick_bars.price.pct_change().dropna())
NormaltestResult(statistic=62408.76562431228, pvalue=0.0)

Plain-vanilla denoising – time bars

Time bars involve trade aggregation by period. The following code gets the data for the time bars:

def get_bar_stats(agg_trades):
    vwap = agg_trades.apply(lambda x: np.average(x.price, 
           weights=x.shares)).to_frame('vwap')
    ohlc = agg_trades.price.ohlc()
    vol = agg_trades.shares.sum().to_frame('vol')
    txn = agg_trades.shares.size().to_frame('txn')
    return pd.concat([ohlc, vwap, vol, txn], axis=1)
resampled = trades.groupby(pd.Grouper(freq='1Min'))
time_bars = get_bar_stats(resampled)

We can display the result as a price-volume chart:

def price_volume(df, price='vwap', vol='vol', suptitle=title, fname=None):
    fig, axes = plt.subplots(nrows=2, sharex=True, figsize=(15, 8))
    axes[0].plot(df.index, df[price])
    axes[1].bar(df.index, df[vol], width=1 / (len(df.index)), 
                color='r')
    xfmt = mpl.dates.DateFormatter('%H:%M')
    axes[1].xaxis.set_major_locator(mpl.dates.HourLocator(interval=3))
    axes[1].xaxis.set_major_formatter(xfmt)
    axes[1].get_xaxis().set_tick_params(which='major', pad=25)
    axes[0].set_title('Price', fontsize=14)
    axes[1].set_title('Volume', fontsize=14)
    fig.autofmt_xdate()
    fig.suptitle(suptitle)
    fig.tight_layout()
    plt.subplots_adjust(top=0.9)
price_volume(time_bars)

The preceding code produces Figure 2.4:

Figure 2.4: Time bars

Alternatively, we can represent the data as a candlestick chart using the Bokeh plotting library:

resampled = trades.groupby(pd.Grouper(freq='5Min')) # 5 Min bars for better print
df = get_bar_stats(resampled)
increase = df.close > df.open
decrease = df.open > df.close
w = 2.5 * 60 * 1000 # 2.5 min in ms
WIDGETS = "pan, wheel_zoom, box_zoom, reset, save"
p = figure(x_axis_type='datetime', tools=WIDGETS, plot_width=1500, 
          title = "AAPL Candlestick")
p.xaxis.major_label_orientation = pi/4
p.grid.grid_line_alpha=0.4
p.segment(df.index, df.high, df.index, df.low, color="black")
p.vbar(df.index[increase], w, df.open[increase], df.close[increase], 
       fill_color="#D5E1DD", line_color="black")
p.vbar(df.index[decrease], w, df.open[decrease], df.close[decrease], 
       fill_color="#F2583E", line_color="black")
show(p)

This produces the plot in Figure 2.5:

Figure 2.5: Bokeh candlestick plot

Accounting for order fragmentation – volume bars

Time bars smooth some of the noise contained in the raw tick data but may fail to account for the fragmentation of orders. Execution-focused algorithmic trading may aim to match the volume-weighted average price (VWAP) over a given period. This will divide a single order into multiple trades and place orders according to historical patterns. Time bars would treat the same order differently, even though no new information has arrived in the market.

Volume bars offer an alternative by aggregating trade data according to volume. We can accomplish this as follows:

min_per_trading_day = 60 * 7.5
trades_per_min = trades.shares.sum() / min_per_trading_day
trades['cumul_vol'] = trades.shares.cumsum()
df = trades.reset_index()
by_vol = (df.groupby(df.cumul_vol.
                     div(trades_per_min)
                     .round().astype(int)))
vol_bars = pd.concat([by_vol.timestamp.last().to_frame('timestamp'),
                      get_bar_stats(by_vol)], axis=1)
price_volume(vol_bars.set_index('timestamp'))

We get the plot in Figure 2.6 for the preceding code:

Figure 2.6: Volume bars

Accounting for price changes – dollar bars

When asset prices change significantly, or after stock splits, the value of a given amount of shares changes. Volume bars do not correctly reflect this and can hamper the comparison of trading behavior for different periods that reflect such changes. In these cases, the volume bar method should be adjusted to utilize the product of shares and prices to produce dollar bars.

The following code shows the computation for dollar bars:

value_per_min = trades.shares.mul(trades.price).sum()/(60*7.5) # min per trading day
trades['cumul_val'] = trades.shares.mul(trades.price).cumsum()
df = trades.reset_index()
by_value = df.groupby(df.cumul_val.div(value_per_min).round().astype(int))
dollar_bars = pd.concat([by_value.timestamp.last().to_frame('timestamp'), get_bar_stats(by_value)], axis=1)
price_volume(dollar_bars.set_index('timestamp'), 
             suptitle=f'Dollar Bars | {stock} | {pd.to_datetime(date).date()}')

The plot looks quite similar to the volume bar since the price has been fairly stable throughout the day:

Figure 2.7: Dollar bars

AlgoSeek minute bars – equity quote and trade data

AlgoSeek provides historical intraday data of the quality previously available only to institutional investors. The AlgoSeek Equity bars provide very detailed intraday quote and trade data in a user-friendly format, which is aimed at making it easy to design and backtest intraday ML-driven strategies. As we will see, the data includes not only OHLCV information but also information on the bid-ask spread and the number of ticks with up and down price moves, among others.

AlgoSeek has been so kind as to provide samples of minute bar data for the Nasdaq 100 stocks from 2013-2017 for demonstration purposes and will make a subset of this data available to readers of this book.

In this section, we will present the available trade and quote information and show how to process the raw data. In later chapters, we will demonstrate how you can use this data for ML-driven intraday strategies.

From the consolidated feed to minute bars

AlgoSeek minute bars are based on data provided by the Securities Information Processor (SIP), which manages the consolidated feed mentioned at the beginning of this section. You can view the documentation at https://www.algoseek.com/samples/.

The SIP aggregates the best bid and offers quotes from each exchange, as well as the resulting trades and prices. Exchanges are prohibited by law from sending their quotes and trades to direct feeds before sending them to the SIP. Given the fragmented nature of U.S. equity trading, the consolidated feed provides a convenient snapshot of the current state of the market.

More importantly, the SIP acts as the benchmark used by regulators to determine the National Best Bid and Offer (NBBO) according to Reg NMS. The OHLC bar quote prices are based on the NBBO, and each bid or ask quote price refers to an NBBO price.

Every exchange publishes its top-of-book price and the number of shares available at that price. The NBBO changes when a published quote improves the NBBO. Bid/ask quotes persist until there is a change due to trade, price improvement, or the cancelation of the latest bid or ask. While historical OHLC bars are often based on trades during the bar period, NBBO bid/ask quotes may be carried forward from the previous bar until there is a new NBBO event.

AlgoSeek bars cover the whole trading day, from the opening of the first exchange until the closing of the last exchange. Bars outside regular market hours normally exhibit limited activity. Trading hours, in Eastern Time, are:

Premarket: Approximately 04:00:00 (this varies by exchange) to 09:29:59
Market: 09:30:00 to 16:00:00
Extended hours: 16:00:01 to 20:00:00

Quote and trade data fields

The minute bar data contains up to 54 fields. There are eight fields for the open, high, low, and close elements of the bar, namely:

The timestamp for the bar and the corresponding trade
The price and the size for the prevailing bid-ask quote and the relevant trade

There are also 14 data points with volume information for the bar period:

The number of shares and corresponding trades
The trade volumes at or below the bid, between the bid quote and the midpoint, at the midpoint, between the midpoint and the ask quote, and at or above the ask, as well as for crosses
The number of shares traded with upticks or downticks, that is, when the price rose or fell, as well as when the price did not change, differentiated by the previous direction of price movement

The AlgoSeek data also contains the number of shares reported to FINRA and processed internally at broker-dealers, by dark pools, or OTC. These trades represent volume that is hidden or not publicly available until after the fact.

Finally, the data includes the volume-weighted average price (VWAP) and minimum and maximum bid-ask spread for the bar period.

How to process AlgoSeek intraday data

In this section, we'll process the AlgoSeek sample data. The data directory on GitHub contains instructions on how to download that data from AlgoSeek.

The minute bar data comes in four versions: with and without quote information, and with or without FINRA's reported volume. There is one zipped folder per day, containing one CSV file per ticker.

The following code example extracts the trade-only minute bar data into daily .parquet files:

directories = [Path(d) for d in ['1min_trades']]
target = directory / 'parquet'
for zipped_file in directory.glob('*/**/*.zip'):
    fname = zipped_file.stem
    print('\t', fname)
    zf = ZipFile(zipped_file)
    files = zf.namelist()
    data = (pd.concat([pd.read_csv(zf.open(f),
                                   parse_dates=[['Date',
                                                 'TimeBarStart']])
                       for f in files],
                      ignore_index=True)
            .rename(columns=lambda x: x.lower())
            .rename(columns={'date_timebarstart': 'date_time'})
            .set_index(['ticker', 'date_time']))
    data.to_parquet(target / (fname + '.parquet'))

We can combine the parquet files into a single piece of HDF5 storage as follows, yielding 53.8 million records that consume 3.2 GB of memory and covering 5 years and 100 stocks:

path = Path('1min_trades/parquet')
df = pd.concat([pd.read_parquet(f) for f in path.glob('*.parquet')]).dropna(how='all', axis=1)
df.columns = ['open', 'high', 'low', 'close', 'trades', 'volume', 'vwap']
df.to_hdf('data.h5', '1min_trades')
print(df.info(null_counts=True))
MultiIndex: 53864194 entries, (AAL, 2014-12-22 07:05:00) to (YHOO, 2017-06-16 19:59:00)
Data columns (total 7 columns):
open      53864194 non-null float64
high      53864194 non-null float64
Low       53864194 non-null float64
close     53864194 non-null float64
trades    53864194 non-null int64
volume    53864194 non-null int64
vwap      53852029 non-null float64

We can use plotly to quickly create an interactive candlestick plot for one day of AAPL data to view in a browser:

idx = pd.IndexSlice
with pd.HDFStore('data.h5') as store:
    print(store.info())
    df = (store['1min_trades']
          .loc[idx['AAPL', '2017-12-29'], :]
          .reset_index())
fig = go.Figure(data=go.Ohlc(x=df.date_time,
                             open=df.open,
                             high=df.high,
                             low=df.low,
                             close=df.close))

Figure 2.8 shows the resulting static image:

Figure 2.8: Plotly candlestick plot

AlgoSeek also provides adjustment factors to correct pricing and volumes for stock splits, dividends, and other corporate actions.

Filter reviews by

All

Feefo verified reviews

Amazon verified reviews

Alexandr Gonchar Oct 27, 2020

The media could not be loaded. There is a good amount of literature on machine learning algorithms fundamentals and advances and the same I can tell about finance and econometrics books. The literature on the intersection of these fields is used to be very highly specialized and rather too complicated for the newcomers.This new book "Machine Learning for Algorithmic Trading" aims exactly to fill this gap and guides a reader through a clear roadmap:- getting and cleaning the data;- extracting predictive signals;- build trading strategies;- build portfolios of assets and strategies;- test their performance historically and in the simulations.Last but not least, this book is relying on popular and proven Python libraries and solutions alongside state-of-the-art techniques as generative adversarial networks for simulations and reinforcement learning for active trading. Highly recommend for new-coming practitioners and seasoned players in the field!

Amazon Verified review

Kutschera Apr 06, 2021

Good read

Yuxing Yan Aug 18, 2020

A very good textbook for quantitative finance major students. If we have an MSF program, I will definitively adopt it as my textbook. Since all Python programs are available at Github, it will help me and my students to replicate many trading strategies explained in the book. It is my habit to replicate others’ results first, then try to modify their programs according to my own needs.

Akshit shah Aug 15, 2020

As a machine learning practitioner.I always wanted to go into the trading space and apply my knowledge, I have been looking up the internet and had so many overwhelming knowledge. It was really difficult to find an ideal book. Then I came across this book and It helped me understand different ways and how to develop different models and understand the use of it in algorithmic trading. Thanks a lot for this book

Frank S Dec 03, 2022

For example, yes all of the photos are black and white. However in the preface there's very clear instruction on where you can find the color versions (PDFs in the github repo, for those who eschew prefaces) and if you intend to use any of the Python code that goes along with this tome, you'll see the color versions can often also be found in the Jupyter notebooks - a fact frequently referenced in the first two chapters.Second, getting python environments up and running smoothly is, unfortunately, rarely a very easy task. This is certainly not exclusive to this particular use case.If I had any complaint at all about the book, it's that it is overly thorough so you may find yourself slogging through some tedium as you begin. It's not broken up in a way that easily allows for skipping ahead (at least not for my prior knowledge set). In the first couple chapters I've found I've needed - on average - every other paragraph and that the subject matter is the source of the dryness, not the author's use of language; which so far has been smooth and flows far more gracefully than my own.I will update after more extensive use of the author's code.

Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python , Second Edition

What do you get with Print?

Machine Learning for Algorithmic Trading

Market and Fundamental Data – Sources and Techniques

Market data reflects its environment

Market microstructure – the nuts and bolts

How to trade – different types of orders

Where to trade – from exchanges to dark pools

Working with high-frequency data

How to work with Nasdaq order book data

Communicating trades with the FIX protocol

The Nasdaq TotalView-ITCH data feed

How to parse binary order messages

Summarizing the trading activity for all 8,500 stocks

How to reconstruct all trades and the order book

From ticks to bars – how to regularize market data

The raw material – tick bars

Plain-vanilla denoising – time bars

Accounting for order fragmentation – volume bars

Accounting for price changes – dollar bars

AlgoSeek minute bars – equity quote and trade data

From the consolidated feed to minute bars

Quote and trade data fields

How to process AlgoSeek intraday data

Page 1 of 7

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs

Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python , Second Edition

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs