Retrieving market data – quality and consistency as keys to success
Market data is often assumed to contribute nothing to the overall risk in systematic trading. This is a serious mistake. There are two key risks associated with market data:
- Issues with receiving data
- Issues with received data
In the next two subsections, we will dive deeper into the preceding risks.
Receiving data – when size does matter
We get market data in two forms: real-time and historical. In both cases, we obtain it from a data vendor, a broker, or directly from an exchange. The difference is that real-time data is used for actual trading (as it reflects what is going on in the market right now), whereas historical data is used only for research and development, to rebuild hypothetical trades and estimate the theoretical performance of a trading algorithm.
The issues with receiving data are mostly related to real-time data.
Before moving on, let’s define some common terminology that we will need when discussing market data and orders.
A request to buy an asset at a certain price is called a bid. It’s as if you went to a market and shouted, “I want to buy this asset at this price. Is anyone willing to sell it to me?”
A request to sell an asset at a certain price is called an ask or offer. It means that you are ready to sell it to anyone willing to accept your price.
In financial markets, both requests are realized by buy-side traders with a limit order (see Chapter 10, Types of Orders and Their Simulation in Python, for a detailed discussion on types of orders).
When another counterparty agrees to trade at the order price, a new trade is registered, and its information is included in the data stream and distributed to data vendors, brokers, and other recipients. Such a record is called a tick. In other words, a tick is the minimal piece of information in market data and normally consists of the following fields:
- Date
- Time
- Price
- Traded volume
- Counterparty 1
- Counterparty 2
The last two fields contain information about actual counterparties and are normally not disclosed or distributed to protect the market participants. Traded volume means the amount of the asset that was traded (number of contracts, or just the amount of money if we are talking about forex).
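To make the tick layout concrete, here is a minimal sketch of such a record in Python. The class name `Tick` and its field names are hypothetical and chosen only for illustration; real vendors use their own formats. Date and time are merged into a single `timestamp`, and the two counterparty fields default to `None` because they are normally not distributed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Tick:
    """One trade record from a data stream (illustrative layout only)."""
    timestamp: datetime                   # date and time of the trade
    price: float                          # price at which the trade took place
    volume: float                         # contracts traded, or money amount for forex
    counterparty1: Optional[str] = None   # normally not disclosed
    counterparty2: Optional[str] = None   # normally not disclosed


# Example values are made up for demonstration purposes.
tick = Tick(datetime(2023, 1, 2, 9, 30, 0, 125000), 1.0671, 100000.0)
print(tick.price, tick.volume, tick.counterparty1)
```

The record is declared frozen because a registered trade should not change once it has been received.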
The main problem with receiving market data in its raw form is that it’s simply overwhelmingly huge. There are so many market participants and so many trading venues that streaming all transactions for just one asset (which is also called a “financial instrument”) may easily reach megabytes per second – receiving it is already a challenge by itself (don’t worry, we are not going to work with data feeds of this sort in this book). Next, even if we are able to receive a data stream with such a throughput, we need to store and handle this data somehow, and thus a very fast database is required. And finally, we need to be able to process this amount of data at an adequate speed, so we need blazingly fast computers.
But there is good news. Although some strategies (mostly arbitrage and high-frequency trading) do require raw market data in the format just described (also frequently referred to as time and sales data) to identify trading opportunities, most directional trading algorithms are far less sensitive to missing information about each and every trade. So, data vendors provide data in a compressed format. This is possible because most raw market data contains sequences of ticks with identical prices, and removing the repeats won’t distort the price movements. Such sequences arise because many market participants may trade at the same price at almost the same time, so by excluding them, we lose information about each transaction but retain information about every change in price. Such a market data stream is often referred to as filtered or cleaned.
Besides that, some trades are made at the bid and others at the ask. While both the bid and the ask remain the same, these trades form sequences in which the price seems to change. In reality, the prices just alternate between two levels that are always separated by the difference between the bid and the ask, so this alternation doesn’t mean that the market price has changed. Such a phenomenon is called a bounce and is normally also excluded from cleaned data.
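The first kind of filtering described above, dropping runs of ticks with identical prices, can be sketched in a few lines. This is only an illustration under simplifying assumptions: a tick is reduced to a `(timestamp, price, volume)` tuple, and the function name `filter_repeated_prices` is hypothetical. Vendors apply their own, usually undisclosed, filtering rules.

```python
from typing import Iterable, Iterator, Tuple

# A tick is simplified here to (timestamp_in_seconds, price, volume).
TickTuple = Tuple[float, float, float]


def filter_repeated_prices(ticks: Iterable[TickTuple]) -> Iterator[TickTuple]:
    """Yield only the ticks whose price differs from the previous tick's price."""
    last_price = None
    for tick in ticks:
        price = tick[1]
        if price != last_price:
            yield tick
            last_price = price


# Made-up sample data: three trades at 100.0, two at 100.5, one at 100.0.
raw = [
    (0.00, 100.0, 5),
    (0.01, 100.0, 2),
    (0.02, 100.0, 7),
    (0.03, 100.5, 1),
    (0.04, 100.5, 3),
    (0.05, 100.0, 4),
]
filtered = list(filter_repeated_prices(raw))
print(filtered)  # three ticks remain, one per price change
```

Note that the volumes of the dropped ticks are lost in this sketch, which is exactly the trade-off described above: every price change is kept, but the information about individual transactions is not. Removing the bounce would additionally require knowing the bid and ask at the time of each trade, so it is not shown here.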
Some vendors and brokers go even further and send snapshots of the market data instead of a filtered data stream. A snapshot is sent at regular time intervals, for example, 100 ms or 1 s, and contains only the following information:
- Date
- Time
- Price at the beginning of the interval (also known as open, or just O)
- Maximum price during the interval (also known as high, or H)
- Minimum price during the interval (also known as low, or L)
- Price at the end of the interval (also known as close, or C)
- Traded volume
Therefore, instead of thousands of ticks, we receive only one record with seven data fields. This approach dramatically reduces the required throughput, but it is lossy: everything that happened inside the interval, apart from these four prices and the total volume, is discarded, so snapshot data may not be suitable for some strategies.
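The following sketch shows how ticks could be collapsed into snapshots of this kind. It assumes time-ordered ticks simplified to `(timestamp, price, volume)` tuples with the timestamp in seconds; the function name `to_snapshots` and the dictionary keys are hypothetical, and date and time are merged into a single `start` value for brevity.

```python
from typing import Dict, Iterable, List, Tuple

TickTuple = Tuple[float, float, float]  # (timestamp_in_seconds, price, volume)


def to_snapshots(ticks: Iterable[TickTuple], interval: float) -> List[Dict[str, float]]:
    """Aggregate time-ordered ticks into OHLC snapshots of `interval` seconds."""
    snapshots: List[Dict[str, float]] = []
    current_bucket = None
    for ts, price, volume in ticks:
        bucket = int(ts // interval)
        if bucket != current_bucket:
            # The first tick of a new interval opens a new snapshot.
            snapshots.append({
                "start": bucket * interval,
                "open": price, "high": price, "low": price, "close": price,
                "volume": volume,
            })
            current_bucket = bucket
        else:
            # Later ticks in the same interval update high, low, close, and volume.
            snap = snapshots[-1]
            snap["high"] = max(snap["high"], price)
            snap["low"] = min(snap["low"], price)
            snap["close"] = price
            snap["volume"] += volume
    return snapshots


# Made-up sample data: five ticks spread over two one-second intervals.
sample = [(0.1, 100.0, 5), (0.4, 101.0, 2), (0.9, 99.5, 3), (1.2, 100.5, 1), (1.8, 100.2, 4)]
bars = to_snapshots(sample, interval=1.0)
for bar in bars:
    print(bar)
```

In this sketch, an interval without any ticks produces no snapshot at all, so gaps in the output correspond to periods with no trades.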
Key takeaway
Be careful when choosing the source of data, especially for live trading, and always make sure it contains sufficient information for your strategy.
Received data – looking at it from a critical angle
After we have successfully received the data, we should make sure it makes sense. Often, data, especially tick data, contains erroneous prices. These prices may be received for a number of reasons, which we will discuss in detail in Chapter 5, Retrieving and Handling Market Data with Python.
Erroneous prices, otherwise known as non-market prices, may cause trouble for systematic traders because a single wrong quote may trigger an algorithm to buy or sell something, even though, according to the strategy logic, such a trade should never have happened.
Sometimes, these incorrect quotes can be seen if plotted on a chart. The human eye intuitively expects data points to be within a certain reasonable range and easily catches the outliers, as can be seen in the following chart:
Figure 1.2 – Non-market prices seen on a tick chart
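What the eye does on a chart can be roughly imitated in code. The sketch below flags a price as suspicious when it deviates from the median of the last few accepted prices by more than a given fraction. The function name `flag_non_market_prices`, the window of 5 ticks, and the 2% threshold are all assumptions made for illustration, not recommended settings.

```python
from collections import deque
from statistics import median
from typing import Iterable, List


def flag_non_market_prices(prices: Iterable[float], window: int = 5,
                           max_deviation: float = 0.02) -> List[bool]:
    """Return one flag per price: True if it looks like a non-market price."""
    recent = deque(maxlen=window)  # last accepted (non-flagged) prices
    flags = []
    for price in prices:
        if len(recent) == window:
            reference = median(recent)
            suspicious = abs(price - reference) / reference > max_deviation
        else:
            suspicious = False  # not enough history to judge yet
        flags.append(suspicious)
        if not suspicious:
            recent.append(price)  # flagged prices never enter the reference window
    return flags


# Made-up sample data with one obvious outlier (110.0).
prices = [100.0, 100.1, 100.05, 100.2, 100.15, 110.0, 100.1, 100.25]
flags = flag_non_market_prices(prices)
print(flags)
```

A check like this is only a first screen: a genuine sharp move in the market would be flagged as well, so flagged prices need further verification rather than silent removal.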
If we receive snapshots or other compressed data, there could be missing intervals during which we receive no quotes. This can happen for the following reasons:
- The market is closed (scheduled or due to an emergency)
- The data server is down
- The connection is broken
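Missing intervals in snapshot data can be detected by comparing consecutive timestamps with the expected interval. The following is a minimal sketch; the function name `find_gaps` and the `tolerance` parameter (how much longer than one interval a step may be before it counts as a gap) are hypothetical.

```python
from typing import List, Sequence, Tuple


def find_gaps(timestamps: Sequence[float], interval: float,
              tolerance: float = 0.5) -> List[Tuple[float, float]]:
    """Return (previous, next) timestamp pairs that are too far apart."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        # A step noticeably longer than one interval means snapshots are missing.
        if curr - prev > interval * (1 + tolerance):
            gaps.append((prev, curr))
    return gaps


# Made-up sample: one-second snapshots with two missing stretches.
snapshot_times = [0.0, 1.0, 2.0, 5.0, 6.0, 9.0]
gaps = find_gaps(snapshot_times, interval=1.0)
print(gaps)
```

This check only tells us that a gap exists, not why. Distinguishing a scheduled market close from a server or connection problem would require additional information, such as a trading calendar.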
Key takeaway
A robust trading app should have a module capable of checking data consistency and connection persistence.
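As a rough illustration of the connection part of such a module, the sketch below tracks when data last arrived and reports the feed as stale after a period of silence. The class name `FeedWatchdog`, its methods, and the `max_silence` parameter are hypothetical; a real application would also need to decide what to do when the feed goes stale.

```python
import time
from typing import Optional


class FeedWatchdog:
    """Tracks when data last arrived and reports whether the feed looks stale."""

    def __init__(self, max_silence: float) -> None:
        self.max_silence = max_silence           # seconds allowed without data
        self.last_seen: Optional[float] = None   # time of the last received data

    def on_data(self, now: Optional[float] = None) -> None:
        """Call this whenever a tick or snapshot arrives."""
        self.last_seen = time.monotonic() if now is None else now

    def is_stale(self, now: Optional[float] = None) -> bool:
        """True if nothing has arrived yet or the silence exceeds max_silence."""
        if self.last_seen is None:
            return True
        current = time.monotonic() if now is None else now
        return current - self.last_seen > self.max_silence


# Explicit times are passed here to keep the example deterministic.
watchdog = FeedWatchdog(max_silence=2.0)
watchdog.on_data(now=10.0)
print(watchdog.is_stale(now=11.0), watchdog.is_stale(now=13.5))
```

The optional `now` argument exists only to make the class easy to test; in live use, both methods would be called without it and rely on the monotonic clock.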
Alright, we are now aware of the operational risks and know how harmful incorrect handling of market data can be. Anything else? Of course, here comes the main risk: systemic risk.