Machine learning-enabled strategies are proliferating in electronic trading. Trading firms are moving machine learning into the hot path of the ultra-low-latency trading cycle - a task that remains challenging. Five microseconds of delay can make the difference between a successful and an unsuccessful trading strategy. Machine learning inference often accounts for most of the latency of the trading loop, and in most cases it does not operate at microsecond latency. In this article, we take a closer look at low-latency machine learning. We focus on XGBoost, LightGBM and CatBoost, which are widely used frameworks in many different applications and which can provide ML inference in microseconds if done right.
These algorithms share a commonality: they are based on learned decision tree ensembles, which are trained on existing datasets and then applied to new incoming data (prediction / inference phase). The input is a vector of numerical and/or categorical values, called features, which are often the outcome of a preprocessing step. Using ensembles of many decision trees (1000s) with a voting mechanism boosts prediction accuracy and helps prevent overfitting on the training data.
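To make this workflow concrete, the following minimal sketch (in Python, using the open-source LightGBM package; the synthetic data and parameter values are illustrative assumptions, not the models used in the benchmark below) trains a small decision tree ensemble and then applies it to a single new feature vector:

```python
import numpy as np
import lightgbm as lgb

# Synthetic training data: 10,000 samples with 8 numerical features
# (in practice the features come from a preprocessing step).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(10_000, 8))
y_train = 0.5 * X_train[:, 0] - X_train[:, 3] ** 2 + rng.normal(scale=0.1, size=10_000)

# Train a gradient-boosted ensemble of 1,000 decision trees.
params = {"objective": "regression", "num_leaves": 31, "learning_rate": 0.05, "verbose": -1}
model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=1_000)

# Inference: apply the trained ensemble to one new incoming feature vector.
x_new = rng.normal(size=(1, 8))
print(model.predict(x_new))
```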
A trained XGBoost model enables a systematic analysis of financial data to identify complex patterns, correlations, and trends in price movements. For example, based on historical data, short-term stock price movements can be predicted more accurately than with traditional methods. The ML model helps decipher the intricate details of order flow, bid-ask spreads, and liquidity dynamics in order to predict price fluctuations.
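As an illustration of the kind of preprocessing involved, the sketch below (Python; the feature definitions and input values are hypothetical examples, not taken from any benchmark) turns one order-book snapshot into a numerical feature vector that a trained model could consume:

```python
import numpy as np

def order_book_features(bid_px, bid_qty, ask_px, ask_qty, last_trades):
    """Build a numerical feature vector from one order-book snapshot.

    Inputs are top-of-book arrays (best levels first) plus the most
    recent trade prices; the feature choices are purely illustrative.
    """
    spread = ask_px[0] - bid_px[0]                                     # bid-ask spread
    mid = 0.5 * (ask_px[0] + bid_px[0])                                # mid price
    imbalance = (bid_qty[0] - ask_qty[0]) / (bid_qty[0] + ask_qty[0])  # order-flow imbalance
    depth = bid_qty.sum() + ask_qty.sum()                              # visible liquidity
    momentum = last_trades[-1] - last_trades[0]                        # short-term trade drift
    return np.array([spread, mid, imbalance, depth, momentum], dtype=np.float64)

x = order_book_features(
    bid_px=np.array([100.01, 100.00]), bid_qty=np.array([300.0, 500.0]),
    ask_px=np.array([100.03, 100.04]), ask_qty=np.array([200.0, 400.0]),
    last_trades=np.array([100.00, 100.02, 100.03]),
)
# `x` would then be passed to a trained model, e.g. model.predict(x.reshape(1, -1)).
```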
The inference latency of a machine learning model is measured as the time between the incoming data being sent to the model and the prediction result being received from the model. The many different machine learning / AI algorithms vary widely in the inference latency of their models. Large Language Models (LLMs) are very sophisticated but have inference latencies on the order of seconds. Long Short-Term Memory (LSTM) models are popular methods for analyzing time series data.
However, their inference latency does not go below a few tens of microseconds, even when they run on highly optimized FPGA accelerators. Here we focus on Gradient Boosting Machines (GBMs), of which XGBoost, LightGBM, and CatBoost are the best-known representatives. GBMs provide a prediction accuracy comparable to that of LSTMs. As we shall see below, GBM models can provide inference latencies in the single-digit microsecond range (sub-microsecond in certain circumstances).
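To illustrate how the inference latency defined above can be measured in software, the following sketch timestamps a single-vector prediction on the CPU. It assumes the LightGBM `model` trained in the earlier sketch; the absolute numbers depend heavily on hardware and model size and are not the accelerator figures reported below.

```python
import time
import numpy as np

x_new = np.random.default_rng(0).normal(size=(1, 8))

# Warm-up call so one-time initialization is not counted.
model.predict(x_new)

# Measure the time from handing the input to the model until the
# prediction is returned, repeated to obtain a median latency.
samples = []
for _ in range(10_000):
    t0 = time.perf_counter()
    model.predict(x_new)  # `model` is the ensemble from the earlier sketch
    samples.append(time.perf_counter() - t0)

print(f"median inference latency: {np.median(samples) * 1e6:.1f} us")
```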
Ultra-low-latency XGBoost inference is only achievable with special-purpose accelerator chips. In this case, the latency comprises not only the execution time of the machine learning model itself, but also the time it takes for the data to reach the chip. For example, a typical data flow is: submission of new data to a software API, propagation through the driver and the PCIe bus to the accelerator card, execution of the ML model on the chip, and propagation of the resulting prediction back (through the PCIe bus and the driver) to the software API. We refer to this as the API latency because it reflects the true, end-to-end latency seen by the user.
The test setup is shown in the figure below. Xelera Silva, the software package for XGBoost, LightGBM and CatBoost inference acceleration, offloads the ML model execution to AMD hardware accelerator platforms and provides a high-performance C++ API to the user. The software loads models created with the default open-source versions of XGBoost, LightGBM or CatBoost. In the benchmark below, LightGBM models have been used (please contact us for more details, including XGBoost or CatBoost benchmarks).
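Because the accelerator software consumes models produced by the standard open-source frameworks, the only model-specific preparation is to persist the trained ensemble in its native format. A minimal sketch for LightGBM is shown below; how Xelera Silva then loads the file is not shown here, and the file name is an arbitrary example.

```python
import lightgbm as lgb

# Persist the trained ensemble (the `model` from the first sketch) in
# LightGBM's native text format.
model.save_model("lightgbm_model.txt")

# The same file can be re-loaded with the open-source library as well;
# software that accepts standard LightGBM models consumes a file like this.
restored = lgb.Booster(model_file="lightgbm_model.txt")
```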
The benchmark parameters are listed below. In typical use cases, more than one ML model is loaded into the accelerator and multiple models execute simultaneously on the same input vector.
The benchmark is performed with the hardware specified in the table above. We sweep over two of these parameters: the number of decision trees per model (model size) and the number of models executing concurrently. The API latencies are shown in microseconds in the table below.
All measurements include the overhead of transferring the input data and results through the driver and over the PCIe bus to and from the accelerator card. The results show a stable latency that remains invariant with the number of concurrent models on the accelerator and increases only slightly with larger models.
In electronic trading, latency is extremely important. At the same time, robust methods for short-term predictions of financial time series are of great value, and an increasing number of firms use machine learning algorithms to achieve this goal. XGBoost, LightGBM and CatBoost are popular frameworks in this area. However, low-latency machine learning is challenging because most algorithm implementations do not execute in the single-digit microsecond domain, which is required for many ML-based high-speed strategies to become useful. We show that single-digit microsecond latency can be achieved with a purpose-built accelerator architecture. Moreover, there are possibilities to reduce the latency further, as we discuss below.
There are several ways to further reduce the latency of Xelera Silva. First, the I/O interface (in this case the PCIe bus) plays a dominant role in the overall latency budget. Higher PCIe generations as well as cache-coherent CPU accelerator interfaces such as CXL (available on upcoming AMD Alveo accelerator boards) will have a favorable impact on end-to-end latency. Second, significant latency reduction is expected from an inline variant of Xelera Silva - one that eliminates PCIe transfer altogether and accesses the on-chip inference engine via the accelerator card's network ports.