Author Archives: Ash

Predicting Stock Prices with Machine Learning – Part 1 – Introduction

This post is the first in a series that will explore the forecasting of stock prices using machine learning (ML) methods (for a quick intro to ML see my previous post). If you’ve spent any reasonable amount of time with me then you’ll know that I tend not to talk too kindly of papers that attempt to forecast equity prices using machine learning. However, here we are simply using equity price time-series as a great example of a non-stationary process.

Price prediction or, more generally, models for generating alpha are not the best use of ML in the quantitative trading process. Far from it. At some point I’ll dedicate a post solely to describing the architecture of a quantitative trading system but for now let’s just say that portfolio optimisation, the mixture of prediction models and algorithmic execution are tasks better suited to ML.

All the same, it is possible to forecast long term price movement with ML. This series of posts will be based on work for a paper that I published earlier this year called Automated trading with performance weighted random forests and seasonality where I demonstrated the power of the online generation of ML models and suggested a novel and highly successful way to combine the predictions of multiple models.

Before we get to the nitty-gritty of combining model outputs, we first need to cover some housekeeping essentials. Initially, we’ll look at the input data and how we turn this into useful features for our model. Without data, we’re nothing so this step is arguably our most important. Next we’ll go on to look at how to measure the performance of our prediction systems, including a number of important and all-too-often forgotten metrics for understanding the long term success (or not) of our model. Finally we’ll get to the fun stuff and begin to train some ML models. We’ll start simple and add layers of complexity with associated justifications along the way.

I’ll leave it at that for now. Below is a list of the posts to come and I’ll hyperlink the items as I get them written:

  • Part 2 – Data and features
  • Part 3 – Performance Metrics
  • Part 4 – Standard Methods
  • Part 5 – Ensemble Methods
  • Part 6 – Incorporating “online” performance weighting
  • Part 7 – Summary

What’s in store…

Firstly, apologies. It’s been a long time since I last posted but it’s been a hectic few months. This summer saw me moving house, changing jobs, writing up my PhD thesis and getting burgled so it’s been an interesting one to say the least.

I can’t deny that getting the thesis submitted was a struggle. Consolidating four years of research and experimentation that spanned a veritable plethora of academic disciplines* into a single book that no one will read was a mind mangling task that required hard-to-muster motivation. Chaos aside, I came across a great number of things that I plan to post about in the coming weeks and that’s exactly the subject of this post.

The next series of posts will concern the use of machine learning methods for stock picking. I know, I know, you don’t have to tell me that’s a bad idea! You may have even read one of my previous posts that discouraged exactly this sort of tomfoolery. However, the objective of this work was not to build a Skynet type device that provides magical insight into the equity markets. As you will see, this research uses equity price data to explore the best ways to produce stable predictions in non-stationary time-series using only simple modifications to well-documented machine learning methodologies.

After exploring prediction of non-stationary processes we’ll get a little more application specific and explore the use of these techniques for forecasting price impact of large equity orders using depth-of-book data but more on this when the time comes. For now, thanks for bearing with me and I hope you enjoy what’s to come.

* Artificial intelligence, machine learning, mathematical finance, agent-based modelling and complexity theory were the major players.

What is machine learning? A simple introduction

Even amongst practitioners, there is no truly well accepted definition for machine learning. So, I’ll provide two:

  • Pioneer machine learning researcher Arthur Samuel defined machine learning as: “the field of study that gives computers the ability to learn without being explicitly programmed”. This definition is beautiful in its simplicity though lacks a little formality. So, with a little more structure...
  • Tom Mitchell states that “a computer program is said to learn from experience E, with respect to some task T, and some performance measure P, if its performance on T as measured by P improves with experience E”.

Let’s reinforce the definitions with an example. A classic practical application is the email spam filter. The email program watches which emails the user does or does not mark as spam and, based on that, learns how to better filter future spam automatically. In the parlance of Tom Mitchell’s definition, classifying the emails as spam or not spam is the task, T, watching the user label emails as spam or not spam is the experience, E, and the fraction of emails correctly classified could be the performance measure, P.

There are a great number of machine learning algorithms and, as such, they are often divided into three main types: supervised, unsupervised and reinforcement learning algorithms.

Supervised Learning – machine learning with labels

Before providing a definition, let’s start with an example. Imagine you want to predict the stopping distance for cars given the speed that the car is travelling. The graph below shows some data from the “cars” dataset in R.

Figure: distance vs speed

A supervised learning algorithm would allow us to use the data available to make a general rule for predicting distances at speeds that we have not yet witnessed. In our two-variable example, this is the well-known task of fitting a line to the data. Eyeballing the data, we could conceivably fit a linear model (red) or a polynomial model (blue) shown below.

Figures: distance vs speed with a linear fit; distance vs speed with a polynomial fit

Fitting these models is an example of a supervised learning algorithm. The term supervised learning refers to the fact that the algorithm requires a training dataset that contains the “right” answers. That is to say, in our example, for every datapoint on the car’s speed, we also had the corresponding data for the actual stopping distance.

The cars example is also a case of a regression problem, where we are predicting a continuous valued output (the distance).
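To make the line-fitting step concrete, here is a minimal sketch in plain Python that fits the linear model by ordinary least squares. The speed and distance values are made-up toy numbers for illustration, not the real “cars” data, and the names fit_line and predict are my own. The polynomial model would be fitted in the same spirit, just with extra terms.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy training data (illustrative values only, NOT the real "cars" dataset).
# Each speed comes with the "right" answer: the observed stopping distance.
speeds = [4, 8, 12, 16, 20, 24]
distances = [5, 17, 25, 40, 52, 80]

intercept, slope = fit_line(speeds, distances)


def predict(speed):
    """Predict a stopping distance for a speed we have not yet witnessed."""
    return intercept + slope * speed


print(f"fitted model: distance = {intercept:.2f} + {slope:.2f} * speed")
print(f"predicted distance at speed 18: {predict(18):.1f}")
```

The labelled pairs (speed, distance) are what make this supervised: the fit is driven entirely by the gap between predicted and observed distances.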

Another type of supervised learning task is classification. Again, let’s set the scene with an example.


The figure above shows data from the well-known iris dataset. It shows a scatter plot of the sepal length vs. petal length of a number of iris plants. The points are coloured by species. Here, the machine learning task is to predict the species given new petal and sepal measurements. Which species would you label the new data-point in black? What makes this a classification task is that the variable to be predicted (species) is discrete valued.
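One simple way to answer that question in code is a k-nearest-neighbours vote: label the new point with the most common species among its closest labelled neighbours. The sketch below is only an illustration. The measurements are invented iris-like values rather than rows from the real dataset, and k-nearest-neighbours is my choice of classifier here, one of many that would do.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` with the majority class among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# (sepal length, petal length) -> species. Illustrative values only.
train = [
    ((5.0, 1.4), "setosa"),
    ((4.8, 1.6), "setosa"),
    ((5.2, 1.5), "setosa"),
    ((6.0, 4.5), "versicolor"),
    ((5.7, 4.2), "versicolor"),
    ((6.3, 4.7), "versicolor"),
    ((6.5, 5.8), "virginica"),
    ((7.2, 6.0), "virginica"),
    ((6.9, 5.7), "virginica"),
]

new_point = (6.1, 4.4)  # stands in for the black data-point
print(knn_predict(train, new_point))
```

Because the output is one of a fixed set of species rather than a number on a continuous scale, this is classification rather than regression.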

Unsupervised Learning – machine learning without labels

With unsupervised learning, the data contain no labels and the machine learning algorithm is tasked with finding structure in the data.

One very common type of unsupervised learning is known as clustering and is used for the categorisation of stories on Google News. Each day, Google’s algorithms crawl the web for news stories and use clustering algorithms to group similar stories together. Other pertinent examples of clustering include: organising computer clusters, social network analysis and market segmentation.
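As a rough sketch of what a clustering algorithm does, here is k-means in plain Python. I am not claiming this is what Google uses; k-means is simply one of the best-known clustering methods. The 2-D points are made-up stand-ins for whatever features describe each item, and the starting centroids are passed in by hand to keep the example deterministic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Unlabelled toy data: no "right answers" are supplied anywhere.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Note that the algorithm is never told which group a point belongs to; the structure emerges from the distances alone.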

There are a number of other unsupervised learning algorithms and indeed a number of other types of machine learning that we have not touched upon in this post. If you’ve found this page interesting and have been inspired to learn more, I recommend the following books:

Also, for some great advice on the practical application of machine learning methods as well as a detailed derivation of some common algorithms, I strongly recommend Andrew Ng’s Coursera course.

Streamlining your LaTeX workflow with latexmk

It’s been a *long* time since I last wrote a post and that’s mainly because I’ve been frantically writing my thesis.

For scientific writing I’m a big fan of LaTeX for a number of reasons: dealing with and formatting mathematical notation is a joy, the implicit handling of intra-document references and bibliography makes life a lot easier and the separation of content and formatting helps me better focus on my writing.

That said, I do like to preview changes that I make and I’ve always found the workflow a little clunky. Every change means saving the tex file, heading to the terminal and entering:

pdflatex mytexfile
bibtex mytexfile
pdflatex mytexfile
pdflatex mytexfile

Only then can I open up a PDF viewer to take a look at the result. It’s tedious but fear not, I’ve since grown tired and (like a good little computer scientist) managed to automate the process using a tool called latexmk to dramatically improve my LaTeX workflow.


Effectively what latexmk does is watch your tex source file for changes and then run whatever it needs to in order to update your PDF automatically. Now, I simply open up a tex file and start latexmk in a Terminal window. Each time I save the source file, latexmk automatically runs in the background and opens (or simply updates) the PDF. This way, I never have to leave my tex editor nor manually run latex at all. I just save the file. On Mac OS X, I can even scroll through the PDF without removing focus from the tex editor.

Getting set up

I’ve explained how I set things up below but beware that this is specific to Mac OS X.

  • Download and install latexmk (but be aware it may have been included with your TeX distribution – run ‘which latexmk’ in the terminal to check).
  • Next, create a latexmk config file in your home directory, ~/.latexmkrc, and add the following lines:
    $pdf_previewer = "open -a /Applications/";
    $clean_ext = "paux lox pdfsync out";

    Obviously, you can use any PDF viewer but I strongly recommend Skim.

  • If, like me, you mostly use latexmk with pdf files, you can add the following to your ~/.bash_profile:
    alias latexmk=' -pdf -pvc'

    You can always run with other options if you need to.

Once you’re done, just ‘cd’ to the directory containing your latex source file, and run “latexmk myfile.tex”. Now you just leave it running while you work and it’ll take care of things automatically! Job done.


Limit Order Books – An introduction

This is an extract from a draft version of my PhD thesis:

“…For many years, the majority of the world’s financial markets have been driven by a style of auction, very similar to the basic process of haggling, known as the continuous double auction (CDA). In a CDA a seller may announce an offer or accept a bid at any time and a buyer may announce a bid or accept an offer at any time. This continuous and asynchronous process does away with any need for a centralised auctioneer, but does need a system for recording bids and offers and clearing trades. In modern financial markets, this function is performed by a uniform trading protocol known as the limit order book (LOB), whose universal adoption was a major factor in the transformation of financial exchanges.

The most common type of order submitted to a LOB is the limit order – an instruction to buy or sell a given quantity of an asset that specifies a limit (worst acceptable) price which cannot be surpassed. Upon receiving a limit order, the exchange’s matching engine compares the order’s price and quantity with opposing orders from the book. If there is a book order that matches the incoming order then a trade is executed. The new order is termed aggressive (or marketable) because it initiated the trade, while the existing order from the book is deemed passive. If, on the other hand, there are no matches for the incoming order it is placed in the book along with the other unmatched orders, waiting for an opposing aggressive order to arrive (or until it is cancelled). A visualisation of the structure and mechanism of a LOB is given in Figure 1.1.

Figure 1.1: An illustration of LOB structure and dynamics.

The details of order matching vary across exchanges and asset classes. However, most modern equity markets operate using a price-time priority protocol. That is, the lowest offers and highest bids are considered first, while orders of the same price are differentiated by the time they arrive (with priority given to orders that arrive first). Thus, limit orders with identical prices form first-in first-out (FIFO) queues.

Most LOB-driven exchanges offer many more order types than the simple limit order. Another particularly common order type is the market order, which ensures a trade executes immediately at the best available price for a given quantity. As a result, market orders demand liquidity and risk uncertainty. Many more order types are available that allow control over whether an order may be partially filled, when an order should become active and how visible the order is. Such order types include: conditional orders, hybrid orders, iceberg orders, stop orders and pegged orders, but the intricacies of these order types are beyond the scope of this report.

Traders interact with a screen-based LOB that summarises all of the “live” (outstanding) bids and offers that have not yet been cancelled or matched. The LOB has two sides: the ask book and the bid book. The ask book contains the prices of all outstanding asks, along with the quantity available at each price level, in ascending order. The bid book, on the other hand, shows the corresponding information for bids but in descending order; this way traders see the “best” prices at the top of both books. A simplified example of what a trader may see when looking at a LOB is given below.


A simplified example of how a trader would view the LOB shown in Figure 1.1.

The amount of information available about the LOB at any given time depends on the needs and resources of the traders. Usually the only information that is publicly available (in real time) is the last traded price or the mid-price (the point midway between the current best bid and best ask). Professional traders may choose to subscribe to receive information on the price and size for the best prices, along with the price and size of the last recorded transaction, of an asset of interest; this is known as “level 1” market data. The most detailed feed, “level 2” or “market depth” data, includes the complete contents of the book (except for certain types of hidden orders) but this comes at a premium. For individual subscribers, the current cost of receiving real time level 2 data for equities from just the New York Stock Exchange (NYSE) is $5000/month.

At first glance, the rules of limit order trading seem simple but trading in a LOB is a highly complex optimisation problem. Traders may submit buy and/or sell orders at different times, prices, quantities and – in today’s highly fragmented markets – often to multiple order books. Orders may also be modified or cancelled at any time. The complexity of LOB strategies presents significant challenges for those attempting to model, understand and predict behaviours. Nonetheless, the well-defined framework and the vast volumes of data generated by the use of LOBs present an exciting and valuable opportunity for computational modelling…”
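The price-time priority matching described in the extract can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not how any real exchange implements its matching engine: the class and method names are hypothetical, only plain limit orders are supported, cancellations are left out, and trades are assumed to execute at the passive order’s price.

```python
import heapq
from itertools import count


class LimitOrderBook:
    """Toy limit order book with price-time priority matching."""

    def __init__(self):
        self._bids = []  # heap of (-price, arrival, [qty]): highest bid first
        self._asks = []  # heap of (price, arrival, [qty]): lowest ask first
        self._arrival = count()  # equal prices are ordered by arrival (FIFO)

    def submit(self, side, price, qty):
        """Submit a limit order; return the (price, qty) trades it triggers."""
        is_buy = side == "buy"
        own, opposite = (self._bids, self._asks) if is_buy else (self._asks, self._bids)
        trades = []
        while qty > 0 and opposite:
            key, _, resting = opposite[0]
            best_price = key if is_buy else -key
            # Stop once the best opposing price is worse than our limit.
            if (best_price > price) if is_buy else (best_price < price):
                break
            fill = min(qty, resting[0])
            trades.append((best_price, fill))  # executes at the passive price
            qty -= fill
            resting[0] -= fill
            if resting[0] == 0:
                heapq.heappop(opposite)
        if qty > 0:
            # Whatever is left rests in the book as a passive order.
            heapq.heappush(own, (-price if is_buy else price, next(self._arrival), [qty]))
        return trades

    def best_bid(self):
        return -self._bids[0][0] if self._bids else None

    def best_ask(self):
        return self._asks[0][0] if self._asks else None


book = LimitOrderBook()
book.submit("sell", 101.0, 50)          # rests in the ask book
book.submit("sell", 100.0, 30)          # better price: jumps ahead of 101.0
book.submit("sell", 100.0, 20)          # same price: queues behind the 30 lot
book.submit("buy", 99.0, 40)            # no match: rests in the bid book
trades = book.submit("buy", 100.0, 40)  # aggressive: crosses the best asks
print(trades)
print(book.best_bid(), book.best_ask())
```

The aggressive buy first fills against the earlier of the two orders resting at 100.0 and only then touches the later one, which is the FIFO behaviour at a single price level. The best bid and best ask that remain are what a level 1 feed would report.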

What happens inside a quantitative hedge fund?

Those of us working in academia have a tendency to try and do everything ourselves, often reinventing the wheel in the process. That said, due to the inherent secrecy of the quantitative hedge fund industry, academic researchers in automated trading often do need to do everything themselves. That is: modelling, algorithm design, optimisation, programming, backtesting and simulation.

But how does it work in a quantitative hedge fund? Well, from my (somewhat limited) experience with such funds I’ve found the following general structure to be fairly common.

At the beginning of the investment process you have the quant strategists/researchers. These guys and girls tend to have PhDs in maths or physics and have a very strong understanding of probability theory and statistics. Their job is to generate models and ideas that systematically capture investment opportunities and to develop algorithms based on these ideas. These algorithms are usually backtested by the quants and/or a research team.

Once an algorithm is deemed viable, it is passed on to the quant developers. These developers are highly skilled (often PhD) programmers who focus on optimising the code for speed and robustness before passing it on to the programmers. The programmers’ job lies in interfacing the code with the trading platform and connectivity with the exchanges.

Next in line are the traders who work on unleashing the algorithms into the market at specific times dependent on prevailing market conditions and instruction from the portfolio manager(s). In quantitative funds this can be quite a low touch job with traders simply launching and babysitting algorithms. However, there are often hundreds of algorithms running at once with traders in charge of allocating capital between them (according to some framework laid out by the portfolio managers).

Naturally, this tends to be more of a circular process. Given that the traders are in direct contact with the markets, they will often come up with trading ideas and will talk with the quants who will explore, refine and generate algorithms from the ideas as the process starts again.

So you want to predict the stock market…

At least once a week, a Google Scholar Alert pops into my inbox featuring another bored academic who thought they’d try using their “expertise” to “predict the stock market”. Said paper tends to follow this structure:

  • Starts with a paragraph about how humans are obsolete and that the market is being run by computers.
  • Next comes the ‘novel’ machine learning technique, which is really just a well known algo. (SVM, neural net., logistic regression) with an adaptive learning rate.
  • Some (usually 5) simple technical analysis indicators are used on unprocessed daily price data to create some features for the new super-algo.
  • Said algorithm is applied to predict whether price(t+1) > price(t) or vice versa.
  • Results report prediction accuracy of ~65% (And we know they tried 100 different stocks before settling on the 3 that they reported results for).
  • There are usually no out of sample results at all!

Some will stop here, reporting that they’ve cracked it and that their technique beats all others. If we’re lucky, however, they’ll go on to assume that they can trade at the close price of each day, buying when they predict upward movement and selling before a fall with a constant size strategy, reporting annual returns of 40%.


Putting aside the lack of out of sample results and non-existent estimation of transaction costs, they’re really missing the point. Attempting to predict the price return over the next 24 hours is not the best way to go about using machine learning to create investment opportunities.

Predicting the direction of price movement is certainly an essential part of automated trading models. But it’s not the only part. A simple automated trading model is better based around event detection: bull, bear or neutral markets, insider trading or institutional trading. Detecting such events can easily be approached with either binary classification or hypothesis testing.

In binary classification, a feature matrix, X of size (N*K), is used to predict a binary vector, Y of size N. In this case we would interpret each row as representing one day with K features that we hope relate in some way to the occurrence of the market event that we are trying to predict, Y. Once in this format, binary classification is a simple matter of applying one of the many machine learning algorithms to map each row, Xi to Yi. If our out of sample predictions are consistent, we have a means of quickly identifying regimes and adjusting our trading strategy accordingly. However, given that this kind of binary classification is a standard machine learning problem, it suffers all the pitfalls of overfitting.
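As a rough sketch of that setup, the snippet below maps each row Xi to a label Yi with logistic regression trained by gradient descent, then checks the predictions on rows held back from training. Everything here is an illustrative assumption: the two features and their values are invented toy numbers, and logistic regression is just one convenient choice among the many algorithms that could be used.

```python
import math


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted prob. minus label
            grad[0] += err
            for j, xj in enumerate(row, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, row):
    """Return 1 if the model thinks the event occurs on this day, else 0."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
    return 1 if z > 0 else 0


# Toy data: each row is one day with K = 2 made-up features; y marks the event.
X_train = [(-1.0, -0.5), (-0.8, 0.1), (-0.3, -0.9), (-0.6, -0.2),
           (0.9, 0.4), (0.5, 0.7), (0.2, 0.9), (0.7, -0.1)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Held-out days that the model never sees during training.
X_test = [(-0.9, -0.3), (0.8, 0.6), (-0.4, -0.4), (0.6, 0.3)]
y_test = [0, 1, 0, 1]

w = train_logistic(X_train, y_train)
hits = sum(predict(w, row) == label for row, label in zip(X_test, y_test))
print(f"out-of-sample accuracy: {hits / len(y_test):.2f}")
```

Keeping the test rows out of the fit is the important part: in-sample accuracy says very little about whether the model has overfitted.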

For hypothesis testing, we make an assumption about the distribution of an observable variable before applying a test for low probability events. As an example, let’s say we’re into high-frequency volatility trading and we’re looking for unusual activity in a stock. We can describe the number of orders arriving per second with a Poisson distribution, X, and the individual order volume * price with a Gamma distribution, Y. Now we can fit X and Y through maximum likelihood estimation and calculate the probability, p, of the market acting “as normal” each second. Taking p<0.05 as a rare event, if we see more than 5 such events in a 100 second period we might be tempted to make a move. It’s important to bear in mind, however, that hypothesis testing assumes structure in data and thus requires a stationary distribution for decent results.
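Here is a minimal sketch of the order-arrival half of that example. It fits the Poisson rate by maximum likelihood (which is simply the sample mean), scores each second by the probability of seeing at least that many orders, and raises a flag when more than 5 seconds in a 100 second window fall below p = 0.05. The order counts are invented, the function names are my own, and I have left out the Gamma model for order value to keep things short.

```python
import math


def fit_poisson_rate(counts):
    """The maximum likelihood estimate of a Poisson rate is the sample mean."""
    return sum(counts) / len(counts)


def poisson_upper_tail(k, rate):
    """P(X >= k) for X ~ Poisson(rate)."""
    below = sum(math.exp(-rate) * rate ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def unusual_activity(window, rate, alpha=0.05, max_rare=5):
    """True if more than `max_rare` seconds in the window look like rare events."""
    rare = sum(1 for k in window if poisson_upper_tail(k, rate) < alpha)
    return rare > max_rare


# Orders per second during a "normal" calibration period (toy numbers).
history = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
rate = fit_poisson_rate(history)

quiet = [2] * 100            # 100 seconds of business as usual
busy = [2] * 94 + [7] * 6    # six seconds with a burst of orders

print(rate, unusual_activity(quiet, rate), unusual_activity(busy, rate))
```

The fitted rate is only meaningful while the arrival process stays stationary, so in practice it would need re-estimating as market conditions change.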

These kinds of techniques produce far more stable predictions than forecasting daily returns and provide us with information upon which it is much easier to trade.

The difference between automated, algorithmic and high-frequency trading

On telling people that I work on automated and algorithmic trading systems a common response tends to be:

“Oh, that high-frequency trading stuff?”

And I guess that’s because of the current (mostly negative) hype surrounding high-frequency trading. A truthful answer to their question is “sometimes but not always”. In order to explain, let me quickly review the process of trading (for the sake of brevity I will stick to the buy-side’s point of view).

The diagram above shows the processing of an order from investment decision down to the execution of a trade. In this example, the process begins with an analyst’s idea that leads to a decision to trade. The idea is assessed and approved by the portfolio manager who generates an instruction to buy or sell a certain quantity of asset ‘abc’. The order is then passed on to a trader whose job it is to decide the best approach/broker to use.

Direct-Market access (DMA) represents the client’s ability to access the broker’s order routing infrastructure, allowing them to issue their orders almost directly to the exchanges. Sponsored access is a step up from DMA: these tend to be ultra low latency, direct connections to the market. In this situation, the client uses their own infrastructure but with the broker’s trading identifier. Sponsored access is (predominantly) used by clients exploiting high-frequency trading strategies.

Algorithmic trading refers to trade execution strategies that are typically used by fund managers to buy or sell large amounts of assets. They aim to minimise the cost of these transactions under certain risk and timing constraints. Such systems follow preset rules in determining how to execute each order. People often think of these systems as buying low and selling high – making investment decisions – but this is not the case.  Algorithmic trading systems are offered by many brokers and simply execute the orders that they are given. Their job is to get a good price (as compared to various benchmarks) and minimise the impact of trading. This is done by slicing orders and dynamically reacting to market events.
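To give a flavour of order slicing, the sketch below splits a large parent order into near-equal child orders, one per time interval, in the style of a simple time-weighted (TWAP) schedule. This is the most basic scheme I could pick for illustration; real execution algorithms also react dynamically to market events, which this toy function does not attempt. The function name is hypothetical.

```python
def twap_slices(total_qty, n_slices):
    """Split a parent order into n child orders of near-equal size."""
    base, remainder = divmod(total_qty, n_slices)
    # Spread the remainder over the first few slices so the sizes sum exactly.
    return [base + (1 if i < remainder else 0) for i in range(n_slices)]


# Work a 10,000 share order in six child orders, e.g. one every ten minutes.
slices = twap_slices(10_000, 6)
print(slices, sum(slices))
```

Note that the function only decides how to execute an order it has been given. Nothing in it decides whether to buy or sell, which is exactly the distinction being drawn above.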

Of course there are algorithms that deal with investment decision making and this is where automated trading comes in.

Automated trading, often confused with algorithmic trading, is the complete automation of the quantitative trading process. Thus, automated trading must encapsulate: quantitative modelling and indicator tracking to determine trade initiation and closeout; monitoring of portfolio risk; as well as algorithmic trading. This type of trading tends to be done by quantitative hedge funds that use proprietary execution algorithms and trade via DMA or sponsored access.

High-frequency trading (HFT) is a subset of automated trading. Here, opportunities are sought and taken advantage of on very small timescales from milliseconds up to hours. Some high-frequency strategies adopt a market maker type role, attempting to keep a relatively neutral position and providing liquidity (most of the time) while taking advantage of any price discrepancies. Other strategies invoke methods from time series analysis, machine learning and artificial intelligence to predict movements and isolate trends among the masses of data. Specifics of the strategy aside, for HFT, monitoring the overall inventory risk and incorporating this information into pricing/trading decisions is always vital.

See, it’s simple!!

Just remember, algorithmic trading should have been called “algorithmic execution”; automated trading does what it says on the tin; and HFT is a specific type of ultra-fast automated trading.