I’m going to take you on an optimization journey, showing how to make a web crawler, image recognition, and a machine learning model fast. For this, we need a hypothetical problem.
The challenge
There’s a website where tickets to see your favourite artist are released daily at 3 pm. But every time you try, you’re too slow. You need to be first; how fast could the process be if you fully automated it?
You refresh the web page at precisely 3 pm, click through a page to select the date, another page to choose your seat at the concert, and finally fill in a captcha to submit your booking. Each page must be visited; you cannot skip to the end of the process.
Analysis
To improve our chances of getting a ticket, we need to understand where the time goes.
- Network latency – the server is hosted in the US and we’re in London, so a round trip to the server costs about 100ms: 100ms x 4 requests = 400ms.
- Page load – time to fully load a single page in Chrome (DOM ready): 80ms x 3 pages = 240ms (the last page’s load isn’t time-sensitive, as we’ve already submitted the captcha).
- Protocols – DNS lookup, TCP 3-way handshake, and SSL: 350ms.
- Entering the captcha – manually typing 8 characters: 2,500ms.
- Submitting each page – clicking the buttons on a page: 1,000ms x 3 pages = 3,000ms.
- Total time: ~6,500ms.
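As a quick sanity check on that total, here is the same budget added up. Nothing here is measured; these are just the estimates from the list above.

```typescript
// Rough time budget for one manual booking attempt, using the estimates above.
const networkMs   = 100 * 4;    // 4 round trips at 100ms each
const pageLoadMs  = 80 * 3;     // DOM-ready for the 3 time-sensitive pages
const protocolsMs = 350;        // DNS lookup + TCP handshake + SSL
const captchaMs   = 2_500;      // manually typing 8 characters
const clickingMs  = 1_000 * 3;  // clicking through 3 pages by hand

const totalMs = networkMs + pageLoadMs + protocolsMs + captchaMs + clickingMs;
console.log(`Total: ~${totalMs}ms`);  // 6,490ms, i.e. roughly 6,500ms
```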
The initial optimization – the low-hanging fruit.
The best way to improve our chances is to focus on the largest chunk of time that takes the least effort to reduce. For us, that is the human interaction: clicking the buttons on each web page.
Submitting the form – from 3,000ms to 60ms
Optimizing the click-through of the web pages is an easy place to start. With minimal scripting via an extension like Greasemonkey, we can automate moving through the pages, cutting each page interaction from 1,000ms to around 20ms. There is an additional – but hard-to-measure – benefit of scripting: the page can be refreshed at exactly the moment the tickets are released, rather than manually hitting the refresh button at exactly 3 pm (well, 3 pm minus half the round-trip latency of 50ms).
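Here’s a minimal sketch of what such a userscript might look like. The site URL, the element selectors, and the 10ms polling interval are all assumptions; a real script would need selectors reverse-engineered from the actual booking pages.

```typescript
// ==UserScript==
// @name   auto-booker (sketch)
// @match  https://tickets.example.com/*
// ==/UserScript==

// Click an element as soon as it appears, polling every 10ms.
function clickWhenReady(selector: string): void {
  const el = document.querySelector<HTMLElement>(selector);
  if (el) {
    el.click();
  } else {
    setTimeout(() => clickWhenReady(selector), 10);
  }
}

// Reload at 3 pm minus half the round-trip latency (~50ms), so the request
// reaches the server just as the tickets are released.
const release = new Date();
release.setHours(15, 0, 0, 0);
const fireAt = release.getTime() - 50;
if (Date.now() < fireAt) {
  setTimeout(() => location.reload(), fireAt - Date.now());
}

// Click straight through the date and seat pages (hypothetical selectors);
// the captcha page still needs a human at this stage.
clickWhenReady('#date-picker .first-available');
clickWhenReady('#seat-map .best-available');
clickWhenReady('button[type=submit]');
```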
Protocols DNS/SSL/TCP 3-way Handshake – 350ms to 0ms
Although protocols are a small proportion of the overall time at around 350ms, reducing their impact is straightforward. In the browser, we can cache the DNS lookup. We can also keep the connection open so the TCP 3-way handshake and SSL negotiation don’t have to be repeated. A connection stays open for about 60 seconds, so making a request 30 seconds before release keeps it alive, eliminating 350ms of potential latency.
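A sketch of that warm-up request follows; the URL is hypothetical. Fired roughly 30 seconds before release, it forces the DNS lookup, TCP handshake, and SSL negotiation to happen ahead of time, and the browser reuses the still-open connection at 3 pm.

```typescript
// Warm the connection ~30s before release so DNS, the TCP handshake, and
// the SSL negotiation are already done when the real request goes out.
// The browser keeps the idle connection pooled, so the 3 pm request reuses it.
const WARM_UP_URL = 'https://tickets.example.com/';  // hypothetical

const release = new Date();
release.setHours(15, 0, 0, 0);
const warmUpAt = release.getTime() - 30_000;

setTimeout(() => {
  fetch(WARM_UP_URL, { method: 'HEAD' })
    .catch(() => { /* the response doesn’t matter, only the open socket */ });
}, Math.max(0, warmUpAt - Date.now()));
```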
Initial optimization run.
We’ve saved 3,290ms – roughly three seconds: 2,940ms from automating the page clicks and 350ms from the protocols. A decent result for a general optimization pass, but we can go much faster.
Second attempt – Decoding the image
The ticket booking process requires filling in a captcha to secure your ticket. This is the next focus for optimization, as it’s the next largest block of time.
Above is the captcha – an 8-character randomly generated string of alphanumeric characters placed on a static background. There is only one way to decode a captcha rapidly and consistently: Teach a machine to do it for you.
Here’s the process we’ll follow to do that: take the captcha, separate the letters from the background, split the image into individual letters, and send each letter to a neural net to make a prediction.
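Here’s a sketch of the “split apart the letters” step. It assumes the glyphs are darker than the static background and that the eight characters never touch – both assumptions about this particular captcha – and uses a simple column projection over the pixel data of a canvas-drawn image:

```typescript
// Split a captcha into per-character images using a column projection.
// Assumes dark glyphs on a lighter background and characters that never
// touch; the brightness threshold of 100 is an assumption.
function segmentCaptcha(img: ImageData): ImageData[] {
  const { width, height, data } = img;

  // 1. Binarise: mark a pixel as "ink" if it’s darker than the threshold.
  const ink = new Uint8Array(width * height);
  for (let i = 0; i < width * height; i++) {
    const r = data[i * 4], g = data[i * 4 + 1], b = data[i * 4 + 2];
    ink[i] = (r + g + b) / 3 < 100 ? 1 : 0;
  }

  // 2. Column projection: count ink pixels in each column.
  const cols = new Array<number>(width).fill(0);
  for (let x = 0; x < width; x++) {
    for (let y = 0; y < height; y++) cols[x] += ink[y * width + x];
  }

  // 3. Split on empty columns: each run of inked columns is one letter.
  const letters: ImageData[] = [];
  let start = -1;
  for (let x = 0; x <= width; x++) {
    const hasInk = x < width && cols[x] > 0;
    if (hasInk && start < 0) start = x;
    if (!hasInk && start >= 0) {
      const w = x - start;
      const slice = new ImageData(w, height);
      for (let y = 0; y < height; y++) {
        for (let dx = 0; dx < w; dx++) {
          for (let c = 0; c < 4; c++) {
            slice.data[(y * w + dx) * 4 + c] =
              data[(y * width + start + dx) * 4 + c];
          }
        }
      }
      letters.push(slice);
      start = -1;
    }
  }
  return letters; // ideally 8 slices, one per character, ready for the model
}
```

Each slice would then be resized to a fixed input size and fed to the neural net, one prediction per character.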
Training data
To train a machine learning model on this captcha, you’ll need around 50,000 labelled images. Whilst you can get plenty of captchas from the ticket website, the process won’t scale, because every image needs to be labelled – that is, renaming the file from download.png to the characters it contains, e.g. 8ah3muxe.png. To manually label it would