Deploying a scikit-learn classifier to production

Scikit-learn is a great Python library for all sorts of machine learning algorithms, and the model development side is really well documented. But once you have a trained classifier and are ready to run it in production, how do you go about it? There are a few managed services that will do it for you, but for my situation these weren’t a good fit. We just wanted to deploy the model onto a modestly sized DigitalOcean instance, as a REST API that can be queried externally. [Read More]

Finding related searches with Spark

Unsupervised learning from user behaviour: when a user navigates a site, they leave a valuable trail of information - what their first search was, what they followed this search with, and so on. Using this data we can learn related searches automatically by co-occurrence counting. This post takes you through the steps to get from raw search logs to results using the Spark cluster computing framework. Spark provides a natural way to express processing over flows of data, and can be scaled out to a cluster when data growth dictates. [Read More]
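
As a rough illustration of co-occurrence counting, here is a minimal sketch in plain Python rather than Spark. The function name `related_searches`, the `top_n` parameter and the sample sessions are all made up for this example and are not taken from the post. In Spark the same idea could be expressed by mapping each session to (search, next search) pairs and summing the counts per pair, for example with `reduceByKey`.

```python
from collections import Counter, defaultdict


def related_searches(sessions, top_n=3):
    """Rank, for each search, the searches that most often directly follow it."""
    pair_counts = Counter()
    for queries in sessions:
        # Pair each search with the one the same user ran next.
        for first, following in zip(queries, queries[1:]):
            if first != following:  # ignore immediate repeats of the same search
                pair_counts[(first, following)] += 1

    related = defaultdict(list)
    # most_common() yields pairs in descending count order,
    # so each per-search list is built already ranked.
    for (first, following), count in pair_counts.most_common():
        if len(related[first]) < top_n:
            related[first].append((following, count))
    return dict(related)


# Hypothetical sessions: each list is one user's searches, in order.
sessions = [
    ["paris hotels", "paris flights"],
    ["paris hotels", "paris flights", "paris weather"],
    ["paris hotels", "paris weather"],
]
result = related_searches(sessions)
```

With these sample sessions, "paris flights" follows "paris hotels" twice and "paris weather" follows it once, so "paris flights" ranks first among the related searches for "paris hotels".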

A duplicate classifier using Elasticsearch

Content duplication is a frequent obstacle to having a clean data source in search-based, big data projects. Here’s an approach to improving data quality by training a classifier to spot duplicates. The Problem: the data set has about 470,000 non-unique hotel descriptions (e.g. name, metadata, images) provided by 11 bed bank services. These arrive in various feed formats and include hotel name, resort name, address, country, description and image data, among other fields. [Read More]