In search-based, big-data projects, content duplication is frequently an obstacle to having a clean data source. Here’s an approach to improving data quality by training a classifier to spot duplicates.
The Problem
The data set has about 470,000 non-unique hotel descriptions (e.g. name, metadata, images) provided by 11 bed bank services. These are received in a variety of feed formats, and include hotel name, resort name, address, country, description and image data, among other fields.
Across different bed banks the hotel descriptions are duplicated (but not identically), as a given hotel may be sold/booked through multiple providers, each of which has entered the details for the hotel independently. They are also often duplicated within a single bed bank, for more nefarious purposes – to ensure a hotel appears as a separate result where it would otherwise be merged with competitors. Because of this there is a need to identify the existing duplicates in the data, and further to match any new hotels as they arrive so that new duplicates are correctly identified.
There is no guarantee the details of the hotels match exactly – hotel names often vary in whether they include “Hotel” or a brand (e.g. “Sheraton”), resorts are often called different things, and addresses vary in level of detail and spelling.
As an example of a duplicate:
field | hotel 1 | hotel 2 |
---|---|---|
hotel_name | NH Express San Pedro | NH San Pedro |
resort_name | Malaga | San Pedro de Alcantara |
address | Jerez, 1 – 29670, San Pedro, Costa Del Sol, 29670, Mainland Spain | Jerez, 1, Marbella, Costa del Sol, 29670 |
country | ES | ES |
To the human eye these are obviously the same hotel. But the only field that matches exactly is the country. The hotel name almost matches, the resort name not at all, and the address matches in some portions but not in others. Some parts of the match are more significant than others – the address containing “Costa Del Sol” is less significant than it containing the postcode “29670”. There are hundreds of hotels in Costa Del Sol, but far fewer with this postcode, so it would be useful if the matching gave more significance to the postcode.
Some fields are also more important than others when they are similar – you can be fairly confident of a match if the hotel name matches, but there is less significance in resort_name matching (or not matching, as in this case).
Given hotel 1, the challenge is to get a computer to pick hotel 2 out of the full database of 470,000 hotel records, and to reliably not pick similar hotels in the same resort, or similar addresses that are not the same hotel (a false positive).
A database would be wholly inappropriate for matching like this – databases work on exact values, or at best prefixes. The obvious tool to store the hotel details is a search engine – my favourite elasticsearch. :-)
The Algorithm
The matching algorithm I used queries the search index with these 4 incoming fields, each weighted in a bool query. With IDF (inverse document frequency) term weighting, the search engine takes care of assigning significance to particular terms within a field (so a rare postcode contributes significantly to the score, while a common term like ‘Hotel’ contributes very little).
From the search results (of which there can be many) we pick the highest scorer (closest match), and if it is over a certain threshold the hotels are categorised as duplicates and the association is recorded in join tables.
So this gives 5 parameters to set: boosts for hotel_name, resort_name, address and country, and a threshold cutoff for the score to qualify as a duplicate.
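For concreteness, here is a minimal sketch of what the matching query might look like, assuming the elasticsearch-py client and an index called "hotels" holding the four fields; the index name, boosts and threshold here are illustrative, not the production values.

```python
# Sketch of the duplicate-matching query, assuming elasticsearch-py and an index
# named "hotels". The four boosts and min_score are the 5 trained parameters.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def find_duplicate(hotel, boosts, min_score):
    """Return the best-scoring candidate if it clears the threshold, else None."""
    query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": hotel[field], "boost": boosts[field]}}}
                    for field in ("hotel_name", "resort_name", "address", "country")
                ]
            }
        },
        "size": 1,  # only the highest scorer matters
    }
    hits = es.search(index="hotels", body=query)["hits"]["hits"]
    if hits and hits[0]["_score"] >= min_score:
        return hits[0]  # classified as a duplicate of the incoming record
    return None
```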
Training
A set of data manually matched by humans formed the training data – this was provided by the previous provider. This training data is assumed to be of high quality.
The training data was split into two sets – one for training and one for validation. The validation set is only used to evaluate final accuracy – it is held back, and the system is never trained on it directly, so it remains independent. This avoids overfitting to the training data, or overestimating how well the algorithm is doing.
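A minimal sketch of the hold-out split, assuming the matched pairs are available as a simple list; the 80/20 ratio is my assumption rather than the split actually used.

```python
import random

matched_pairs = [...]                  # the manually matched records (placeholder)
random.seed(42)                        # make the split reproducible
random.shuffle(matched_pairs)
cut = int(len(matched_pairs) * 0.8)    # assumed 80/20 split
training_set = matched_pairs[:cut]     # used to tune the 5 parameters
validation_set = matched_pairs[cut:]   # held back for the final accuracy figures
```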
Setting the 5 parameters by hand is likely to be a fairly tedious process – it is a five-dimensional problem – and may not necessarily yield the best results. Plus any changes in the algorithm or bug fixes may require retraining, which would then mean hand tuning all over again. The problem needs reformulating as minimising a function, so that it becomes a well-understood numerical analysis problem. Many optimisation algorithms require the partial derivatives of the function in order to do gradient descent; this is not an option here, since the score comes out of the search engine and cannot practically be differentiated, so the downhill simplex (Nelder–Mead) algorithm was picked, as it doesn’t require derivatives. scipy (the Python scientific library) provides a good implementation as scipy.optimize.fmin for optimising n-dimensional problems.
The last requirement is a scoring function to evaluate how good a set of parameters is on the training data. We run over the training data and count the true positives (correct duplicates), false positives (pairs marked as duplicates that weren’t), true negatives (correctly identified non-duplicates) and false negatives (duplicates that were missed). With this confusion matrix there are a few ways to rate how good the results are – I played with precision, recall, accuracy (the simple fraction of correct classifications) and F scores (a more representative way of combining precision and recall). F scores are often the best way of evaluating a classification problem – training on precision or recall alone tends to bias the classifier towards one extreme of false positives or false negatives, i.e. overly cautious or overly eager classification. Since we are biased against false positives (we would rather classify a hotel as a non-duplicate than wrongly classify it as a duplicate of one it is not), the F0.25 score seemed a reasonable choice – this puts a higher weight on precision than recall.
See http://en.wikipedia.org/wiki/F1_score for a full explanation.
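As a sketch, the F-beta score can be computed directly from the confusion-matrix counts; with beta = 0.25 precision is weighted more heavily than recall:

```python
def f_beta(tp, fp, fn, beta=0.25):
    """Standard F-beta score from confusion-matrix counts; beta < 1 favours precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```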
Maximisation of F score => minimisation of -F.
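Putting the pieces together, the training loop might look something like the sketch below, reusing f_beta from above. `evaluate_parameters` is a hypothetical stand-in for running the matcher over the whole training set with the given boosts and threshold and tallying the confusion matrix; the starting values are arbitrary.

```python
from scipy.optimize import fmin

def objective(params):
    min_score, hotel_name, resort_name, address, country = params
    # evaluate_parameters is a hypothetical helper: run the matching query over
    # the training set with these boosts/threshold and tally the confusion matrix.
    tp, fp, fn, tn = evaluate_parameters(min_score, hotel_name,
                                         resort_name, address, country)
    return -f_beta(tp, fp, fn, beta=0.25)   # minimise -F to maximise F

initial_guess = [1.0, 1.0, 1.0, 1.0, 1.0]   # arbitrary starting point
best_params = fmin(objective, initial_guess)  # downhill simplex (Nelder-Mead)
```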
Scaling up compute resources
The training process is a compute-intensive one – it involves many iterations of evaluating the parameters against the substantial training set. To scale this I took advantage of EC2 spot instances, spinning up 5x m1.large instances that joined the elasticsearch cluster and replicated the index in order to boost query throughput. This brought the time of each iteration down to about 20s, so the whole training process could be run in a manageable 20-30 minutes.
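For reference, the only index change needed for the extra nodes to share query load is raising the replica count so each new node holds a copy of the index; the index name and replica count below are illustrative.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
# One replica per extra node, so every spot instance can serve queries locally.
es.indices.put_settings(index="hotels", body={"index": {"number_of_replicas": 5}})
```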
Results
The classifier arrived at these parameters:
parameter | value |
---|---|
min_score | 5.311 |
hotel_name | 4.574 |
resort_name | 0.315 |
address | 0.598 |
country | 0.393 |
Intuitively these look to be roughly what you’d expect – the hotel name is weighted with the greatest importance, then address, country and finally resort name. It is surprising that hotel name is weighted so much higher than the rest, but this is perhaps down to the expected number of matching terms – an address is likely to have many more terms in common, and so a greater aggregate score, than a hotel name that may only be one or two words.
The evaluation of these parameters against the validation data produces these results:
. | actual True | actual False |
---|---|---|
predicted True | 4131 | 72 |
predicted False | 338 | 156 |
Metric | Score |
---|---|
Precision | 0.983 |
Recall | 0.924 |
F1 score | 0.953 |
F0.25 score | 0.979 |
Accuracy | 0.913 |
MCC | 0.426 |
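For reference, the headline figures follow directly from the confusion matrix above: precision = 4131 / (4131 + 72) ≈ 0.983, recall = 4131 / (4131 + 338) ≈ 0.924, and F1 (the harmonic mean of the two) = 2 × 0.983 × 0.924 / (0.983 + 0.924) ≈ 0.953.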
The 98.3% precision is very good, and the 92.4% recall reasonable. Since the training was biased towards precision this is the expected trade-off. Optimising against the unbiased F1 instead produced 97.3% precision and 97.5% recall – roughly a 5-point gain in recall at the cost of a 1-point drop in precision, i.e. more hotels incorrectly classified as duplicates when they are not, which was undesirable.
Overall, the classifier training was successful, and has been in use in production for some time now.