Sentiment analysis, Elasticsearch and Kibana
The idea
From the Twitter streaming API, grab tweets for a live TV show whilst it’s showing, classify by contestant, analyze sentiment and provide a real-time dashboard into the outlook.
Try to predict good bets on the winners. Optional: Gamble. :-)
The tools
Grabbing the data: Scala and the twitter4j Java library
Sentiment analysis: SentiStrength
Search engine: Elasticsearch
Dashboard: Kibana (also from the Elasticsearch guys)
The show
I picked the final of the BBC’s “Strictly Come Dancing” in the UK.
[Read More]
Dynamic DNS for EC2 instances
The problem
Booting instances on EC2 is easy. But once it comes to connecting to them, you end up having to copy around transient and unwieldy ec2-....compute.amazonaws.com host names.
So you manually set up a CNAME in your DNS to give the instance a friendly name, but as soon as you stop and restart the instance, the ec2 name has changed so your DNS record needs updating.
Rapidly gets tedious!
[Read More]
Finding related searches with Spark
Unsupervised learning from user behaviour
When a user navigates a site they leave a valuable trail of information - what their first search was, what they followed this search with, and so on. Using this data we can learn related searches automatically by co-occurrence counting.
This post takes you through the steps to get from raw search logs to results using the Spark cluster computing framework.
Spark provides a natural processing language for flows of data, and can be scaled up to clusters when data growth dictates.
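To make the co-occurrence idea concrete, here is a minimal sketch using plain Scala collections rather than Spark itself. The `SearchEvent` record, the `RelatedSearches` object and its method names are hypothetical, and the sketch assumes "related" means one query directly following another within the same session. The same group, pair and count steps could be expressed over a Spark RDD once the logs outgrow a single machine.

```scala
// Hypothetical log record: one search query issued within a user session.
final case class SearchEvent(sessionId: String, timestamp: Long, query: String)

object RelatedSearches {
  // Count how often one query is directly followed by another in the same session.
  // Assumption: queries are normalised by trimming and lower-casing.
  def followOnCounts(events: Seq[SearchEvent]): Map[(String, String), Int] =
    events
      .groupBy(_.sessionId)
      .values
      .flatMap { session =>
        val queries = session.sortBy(_.timestamp).map(_.query.trim.toLowerCase)
        // Pair each query with its successor, skipping immediate repeats.
        queries.zip(queries.drop(1)).filter { case (a, b) => a != b }
      }
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

  // Top `n` follow-on queries for `query`, most frequent first (ties broken alphabetically).
  def related(counts: Map[(String, String), Int], query: String, n: Int): Seq[(String, Int)] =
    counts.toSeq
      .collect { case ((first, next), count) if first == query => (next, count) }
      .sortBy { case (next, count) => (-count, next) }
      .take(n)
}
```

Feeding in a handful of sessions and calling `RelatedSearches.related(counts, "paris hotels", 2)` returns the two queries that most often followed "paris hotels", each with its count.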
[Read More]
Scala Future and the Elasticsearch Java API
Most examples of the Elasticsearch Java API you’ll have seen follow the prepare/execute/actionGet pattern (in Scala):
    val response = client.prepareIndex("twitter", "tweet", "1")
      .setSource(jsonBuilder()
        .startObject()
        .field("user", "kimchy")
        .field("postDate", new Date())
        .field("message", "trying out Elastic Search")
        .endObject()
      )
      .execute()
      .actionGet()

This blocks in actionGet().
If you’re developing in a reactive environment (e.g. Akka or Play), blocking is a no-no. And even if you’re not, it’s a win to avoid blocking so you can parallelize indexing to eke out the best performance.
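One common way to avoid the blocking call is to bridge a callback-style API to a Scala `Future` with a `Promise`. The sketch below shows only that general pattern. The `Listener` trait is a hypothetical stand-in with a success/failure shape, not the Elasticsearch listener interface, and `CallbackToFuture.toFuture` is an illustrative name rather than anything from the post.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for a callback-style listener; it only mirrors the
// general success/failure shape and is NOT the real Elasticsearch interface.
trait Listener[T] {
  def onResponse(response: T): Unit
  def onFailure(error: Throwable): Unit
}

object CallbackToFuture {
  // Adapts any function that registers a Listener into a non-blocking Future.
  def toFuture[T](register: Listener[T] => Unit): Future[T] = {
    val promise = Promise[T]()
    register(new Listener[T] {
      // Complete the promise exactly once; later callbacks are ignored.
      def onResponse(response: T): Unit = { promise.trySuccess(response); () }
      def onFailure(error: Throwable): Unit = { promise.tryFailure(error); () }
    })
    promise.future
  }
}
```

The caller gets a `Future` back straight away and can compose it with `map`, `flatMap` or `Future.sequence`, so many index requests can be in flight at once without a thread parked on each one.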
[Read More]
Spark and Elasticsearch
Elastic Sparkle
If you work in the Hadoop world and have not yet heard of Spark, drop everything and go check it out. It’s a really powerful, intuitive and fast map/reduce system (and then some).
Where it beats Hadoop/Pig/Hive hands down is it’s not a massive stack of quirky DSLs built on top of layers of clunky Java abstractions - it’s a simple, pure Scala functional DSL with all the flexibility and succinctness of Scala.
[Read More]
Migrating from GoDaddy to Amazon Route53
Due to the recent outage on GoDaddy a lot of people are reconsidering their DNS options. Amazon Route53 is a great option - cheap, flexible and well proven.
To migrate you first need to export a zone file for the domain from GoDaddy.
It’s been pointed out that the CNAME records in the exported zone files are slightly broken, so you may need to run this fix over them:
$ perl -pe 's/(CNAME .+)(?<!\.)$/$1./i' broken.txt > fixed.
[Read More]
A duplicate classifier using elasticsearch
Frequently with search-based big data projects, content duplication is an obstacle to having a clean data source. Here’s an approach to improving the data quality by training a classifier to spot duplicates.
The Problem
The data set has about 470,000 non-unique hotel descriptions (e.g. name, metadata, images) provided by 11 bed bank services. These arrive in various feed formats and include hotel name, resort name, address, country, description and image data, among other fields.
[Read More]
iboto screencast
Made a screencast (my first!) of iboto to give a demo of the Amazon EC2 multi-account coolness:
Incidentally, making screencasts on Linux was a bit of a slog until I found the right tools and workarounds, so that might make the subject of a blog post of its own.
Route 53 latency based routing
Amazon have launched a neat new Route 53 feature: latency-based routing. The idea is that when someone hits www.yoursite.com, the name resolves to the server closest to them, cutting latency.
This DNS cleverness has been used by the big boys for some time, but has not been available to us mortals without shelling out big bucks to someone like neustar/ultradns (shudder).
The ’location’ of your server is determined by multiple DNS records for a given lookup, each with an EC2 region attached to them (us-east-1, eu-west-1, etc.
[Read More]
Hosting a jekyll blog on Amazon S3
** Note: I’ve replaced jekyll with the equally adept pelican now. **
This article describes how to host your own static blog/site on S3. It revolves around the evolution this site has taken.
First off I started using GitHub’s public site feature. Dead neat, nice set of features and so quick to get running. The problem is I’m impatient, and often after pushing an update the “Page build successful” message can take upwards of 30 minutes to arrive.
[Read More]