Kibana baby kick counter - part 2

This is part 2 of 2 about using Elasticsearch and Kibana to track patterns in baby activity. Part 1 here covers the hardware and setup for tracking baby kicks.

Machine learning

Having collected about 6 weeks of baby kicking data, it’s time to test the new toy in the Elasticsearch stack: machine learning. Installing it was a straightforward case of following the instructions. From the new ‘Machine Learning’ menu item in Kibana, I chose ‘Create new job’, then ‘Create a single metric job’: [Read More]

Kibana baby kick counter

This is part 1 of 2 about using Elasticsearch and Kibana to track patterns in baby activity. Part 2 is here.

Kicking things off

According to countthekicks.org, “Counting baby kicks is important because changes in your baby’s movement pattern may indicate potential problems with your pregnancy”. Counting and patterns sound like a technological problem for Elasticsearch and Kibana! There are a few apps dedicated to tracking kicks (over 20 on Android at last count), but they’re mostly crap (lacking detailed history or useful insights) and, being a data geek, I’m not entirely happy handing over the data without a definite means to export it. [Read More]

Elasticsearch as a smart cache

A novel use of Elasticsearch in the context of holiday search, not as a traditional store of persistent documents, but as a rolling cache of transient holiday packages. This puts unique demands on Elasticsearch in terms of index and deletion rate, concurrent with a high query rate. Figures below are from February 2014, so will have changed.

Fancy a holiday?

The travel industry works around the principle of ‘dynamic packaging’ of holidays. [Read More]

Betting on the Twitter stream with Elasticsearch and Kibana

Sentiment analysis, Elasticsearch and Kibana

The idea

From the Twitter streaming API grab tweets for a live TV show whilst it’s showing, classify by contestant, analyze sentiment and provide a real-time dashboard into the outlook. Try to predict good bets on the winners. Optional: Gamble. :-)

The tools

Grabbing the data: Scala and the twitter4j Java library
Sentiment analysis: SentiStrength
Search engine: Elasticsearch
Dashboard: Kibana (also from the Elasticsearch guys)

The show

I picked the final of the BBC’s “Strictly Come Dancing” in the UK. [Read More]

Scala Future and the Elasticsearch Java API

Most examples of the Elasticsearch Java API you’ll have seen follow the prepare/execute/actionGet pattern (in Scala):

    val response = client.prepareIndex("twitter", "tweet", "1")
      .setSource(jsonBuilder()
        .startObject()
        .field("user", "kimchy")
        .field("postDate", new Date())
        .field("message", "trying out Elastic Search")
        .endObject()
      )
      .execute()
      .actionGet()

This blocks in actionGet(). If you’re developing in a reactive environment (e.g. Akka or Play), blocking is a no-no. And even if you’re not, it’s a win to avoid blocking so you can parallelize indexing to eke out the best performance. [Read More]
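As a rough illustration of the non-blocking alternative (a minimal sketch, not necessarily the approach the full post takes), a callback-style API can be bridged to a Scala Future by completing a Promise from the callback. To keep the snippet runnable without the Elasticsearch jar, `Listener` is a hypothetical stand-in for the client's callback interface, and `NonBlocking.toFuture` is a name made up for this example.

```scala
import scala.concurrent.{Future, Promise}

// Hypothetical stand-in for the Elasticsearch client's callback interface,
// declared here so the sketch compiles on its own.
trait Listener[T] {
  def onResponse(response: T): Unit
  def onFailure(e: Throwable): Unit
}

object NonBlocking {
  // Turn "register a listener" into a Future: the Promise is completed from
  // the callback, so no thread sits waiting the way it does in actionGet().
  def toFuture[T](register: Listener[T] => Unit): Future[T] = {
    val promise = Promise[T]()
    register(new Listener[T] {
      def onResponse(response: T): Unit = promise.success(response)
      def onFailure(e: Throwable): Unit = promise.failure(e)
    })
    promise.future
  }
}
```

With the real client, the `register` argument would presumably hand the listener to the request in place of calling execute().actionGet(); the exact method to use depends on the client version, so treat that wiring as an assumption. The resulting Future then composes with map, flatMap and Future.sequence, which is what makes parallel indexing straightforward.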

Spark and Elasticsearch

Elastic Sparkle

If you work in the Hadoop world and have not yet heard of Spark, drop everything and go check it out. It’s a really powerful, intuitive and fast map/reduce system (and then some). Where it beats Hadoop/Pig/Hive hands down is that it’s not a massive stack of quirky DSLs built on top of layers of clunky Java abstractions: it’s a simple, pure Scala functional DSL with all the flexibility and succinctness of Scala. [Read More]

Introducing elasticsearch*

Just open-sourced a Starcluster plugin for provisioning Elasticsearch clusters automatically in the cloud.

What is Starcluster?

Starcluster is great for quickly spinning up clusters of EC2 nodes for some analytics / data chomping. If you’re not familiar with it, take a look at this very cool screencast:

Starcluster and Elasticsearch = best friends

Elasticsearch is a great fit: it runs smoothly on EC2, and for ease of searching big data it is second to none. [Read More]

A duplicate classifier using elasticsearch

In search-based big data projects, content duplication is frequently an obstacle to having a clean data source. Here’s an approach to improving the data quality by training a classifier to spot duplicates.

The Problem

The data set has about 470,000 non-unique hotel descriptions (e.g. name, metadata, images) provided by 11 bed bank services. These are received in various feed formats and include hotel name, resort name, address, country, description and image data, among other fields. [Read More]