I love my job. I get to play with large amounts of data and some cool, cutting-edge, cloud-based toys such as MapReduce and Mahout, along with some interesting Web 2.0 machine learning and AI-type algorithms.
MapReduce is a software framework developed by Google for processing large datasets on clusters of commodity computers. In an odd coincidence, the map stage of MapReduce is effectively the same approach we used at Telecom Gold to process the large logs in the Telecom Gold billing system, using a system called GLE (Generic Log Extract), written in PL/I.
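To make the map stage a bit more concrete, here is a minimal single-machine sketch in Python of the map, group and reduce steps applied to billing-style log lines. The log format, account names and function names are all made up for illustration; this is not how GLE worked internally, and a real MapReduce job would spread the same steps across a cluster.

```python
from collections import defaultdict

# Hypothetical log format (an assumption): "<account> <service> <seconds_used>"
LOG_LINES = [
    "acct1 email 30",
    "acct2 telex 12",
    "acct1 email 45",
    "acct2 email 8",
]


def map_line(line):
    """Map step: turn one raw log line into a (key, value) pair."""
    account, _service, seconds = line.split()
    return account, int(seconds)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = [map_line(line) for line in lines]
    grouped = shuffle(mapped)
    return dict(reduce_values(key, values) for key, values in grouped.items())


totals = run_job(LOG_LINES)
print(totals)  # {'acct1': 75, 'acct2': 20}
```

The map step looks at each line on its own, which is what lets the work be split across many machines; only the grouping step needs to bring records with the same key together.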
After some hacking I have got a small test cluster up and running to try out MapReduce for some interesting work on clustering documents, in this case web pages from some well-known large websites.
I was having some difficulty getting Mahout, an open source set of algorithms, to cluster documents using MapReduce when, almost by chance, I found that our parent company has its own system. HPCC (High Performance Computing Cluster) is a massively parallel processing platform that solves the same kind of Big Data problems that MapReduce is used for.
HPCC used to be purely an internal system, developed by LexisNexis and used for LexisNexis customers for the past decade. But recently (i.e. last week) HPCC was open sourced. As with Hadoop, there is a web-based interface.
There is also a Windows IDE that connects directly to an HPCC cluster and lets you run ECL, the declarative, non-procedural language used to program the jobs that run on your HPCC cluster.