Setting up a home hadoop lab #1 starting out December 13, 2013

Posted by Neuromancer in hadoop.
3 comments

I finally bit the bullet and bought some new hardware with the aim of setting up a small home lab for learning Hadoop. I set up a small 3 machine cluster whilst at RBI (part of Reed Elsevier) to have a look at ML and AI, and as I now have the time I was interested in learning Hadoop, especially as in the past I worked on a billing system that used the map reduce programming paradigm on PR1ME 750s. (Don't knock it: back then 17 750s running together was not to be sniffed at.)

A Prime 750

There are lots of sites dedicated to building home labs for VMware/vSphere and Cisco but none for Hadoop, so I thought I ought to document the process here.

After debating finishing off my Antec 900 build and running on that using VMware Player, I decided to use a popular MicroServer from HP as they were cheap: around £120 for the base machine, and there is a £50 rebate until the end of December.

Out of the box it can take 4 SATA drives, and with tweaks it can take two more in the top 5.25 inch bay: a 3.5 inch and a 2.5 inch in a suitable adapter.

The main aim is to run a small lab cluster of virtual Hadoop machines so that I can get to grips with Hadoop and MR, get my hands dirty (instead of just using a single VM), understand how it all works under the hood, and play with some of the newer Hadoop features such as NameNode high availability (NN HA).

Yes, I know that Hadoop's performance under virtual machines isn't going to be very good; this is primarily a learning environment, not one you would use for doing useful work.

I have the machine booted and, after a struggle, managed to create a bootable ESXi USB stick. I need to buy a bigger USB stick as I want to install the hypervisor on that, leaving all the disks free for VMs.

The next step is to buy more memory to take the system up to its max of 16GB, and some more DASD (disk drives for you civilians).

Problems with Google's parsing of lastmod in Google sitemaps August 24, 2012

Posted by Neuromancer in SEO.

Found an interesting problem with XML sitemaps and the way Google seems to handle the lastmod time. The sitemap protocol uses W3C Datetime as the standard. I thought I would write this up and put it out there.

I was seeing errors in GWT for one of our sites for a recently updated sitemap. GWT was complaining about errors in date-time formatting. If we look at the code (original site redacted):

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/site-map-type-subtype-pages-1.xml</loc>
<lastmod>2012-08-17T10:39:18.6685044Z</lastmod>
</sitemap>
<sitemap>
.
.
.
This all looks fine and validates with both W3C and xmllint. The only thing I could see that looked “hinky” is the use of fractional seconds: this sitemap index uses last modified dates with fractional seconds to 7 decimal places, i.e. to the nearest ten-millionth of a second.

2012-08-16T07:58:15.5396524Z

This is perfectly valid. Unfortunately, reading the small print of the time standard you're supposed to use for the last modified date, the fractional second part is not strictly defined:

“This profile does not specify how many digits may be used to represent the decimal fraction of a second. An adopting standard that permits fractions of a second must specify both the minimum number of digits (a number greater than or equal to one) and the maximum number of digits (the maximum may be stated to be “unlimited”).”

Unfortunately the sitemap protocol does not define the number of digits used for fractional seconds, and Google is reporting invalid date errors when it parses the date time. In fact it doesn't seem to like shorter usage of fractional seconds at all.

This is an example of sloppy work in defining the sitemap standard and, to a lesser extent, on Google's part in not handling an unlimited number of digits, which, as the sitemap standard is silent, is the obvious fail-safe assumption they should have made.

I do wonder how some of today's crop of PFP (Pimply Faced Programmers) would cope with full-on seven layer OSI based systems. I wonder if Vint has any old OSI blue books to act as an example of how to write more stringent standards. (And yes, I am aware of the pitfalls of the OSI model of standards making.)
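One way to sidestep the problem when generating a sitemap is to drop the fractional part before writing the lastmod value. The following is a minimal Python sketch, not a tested fix: the function name `strip_fractional_seconds` is made up for illustration, and it assumes that whole-second timestamps are accepted by the parser, which has not been verified here.

```python
import re

# Matches a time of day followed by a fractional-seconds part, e.g. the
# "T10:39:18.6685044" in "2012-08-17T10:39:18.6685044Z".
_TIME_WITH_FRACTION = re.compile(r"(T\d{2}:\d{2}:\d{2})\.\d+")


def strip_fractional_seconds(lastmod: str) -> str:
    """Return the W3C datetime string with any fractional seconds removed.

    Strings without a fractional part (date only, or whole seconds) are
    returned unchanged. The timezone designator is left as it is.
    """
    return _TIME_WITH_FRACTION.sub(r"\1", lastmod)


if __name__ == "__main__":
    print(strip_fractional_seconds("2012-08-17T10:39:18.6685044Z"))
```

Truncating rather than rounding keeps the sketch simple and never moves a timestamp forward in time.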

Link builder Chutzpah or Spearlink Spam Attack July 20, 2012

Posted by Neuromancer in SEO.

I work for RBI in the in-house SEO team and we recently received an email that was part of a link building campaign. They had obviously scraped one of our large sites, ICIS (a big chemical and energy industry site), and found that we had linked to a university article on biofuels which had subsequently been taken down.

They then produced a page with similar information to the now 404 page and suggested that we could replace this broken link with their link, which was on a car insurance site :-)

From: XXXXX XXXXXX [mailto:XXXX.XXXXX17@gmail.com] 
Sent: 18 July 2012 12:43
To: 
Subject: Broken link on your page

Hi ,

I came across your website and wanted to notify you about a broken link on your page in case you weren't aware of it. The link on http://www.icis.com/blogs/biofuels/archives/biodiesel which links to http://www.example.edu/p2/biodiesel/article_alge.html is no longer working.

I've included a link to a useful page on biodiesel that you could replace the broken link with if you're interested in updating your site. Thanks for providing a great resource!

Link: http://www.example.org/algae-solutions

Best,

XXXX

Certainly this link builder wins an award for sheer chutzpah. This approach to link building is similar to a spear phishing attack, where an email is targeted directly at a specific individual and tailored to the interests of the victim/mark, which is why I dubbed this a Spearlink attack.

Unfortunately for them, the recipient immediately realised that this looked dodgy. Against a government site, or one with less savvy staff, this attack could easily have succeeded.

It's also interesting that they used a blog and not an article on the main site; maybe the RSS feed for the MT blog was used rather than a crawl of the main site.

Google's April Algorithm Changes Panda 3.5 19 April and Penguin 24th April April 30, 2012

Posted by Neuromancer in SEO.

Google have released a number of different algorithm updates this month; the following is a brief description of the changes Google have made in April. I did a brief summary for our internal users and thought it might be useful for the wider internet.

Penguin  24th April

Google's latest update, the Penguin update, launched on April 24. It was a change to Google's search results that was designed to remove pages that have been spamming Google. Spamming in this case is where people do things like “keyword stuffing”, “hiding text” or “cloaking” that violate Google's guidelines.

Panda 3.5 19th April

On the 19th an update of the Panda algorithm was launched. Panda is an algorithm designed to promote higher quality pages over  lower quality sites.

Parked Domains Problem April 17th

Google also made a rare admission that it made a mistake – on the 17th April they had a problem that was incorrectly identifying sites as parked domains. A parked domain is one that you own but has no content apart from a holding page.

This is an executive summary of a longer post at Search Engine Land here

Best Adsense Fail or Scary Devil Nunnery Recruiting – and a SEO Fail on jobs.guardian.co.uk September 23, 2011

Posted by Neuromancer in SEO.

Whilst perusing the Guardian's jobs section to analyse the platform they use, I found a number of ways to completely mess up that entire section of the site, and I also found what must be the strangest AdSense advert of all time.

Having managed to create arbitrary pages on the job site, I took a look at the AdSense served up at the base of the page, which is shown here (note the faked page I created was IT related).

Strange Ad Sense Advert

Though I must say holding SEO audits in the style of the “Congregation for the Doctrine of the Faith” does appeal sometimes, especially when one comes across pages whose markup could best be described as “You're 'aving a laugh mate”. Though I suspect that HR might whinge if we took people down to the basement for the “shewing of the instruments”.

HPCC – High Performance Computer Cluster Open Sourced June 28, 2011

Posted by Neuromancer in HPCC.
1 comment so far

I love my job: I get to play with large amounts of data and some cool new cutting edge cloud based toys such as Map Reduce and Mahout, and some interesting Web 2.0 machine learning and AI type algorithms.

Map Reduce is a software framework developed by Google to allow processing of large datasets on clusters of commodity computers. In an odd coincidence, the Map stage of Map Reduce is effectively the same approach we used at Telecom Gold to handle processing the large logs in the Telecom Gold billing system, with a system called GLE, Generic Log Extract (written in PL/1).
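To make the idea concrete, here is a toy single-machine sketch of the map, shuffle and reduce steps in Python. It is only an illustration: the log format (an account name followed by a number of seconds) and all the function names are invented for this example and are not taken from GLE, Hadoop or any real billing system.

```python
from collections import defaultdict


def map_line(line):
    """Map step: turn one log line into a (key, value) pair."""
    account, seconds = line.split()
    return account, int(seconds)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all the values for one key into a total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce over an iterable of log lines."""
    groups = shuffle(map_line(line) for line in lines)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical usage log: "<account> <seconds used>"
    log = ["alice 30", "bob 12", "alice 45"]
    print(run_job(log))
```

On a real cluster the map calls run in parallel on different machines against different chunks of the log, and the framework does the shuffle over the network; the shape of the job is the same.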

After some hacking I have got a small test cluster up and running to try out Map reduce for some interesting work on clustering documents, in this case web pages on some well known large websites.

I was having some difficulty in getting Mahout, which is an open source set of algorithms, to perform clustering of documents using Map Reduce, when almost by chance I found that our parent company has its own system. HPCC (High Performance Computing Cluster) is a massively parallel processing computing platform that solves the Big Data problems that Map Reduce is used for.

HPCC used to be just an internal system developed by LexisNexis and has been used for LexisNexis customers for the past decade. But recently, i.e. last week, HPCC has been open sourced. As with Hadoop there is a web based interface.

ECL Watch the Web front end for HPCC

There is also a Windows IDE which connects directly to an HPCC cluster to allow you to run ECL, which is the declarative, non-procedural language used to program jobs to be run on your HPCC cluster.

There is a test virtual machine available for download here to allow people to test HPCC and learn ECL here. Binaries for CentOS and Red Hat are available, and source should be available in a few weeks.

Panda Update Hits the UK and All English Queries April 11, 2011

Posted by Neuromancer in SEO.
1 comment so far

It looks like the infamous Panda Google update has arrived outside of the USA. According to Google the Panda update (sometimes called Farmer/Panda) is meant to better identify low-quality pages and sites.

These are the sort of pages (often seen on “content farms”) with text that is automatically tuned to match the query but that may not provide the best user experience. (Google apparently calls it a “high quality sites algorithm”.)

I am due to help give a presentation on SEO to a group of RBI's developers on Wednesday, so guess who's going to be quickly revamping the presentation deck tomorrow, as well as grovelling in the Analytics data to see if any of our sites have been hit.

Though it's my boss that will be fielding the calls from senior management, I am glad to say. Coverage here and Google's own blog here

New Functionality in GWT Non Informative Title Tags and Non Indexable Content March 23, 2011

Posted by Neuromancer in SEO.

Google have just launched some new functionality in GWT (Google Webmaster Tools): two new items in the HTML suggestions, Non Informative Title Tags and Non Indexable Content.

Non Informative Title Tags and Non Indexable Content

This could be useful in diagnosing problems in sites that need fixing, especially as a non informative title tag is a big low quality signal.

Steam Punk Sarah Palin February 9, 2011

Posted by Neuromancer in Uncategorized.
1 comment so far

Comics, or graphic novels if we are being pretentious, have had some odd one-offs and crossovers, and recently a genre called steampunk, which mixes science with Jules Verne / HG Wells era SF, has become popular.

Today I was browsing some Gawker properties to see if they had fixed the major JavaScript snafu they had.

And what did I find…. Drum roll please! Ladies and Gentlemen I give you Steam Punk Sarah Palin.

steam punk sara palin

One reviewer commented

Steampunk Palin defies classification into any literary genre, unless there’s a genre I’m unaware of simply called “WTF?!?”

It seems to be in so-bad-it's-good territory; I can't wait for the film. A review is here

and thus been fairly or unfairly labelled as “Goths who decided it might be fun to wear brown for a change.” 

Read More: http://www.comicsalliance.com/2011/01/20/steampunk-palin-comic/#ixzz1DV4hmGa9

RIP Gladys Horton of The Marvelettes February 2, 2011

Posted by Neuromancer in Music.

Sad to see that Gladys Horton, one of the founders of the Marvelettes, has recently passed away. I saw her obit in the Guardian the other day. I thought I should post up a link to one of my favorite Marvelettes tracks, from an early Harlem Apollo show in '63.

As you can see they were doing the moonwalk years before Michael Jackson, and in high heels!

Bing Copying – Google Throws Toys out of Pram February 1, 2011

Posted by Neuromancer in SEO.

Oh dear, it sounds like Google is getting upset over Bing using Google's results to improve theirs, from the write-up on Search Engine Land here.

Google has run a sting operation that it says proves Bing has been watching what people search for on Google, the sites they select from Google’s results, then uses that information to improve Bing’s own search listings. Bing doesn’t deny this.

Reverse engineering is legal, otherwise we would all still be using IBM PCs. Google should just man up and take it as a compliment.

Black Templar Space Marines WH40K January 27, 2011

Posted by Neuromancer in War Games.
1 comment so far

I have been thinking about doing some WH40K gaming, and on a visit to the mother ship at Warhammer World I finally broke down and bought a Space Marine battle force, which is a basic starter set. I also looked at the various different Space Marine factions and was taken with the Black Templars.

Black Templar Marine

The Black Templars are a non-standard Chapter of Space Marines who deviate in a number of ways from the standard. Most of the time they fight in Companies which are formed in an ad hoc manner. The individual squads and specialists fight side by side out of familiarity and comradeship rather than any imposed organisation.

They also have a stark black and white colour scheme which appealed, though when I looked at the difficulty of painting black armour I did wonder if I had bitten off more than I could chew. However I have made a start undercoating a 10 man unit and am in the process of painting them up. I'll post some pics when I dig out my camera, though I might not get to this standard.

Black Templar Miniature

I am just at the stage of doing the white shoulder pads, which require multiple coats of grey as a second undercoat so that the white stands out against the black undercoat.

I also went to GW in London and Bedford and bought some more a few days later, and have some vehicle models which will be done later.


black templar rhino black templar predator

So left to right we have shots of a Rhino APC, a Razorback MICV and lastly a Predator tank, as they look when they are made up and painted in the chapter colours. My local GW shop in Bedford is here

Quora January 8, 2011

Posted by Neuromancer in SEO.

Just been playing with a new site Quora that is the new hotness :-)

Basically it's a site where you can post and answer questions. They describe it as:

Quora is a continually improving collection of questions and answers created, edited, and organized by everyone who uses it.

You can see my Quora profile here

3000 Point AT43 Game August 15, 2010

Posted by Neuromancer in SEO.

Pics from my recent AT43 Game

Search Marketing Pro – SEP 2010 August 6, 2010

Posted by Neuromancer in SEO.
2 comments

Beer + Search geeks = a good time!

As a stand-in for the on-hiatus SEO London, my boss is arranging a networking event for SEOs and people working in search in London. At the moment it's in the advanced cat herding stage. There is an info page for Search Marketing Pro: fill in the survey and we hope to see you there.
