Google Webmaster Tools Using The New API Fixed September 11, 2014. Posted by Neuromancer in SEO.
Whilst experimenting at work with the new GWT API, I found that the Python example Google provides has a bug, as well as some rather opaque documentation that misses out a crucial step. When you set up your credentials you also have to make sure that the consent screen section is filled out with at least the project name and a support email!
Updated script follows
# --- MJW Fixed Bugs
import logging
import httplib2

logging.basicConfig()  # added to fix the logging warning

from apiclient import errors
from apiclient.discovery import build
from oauth2client.client import OAuth2WebServerFlow

# Copy your credentials from the console - remember to update the consent screen with a project name and a support email
CLIENT_ID = 'xxxxx'
CLIENT_SECRET = 'yyyy'
# Check https://developers.google.com/webmaster-tools/v3/ for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
# Redirect URI for installed apps
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
# Run through the OAuth flow and retrieve credentials
flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, REDIRECT_URI)
authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ').strip()
credentials = flow.step2_exchange(code)
# Create an httplib2.Http object and authorize it with our credentials
http = httplib2.Http()
http = credentials.authorize(http)
webmasters_service = build('webmasters', 'v3', http=http)
# Retrieve list of websites in account
site_list = webmasters_service.sites().list().execute()
# Remove all unverified sites - note this is a change: the original sample uses 'url' rather than 'siteUrl'
verified_sites_urls = [s['siteUrl'] for s in site_list['siteEntry'] if s['permissionLevel'] != 'siteUnverifiedUser']
# Print the URL of each site you are verified for, plus the sitemaps submitted for it
for site_url in verified_sites_urls:
    print site_url
    # Retrieve list of sitemaps submitted for this site
    sitemaps = webmasters_service.sitemaps().list(siteUrl=site_url).execute()
    if 'sitemap' in sitemaps:
        sitemap_urls = [s['path'] for s in sitemaps['sitemap']]
        print " " + "\n ".join(sitemap_urls)
Setting up a Home Hadoop Lab #1: Starting Out December 13, 2013. Posted by Neuromancer in hadoop.
I finally bit the bullet and bought some new hardware with the aim of setting up a small home lab for learning Hadoop. I set up a small three-machine cluster whilst at RBI (part of Reed Elsevier) to have a look at ML and AI, and as I have the time I was interested in learning Hadoop, especially as in the past I worked on a billing system that used the map-reduce programming paradigm on PR1ME 750s (don't knock it - back then 17 750s running together was not to be sniffed at).
There are lots of sites dedicated to building home labs for VMware/vSphere and Cisco, but none for Hadoop, so I thought I ought to document the process here.
After debating finishing off my Antec 900 build and running on that using VMware Player, I decided to use a popular HP MicroServer, as they are cheap: around £120 for the base machine, with a £50 rebate until the end of December.
Out of the box it can take four SATA drives, and with tweaks it can take two more in the top 5.25 inch bay: a 3.5 inch and a 2.5 inch in a suitable adapter.
The main aim is to run a small lab cluster of virtual Hadoop machines so that I can get to grips with Hadoop and MapReduce, get my hands dirty (instead of just using a single VM), understand how it all works under the hood, and play with some of the newer Hadoop features such as NameNode HA.
Yes, I know that Hadoop's performance under virtual machines isn't going to be very good; this is primarily a learning environment, not one you would use for doing useful work.
I have the machine booted and, after a struggle, managed to create a bootable ESXi USB stick. I need to buy a bigger USB stick, as I want to install the hypervisor on that, leaving all the disks free for VMs.
The next step is to buy more memory to take the system up to its maximum of 16GB, and some more DASD (disk drives, for you civilians).
Problems with Google's Parsing of lastmod in Google Sitemaps August 24, 2012. Posted by Neuromancer in SEO.
Tags: enterprise-it, software
I found an interesting problem with XML sitemaps and the way Google seems to handle the lastmod time. The sitemap protocol uses the W3C Datetime format as the standard. I thought I would write this up and put it out there.
I was seeing errors in GWT for one of our sites for a recently updated sitemap. GWT was complaining about errors in date-time formatting.
Looking at the code (original site redacted), it all looked fine and validated both with the W3C validator and xmllint. The only thing I could see that looked “hinky” was the use of fractional seconds: this sitemap index used last-modified dates with fractional seconds to six decimal places, i.e. to the nearest millionth of a second.
This is perfectly valid; unfortunately, reading the small print of the time standard you're supposed to use for the last-modified date, the fractional-second part is not strictly defined:
“This profile does not specify how many digits may be used to represent the decimal fraction of a second. An adopting standard that permits fractions of a second must specify both the minimum number of digits (a number greater than or equal to one) and the maximum number of digits (the maximum may be stated to be “unlimited”).”
Unfortunately the sitemap protocol does not define the number of digits used for fractional seconds, and Google reports invalid-date errors when it parses these date-times. In fact it doesn't seem to like shorter fractional-second values either.
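To make the difference concrete, here is a minimal sketch of my own (illustrative only, not the redacted production code) that prints the same timestamp both ways: the fractional-second W3C Datetime form Google was flagging, and a whole-second form that sidesteps the ambiguity.

# Illustrative sketch only - not the redacted sitemap code discussed above
from datetime import datetime

def lastmod_fractional(dt):
    # W3C Datetime with six-digit fractional seconds: valid per the spec, but flagged by GWT
    return dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + '+00:00'

def lastmod_whole_seconds(dt):
    # The same timestamp truncated to whole seconds - the safest form to emit
    return dt.strftime('%Y-%m-%dT%H:%M:%S') + '+00:00'

now = datetime.utcnow()
print '<lastmod>' + lastmod_fractional(now) + '</lastmod>'
print '<lastmod>' + lastmod_whole_seconds(now) + '</lastmod>'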
This is an example of sloppy work in defining the sitemap standard and, to a lesser extent, on Google's part in not handling an unlimited number of digits, which, since the sitemap standard is silent, is the obvious fail-safe assumption they should have made.
I do wonder how some of today's crop of PFPs (Pimply-Faced Programmers) would cope with full-on seven-layer OSI-based systems. I wonder if Vint has any old OSI blue books to act as an example of how to write more stringent standards (and yes, I am aware of the pitfalls of the OSI model of standards making).
Link Builder Chutzpah or Spearlink Spam Attack July 20, 2012. Posted by Neuromancer in SEO.
I work for RBI in the in-house SEO team, and we recently received an email that was part of a link-building campaign. They had obviously scraped one of our large sites, ICIS (a big chemical and energy industry site), and found that we had linked to a university article on biofuels which had subsequently been taken down.
They then produced a page with similar information to the now-404 page and suggested that we could replace the broken link with their link, which was on a car insurance site :-)
From: XXXXX XXXXXX [mailto:XXXX.XXXXX17@gmail.com]
Sent: 18 July 2012 12:43
To:
Subject: Broken link on your page

Hi,
I came across your website and wanted to notify you about a broken link on your page in case you weren't aware of it. The link on http://www.icis.com/blogs/biofuels/archives/biodiesel which links to http://www.example.edu/p2/biodiesel/article_alge.html is no longer working. I've included a link to a useful page on biodiesel that you could replace the broken link with if you're interested in updating your site. Thanks for providing a great resource!
Link: http://www.example.org/algae-solutions
Best,
XXXX
Certainly this link builder wins an award for sheer chutzpah. This approach to link building is similar to a spear-phishing attack, where an email is targeted directly at a specific individual and tailored to the interests of the victim/mark, which is why I dubbed this a spearlink attack.
Unfortunately for them, the recipient immediately realised that this looked dodgy. Against a government site, or one with less savvy staff, this attack could easily have succeeded.
It's also interesting that they used a blog post and not an article on the main site; maybe the RSS feed for the MT blog was used, rather than a crawl of the main site.
Tags: google, internet
Penguin 24th April
Google's latest update, the Penguin update, launched on April 24. It was a change to Google's search results designed to remove pages that have been spamming Google. Spamming in this case is where people do things like "keyword stuffing", "hiding text" or "cloaking" that violate Google's guidelines.
Panda 3.5 19th April
On the 19th an update of the Panda algorithm was launched. Panda is an algorithm designed to promote higher-quality pages over lower-quality sites.
Parked Domains Problem April 17th
Google also made a rare admission that it made a mistake: on the 17th of April they had a problem that was incorrectly identifying sites as parked domains. A parked domain is one that you own but that has no content apart from a holding page.
This is an executive summary of a longer post at Search Engine Land here.
Best Adsense Fail, or Scary Devil Nunnery Recruiting - and an SEO Fail on jobs.guardian.co.uk September 23, 2011. Posted by Neuromancer in SEO.
Whilst perusing the Guardian's jobs section to analyse the platform they use, I found a number of ways to completely mess up that entire section of the site, and I also found what must be the strangest Adsense advert of all time.
Having managed to create arbitrary pages on the job site, I took a look at the Adsense served up at the base of the page, which is shown here (note the faked page I created was IT related).
Though I must say, holding SEO audits in the style of the "Congregation for the Doctrine of the Faith" does appeal sometimes, especially when one comes across pages whose markup could best be described as “You're 'aving a laugh, mate”. Though I suspect that HR might whinge when we took people down to the basement for the “shewing of the instruments”.
HPCC – High Performance Computing Cluster Open Sourced June 28, 2011. Posted by Neuromancer in HPCC.
I love my job: I get to play with large amounts of data, some cool new cutting-edge cloud-based toys such as MapReduce and Mahout, and some interesting Web 2.0 machine learning and AI type algorithms.
MapReduce is a software framework developed by Google to allow processing of large datasets on clusters of commodity computers. In an odd coincidence, the Map stage of MapReduce is effectively the same approach we used at Telecom Gold to handle processing the large logs in the Telecom Gold billing system, with a system called GLE (Generic Log Extract), written in PL/1.
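For anyone who hasn't met the paradigm, here is a minimal sketch of my own (a hypothetical word-count job in the Hadoop Streaming style, nothing to do with GLE or the clustering work below): a mapper emits key/value pairs, a reducer aggregates them, and the framework handles the sort and shuffle in between.

# mapper.py - my own illustrative word-count mapper (Hadoop Streaming style):
# emit one "word<TAB>1" line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%d' % (word.lower(), 1)

# reducer.py - sums the counts for each word; Hadoop delivers mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print '%s\t%d' % (current_word, current_count)
        current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%d' % (current_word, current_count)

You can sanity-check the pair locally with: cat input.txt | python mapper.py | sort | python reducer.py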
After some hacking I have got a small test cluster up and running to try out MapReduce for some interesting work on clustering documents, in this case web pages on some well-known large websites.
I was having some difficulty in getting Mahout (an open-source set of machine-learning algorithms) to perform clustering of documents using MapReduce, when almost by chance I found that our parent company has its own system: HPCC (High Performance Computing Cluster), a massively parallel-processing computing platform that solves the same sort of Big Data problems that MapReduce is used for.
HPCC used to be just an internal system developed by Lexis Nexis and has been used for Lexis Nexis customers for the past decade, but recently (i.e. last week) HPCC was open sourced. As with Hadoop, there is a web-based interface.
There is also a Windows IDE which connects directly to an HPCC cluster and allows you to run ECL, the declarative, non-procedural language used to program jobs to be run on your HPCC cluster.
Panda Update Hits the UK and All English Queries April 11, 2011. Posted by Neuromancer in SEO.
Tags: Farmer, Panda, SEO
It looks like the infamous Panda Google update has arrived outside of the USA. According to Google, the Panda update (sometimes called Farmer/Panda) is meant to better identify low-quality pages and sites.
These are the sort of pages (often seen on “content farms”) with text that is automatically tuned to match the query but may not provide the best user experience. (Google apparently calls it a “high quality sites algorithm”.)
I am due to help give a presentation on SEO to a group of RBI's developers on Wednesday, so guess who's going to be quickly revamping the presentation deck tomorrow, as well as grovelling in the Analytics data to see if any of our sites have been hit.
Google have just launched some new functionality in GWT (Google Webmaster Tools): two new items in the HTML suggestions, Non-Informative Title Tags and Non-Indexable Content.
This could be useful in diagnosing problems in sites that need fixing, especially as a non-informative title tag is a big low-quality signal.
Steam Punk Sarah Palin February 9, 2011. Posted by Neuromancer in Uncategorized.
Tags: graphic novels, sara palin, WTF
Comics, or graphic novels if we are being pretentious, have had some odd one-offs and crossovers, and recently a genre which mixes science fiction with the Jules Verne / H. G. Wells era, called steampunk, has become popular.
And what did I find…. Drum roll please! Ladies and Gentlemen I give you Steam Punk Sarah Palin.
One reviewer commented
“Steampunk Palin defies classification into any literary genre, unless there’s a genre I’m unaware of simply called ‘WTF?!?’”
It seems to be in "so bad it's good" territory; I can't wait for the film. A review is here.
RIP Gladys Horton of The Marvelettes February 2, 2011. Posted by Neuromancer in Music.
Tags: Marvelettes, Motown, Soul
Sad to see that Gladys Horton, one of the founders of the Marvelettes, has recently passed away. I saw her obituary in the Guardian the other day. I thought I should post a link to one of my favourite Marvelettes tracks, from an early Harlem Apollo show in '63.
As you can see, they were doing the moonwalk years before Michael Jackson, and in high heels!
Bing Copying – Google Throws Toys out of Pram February 1, 2011. Posted by Neuromancer in SEO.
Oh dear, it sounds like Google is getting upset over Bing using Google's results to improve theirs, from the write-up on Search Engine Land here.
Google has run a sting operation that it says proves Bing has been watching what people search for on Google, the sites they select from Google’s results, then uses that information to improve Bing’s own search listings. Bing doesn’t deny this.
Reverse engineering is legal, otherwise we would all still be using IBM PCs. Google should just man up and take it as a compliment.
Black Templar Space Marines WH40K January 27, 2011. Posted by Neuromancer in War Games.
I have been thinking about doing some WH40K gaming, and on a visit to the mothership at Warhammer World I finally broke down and bought a Space Marine battleforce, which is a basic starter set. I also looked at the various different Space Marine factions and was taken with the Black Templars.
The Black Templars are a non-standard Chapter of Space Marines who deviate in a number of ways from the norm. Most of the time they fight in Companies which are formed in an ad hoc manner; the individual squads and specialists fight side by side out of familiarity and comradeship rather than any imposed organisation.
They also have a stark black and white colour scheme, which appealed, though when I looked at the difficulty of painting black armour I did wonder if I had bitten off more than I could chew. However, I have made a start undercoating a 10-man unit and am in the process of painting them up; I'll post some pics when I dig out my camera, though I might not get to this standard.
I am just at the stage of doing the white shoulder pads, which require multiple coats of grey as a second undercoat so that the white stands out against the black undercoat.
I also went to GW in London and Bedford and bought some more a few days later, and have some vehicle models which will be done later.
Quora January 8, 2011. Posted by Neuromancer in SEO.
I've just been playing with a new site, Quora, that is the new hotness :-)
Basically it's a site where you can post and answer questions. They describe it as:
Quora is a continually improving collection of questions and answers created, edited, and organized by everyone who uses it.
You can see my Quora profile here