It’s all for a good cause as we raise money and awareness for prostate and testicular cancer. Cancer touches everyone: friends, family and even co-workers. As such, it’s a cause we can happily get behind by looking ridiculous for a month.
Whilst experimenting at work with the new Google Webmaster Tools (GWT) API I found that the Python example Google provides has a bug, as well as some rather opaque documentation that misses out a crucial step. When you set up your credentials you also have to make sure that the consent screen section is filled out with at least the project name and a support email!
Updated script follows
# --- MJW Fixed Bugs
import logging
import httplib2
logging.basicConfig() # added to fix a logging warning
from apiclient import errors
from apiclient.discovery import build
from oauth2client.client import OAuth2WebServerFlow
# Copy your credentials from the console; remember to update the consent screen with a name and a support email
CLIENT_ID = 'xxxxx'
CLIENT_SECRET = 'yyyy'
# Check https://developers.google.com/webmaster-tools/v3/ for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
# Redirect URI for installed apps
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
# Run through the OAuth flow and retrieve credentials
flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, REDIRECT_URI)
authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ').strip()
credentials = flow.step2_exchange(code)
# Create an httplib2.Http object and authorize it with our credentials
http = httplib2.Http()
http = credentials.authorize(http)
webmasters_service = build('webmasters', 'v3', http=http)
# Retrieve list of websites in account
site_list = webmasters_service.sites().list().execute()
# Remove all unverified sites - note the change: the original uses 'url' and not 'siteUrl'
verified_sites_urls = [s['siteUrl'] for s in site_list['siteEntry'] if s['permissionLevel'] != 'siteUnverifiedUser']
# Printing the urls of all sites you are verified for.
for site_url in verified_sites_urls:
    print site_url
    # Retrieve list of sitemaps submitted
    sitemaps = webmasters_service.sitemaps().list(siteUrl=site_url).execute()
    if 'sitemap' in sitemaps:
        sitemap_urls = [s['path'] for s in sitemaps['sitemap']]
        print " " + "\n ".join(sitemap_urls)
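The 'siteUrl' fix can be checked offline, without going through the OAuth flow, by running the same filter over a hand-made dict. This is only a sketch: the sample data and the helper name verified_site_urls are made up for illustration, and the dict is assumed to have the same shape as the real response from sites().list().execute(). It is written so that it runs under Python 2 or 3.

```python
# Hypothetical response, shaped like the dict the script reads from
# sites().list().execute(). The site URLs are invented for illustration.
sample_site_list = {
    'siteEntry': [
        {'siteUrl': 'http://www.example.com/', 'permissionLevel': 'siteOwner'},
        {'siteUrl': 'http://blog.example.com/', 'permissionLevel': 'siteUnverifiedUser'},
        {'siteUrl': 'http://shop.example.com/', 'permissionLevel': 'siteFullUser'},
    ]
}


def verified_site_urls(site_list):
    """Return the siteUrl of every entry the user is verified for."""
    # .get() is a defensive assumption: it avoids a KeyError if the
    # response has no 'siteEntry' key at all.
    return [s['siteUrl'] for s in site_list.get('siteEntry', [])
            if s['permissionLevel'] != 'siteUnverifiedUser']


print(verified_site_urls(sample_site_list))
```

Using 'url' instead of 'siteUrl' in the list comprehension raises a KeyError on this data, which is the bug in the original example.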
I finally bit the bullet and bought some new hardware with the aim of setting up a small home lab for learning Hadoop. I set up a small 3-machine cluster whilst at RBI (part of Reed Elsevier) to have a look at ML and AI, and as I now have the time I was interested in learning Hadoop, especially as in the past I worked on a billing system that used the map-reduce programming paradigm on PR1ME 750s. (Don’t knock it: back then, 17 750s running together was not to be sniffed at.)
There are lots of sites dedicated to building home labs for VMware/vSphere and Cisco but none for Hadoop, so I thought I ought to document the process here.
After debating finishing off my Antec 900 build and running on that using VMware Player, I decided to use a popular microserver from HP as they were cheap: around £120 for the base machine, and there is a £50 rebate until the end of December.
Out of the box it can take 4 SATA drives, and with tweaks it can take two more in the top 5.25 inch bay: a 3.5 inch and a 2.5 inch drive in a suitable adapter.
The main aim is to run a small lab cluster of virtual Hadoop machines so that I can get to grips with Hadoop and MR, but also so I can get my hands dirty (instead of just using a single VM), understand how it all works under the hood, and play with some of the newer Hadoop features such as NameNode high availability (NN HA).
Yes, I know that Hadoop’s performance under virtual machines isn’t going to be very good; this is primarily a learning environment, not one you would use for doing useful work.
I have the machine booted and, after a struggle, managed to create a bootable ESXi USB stick. I need to buy a bigger USB stick as I want to install the hypervisor on that, leaving all the disks free for VMs.
The next step is to buy more memory to take the system up to its max of 16GB, and some more DASD (disk drives for you civilians).
Found an interesting problem with XML sitemaps and the way Google seems to handle the lastmod time. The sitemap protocol uses the W3C Datetime format as its standard. I thought I would write this up and put it out there.
I was seeing errors in GWT for a recently updated sitemap on one of our sites. GWT was complaining about errors in date-time formatting. If we look at the code (original site redacted)
This all looks fine and validates with both W3C and xmllint. The only thing I could see that looked “hinky” is the use of fractional seconds: this sitemap index uses last modified dates with fractional seconds to 6 decimal places, i.e. to the nearest millionth of a second.
This is perfectly valid. Unfortunately, reading the small print of the time standard you’re supposed to use for the last modified date, the fractional second part is not strictly defined:
“This profile does not specify how many digits may be used to represent the decimal fraction of a second. An adopting standard that permits fractions of a second must specify both the minimum number of digits (a number greater than or equal to one) and the maximum number of digits (the maximum may be stated to be “unlimited”).”
Unfortunately the sitemap protocol does not define the number of digits used for fractional seconds, and Google is reporting invalid date errors when it parses the date time. In fact it doesn’t seem to like fractional seconds at all, even with fewer digits.
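If you control how the sitemap is generated, one possible workaround is to drop the fractional part before writing the lastmod value, since whole seconds are unambiguous under the W3C Datetime profile. The sketch below is an assumption about what a fix might look like and has not been confirmed against what Google accepts. The function name and the sample timestamps are made up for illustration.

```python
import re

# Matches the fractional-second part of a W3C Datetime value,
# e.g. the ".123456" in "2013-11-05T10:15:30.123456+00:00".
_FRACTION = re.compile(r'(T\d{2}:\d{2}:\d{2})\.\d+')


def strip_fractional_seconds(lastmod):
    """Return lastmod with any fractional seconds removed.

    Values without a fractional part (including date-only values)
    are returned unchanged.
    """
    return _FRACTION.sub(r'\1', lastmod)


# Made-up sample values covering the three cases.
print(strip_fractional_seconds('2013-11-05T10:15:30.123456+00:00'))
print(strip_fractional_seconds('2013-11-05T10:15:30Z'))
print(strip_fractional_seconds('2013-11-05'))
```

The timezone designator is left untouched, so the result is still a complete date-time value.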
This is an example of sloppy work in defining the sitemap standard and, to a lesser extent, on Google’s part in not handling an unlimited number of digits, which, as the sitemap standard is silent, is the obvious fail-safe assumption they should have made.
I do wonder how some of today’s crop of PFPs (Pimply Faced Programmers) would cope with full-on seven-layer OSI-based systems. I wonder if Vint has any old OSI blue books to act as an example of how to write more stringent standards. (And yes, I am aware of the pitfalls of the OSI model of standards making.)
I work for RBI in the in-house SEO team and we recently received an email that was part of a link building campaign. They had obviously scraped one of our large sites, ICIS (a big chemical and energy industry site), and found that we had linked to a university article on biofuels which had subsequently been taken down.
They then produced a page with similar information to the now 404 page and suggested that we could replace this broken link with their link which was on a car insurance site :-)
From: XXXXX XXXXXX [mailto:XXXX.XXXXX17@gmail.com]
Sent: 18 July 2012 12:43
To:
Subject: Broken link on your page

Hi ,

I came across your website and wanted to notify you about a broken link on your page in case you weren't aware of it. The link on http://www.icis.com/blogs/biofuels/archives/biodiesel which links to http://www.example.edu/p2/biodiesel/article_alge.html is no longer working. I've included a link to a useful page on biodiesel that you could replace the broken link with if you're interested in updating your site. Thanks for providing a great resource!

Link: http://www.example.org/algae-solutions

Best,
XXXX
Certainly this link builder wins an award for sheer chutzpah. This approach to link building is similar to a spear phishing attack, where an email is targeted directly at a specific individual and tailored to the interests of the victim/mark, which is why I dubbed this a Spearlink attack.
Unfortunately for them, the recipient immediately realised that this looked dodgy. Against a government site, or one with less savvy staff, this attack could easily have succeeded.
It’s also interesting that they used a blog and not an article on the main site; maybe the RSS feed for the MT blog was used rather than a crawl of the main site.
Penguin 24th April
Google’s latest update, the Penguin update, launched on April 24. It was a change to Google’s search results that was designed to remove pages that have been spamming Google. Spamming in this case is where people do things like “keyword stuffing”, “hiding text” or “cloaking” that violate Google’s guidelines.
Panda 3.5 19th April
On the 19th an update of the Panda algorithm was launched. Panda is an algorithm designed to promote higher quality pages over lower quality sites.
Parked Domains Problem April 17th
Google also made a rare admission that it made a mistake – on the 17th April they had a problem that was incorrectly identifying sites as parked domains. A parked domain is one that you own but has no content apart from a holding page.
This is an executive summary of a longer post at Search Engine Land.
Whilst perusing the Guardian’s job section to analyse the platform they use, I found a number of ways to completely mess up that entire section of the site, and I also found what must be the strangest AdSense advert of all time.
Having managed to create arbitrary pages on the job site I took a look at the AdSense served up at the base of the page, which is shown here (note the faked page I created was IT related).
Though I must say holding SEO audits in the style of the “Congregation for the Doctrine of the Faith” does appeal sometimes, especially when one comes across pages whose markup could be best described as “You’re aving a laugh mate”. Though I suspect that HR might whinge if we took people down to the basement for the “shewing of the instruments”.