Monday, November 28, 2005

WebmasterWorld Grumpy Bear

I'm being a grumpy bear about the WebmasterWorld anti-robots/anti-search situation.

WebmasterWorld is a large and popular forum where webmasters can discuss site issues. Important sections on the site are the SEO and search engine threads.

The forum has recently banned search engines from indexing it and, as a result, it has been dumped from Google and the others.

Why would a forum which needs its web traffic do this? We're told that they had to. I don't like that one bit - it sends an anti-search and anti-SEO message.

WebmasterWorld is highly respected. I would say many web savvy clients check out WebmasterWorld (and, of course, many small SEO firms camp out there). I personally believe the quality of the forum has nose dived; it's now full of newbies who sound like a broken record, asking the same questions again and again, and it's peppered with veterans who no longer care to respond to the broken record.

This isn't the search engine problem, though. We're told that WebmasterWorld is plagued by "bad bots". These are automated user agents (like search engine spiders) which crawl the site and take the content. There is so much of this activity that the forum's web servers struggle to keep up.

WebmasterWorld's response to these bad bots is to block all bots via robots.txt - the "robots exclusion protocol".

This is foolish. Bad bots do not obey this protocol. This change will only affect good bots.
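
For reference, a blanket ban like this takes just two lines of robots.txt - and, by design, only well-behaved crawlers will ever honour them:

User-agent: *
Disallow: /

A bad bot simply never requests robots.txt, or requests it and ignores it.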

At WebmasterWorld they've made other attempts to block the bad bots - they've changed the site so that you must now log in before you can view any real data. That's the draconian tactic to take if everything else fails. The robots.txt change is just stupid.

Normally, I couldn't care less. WebmasterWorld can shoot themselves in the foot if they wish... except the site has an SEO profile that puts it in the limelight somewhat. I believe the site's actions give out the wrong message.

It is possible to deal with bad bots without blocking your site from search engines. It is. Furthermore, blocking your site from search engines does not stop bad bots (though it does make the site harder to find!)

We're told that WebmasterWorld has tried a whole host of complicated and thorough defences and that they all failed. I don't get it. If the forum had those resources to hand then it certainly has the resources to re-design the site to be less of a bad bot target. For example, at times of heavy load the site could ask for user input - a captcha or a question. I came up with that crazy theory in about 2 seconds of thought.
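
To be clear about what I mean, here's a minimal sketch of that two-second idea in PHP. The load threshold, the question and the reliance on /proc/loadavg are all invented for illustration; a real version would verify the answer and remember it in a session (ideally with a proper captcha):

<?php
// Read the 1-minute load average (Linux-specific shortcut).
$load = (float) strtok(file_get_contents('/proc/loadavg'), ' ');
if ($load > 10 && empty($_POST['answer'])) {
    // Server is busy: make the visitor prove they're human before serving content.
    echo '<form method="post">';
    echo 'The site is busy. What colour is an orange? ';
    echo '<input name="answer"> <input type="submit" value="Continue">';
    echo '</form>';
    exit;
}
// ...normal page generation continues here...
?>

Automated scrapers hammer on regardless and get the form; a human answers and carries on.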

Newspaper sites are scraped heavily by bad bots, yet they deal with the very issues WebmasterWorld has caved in on.

WebmasterWorld have cited impressive figures to support their scraping problem - but I've never found a single scraped entry from WebmasterWorld. Do thousands of people really download the site for their own personal desktop edition? An edition which would require a desktop search tool (rather than Google) to search, and which would need to be kept up to date? Meh.

We're told this is a test. I predict we'll find that WebmasterWorld no longer has the pull to attract the SEO experts (especially after this), so I don't think it'll become an exclusive oasis. I suspect we'll see it back in the search engines once the robots.txt change is retracted. I imagine there will be complaints that the robots.txt rules should be changed (all a red herring, since bad bots ignore the file anyway).

Saturday, November 26, 2005

Advertise on this Site

Last night I mucked around with the new "Advertise on this Site" option in AdSense. I did so while wearing the webmaster hat for my main hobby site. The "Advertise on this Site" link doesn't ever really appear there, since I have one publisher who already cleverly targets image banners there for me.

As already noted by the AdSense community, the "Advertise on this Site" option is horribly bare bones. We're encouraged not to set up separate AdSense accounts for every site but to keep them all together under one (this blog, for example, is part of the same account).

The "this Site" concept is broken. Most of us have a number of sites. If you click on the "Advertise on this Site" option on Search Commands then you get the page I designed for GameWyrd. It's odd Google let the feature go live when it is badly targeted.

I was frustrated by the lack of space too. I wanted to explain on my page how GameWyrd uses AdSense. I prefer to sell cheap banners to community publishers, as that's what I want to appear on the site; Google AdSense fills any gaps I have in the banner rotation. There's simply not enough character space to say this.

I'm really far from impressed with the feature so far.

Friday, November 25, 2005

Custom error pages, 404 header response and Google sitemap XML

A custom 404 page is important. You don't want to lose visitors off your site, and a user-friendly error page - one which quickly re-empowers the user - is key here.

It's also key to ensure that your "page not found" message actually returns the 404 header response. At the end of 2004 in Vegas, Yahoo banged on and on about this in a presentation to SEOs and webmasters.

The 404 response is also key if you wish to make use of the reporting capabilities of Google's sitemap XML project.

The issue is that if you use Apache and an ErrorDocument command which points at a full URL then you'll wind up with a 302 response - and that might actually be correct, as the server really is redirecting the user agent to the custom 404 page.

An .htaccess file might look like:

ErrorDocument 404 http://www.example.com/404e.php
The killer catch is that, with this set-up, it's not possible to tweak Apache or use PHP to get requests to non-pages (www.example.com/no-page/ for example) to return the 404 header directly.

The good news is that there is a compromise which Google accepts - and this compromise is good enough to get your sitemap XML verification file accepted.

With PHP you can have your custom error page issue a 404 header before it outputs any HTML. That's fairly easy - put

header("HTTP/1.0 404 Page not found");

at the start of the page in a chunk of PHP.
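
Fleshed out, a minimal 404e.php might look like this (the markup and wording are just placeholders):

<?php
// The 404 status must be sent before any output at all.
header("HTTP/1.0 404 Page not found");
?>
<html>
<head><title>Page not found</title></head>
<body>
<h1>Sorry, that page doesn't exist</h1>
<p>Try the <a href="/">home page</a>, or the site search.</p>
</body>
</html>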

If you examine the headers of the custom error page (404e.php in my example) then it does indeed return the 404 status.
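
You can do that examination from PHP itself: get_headers() (PHP 5 and up) fetches just the response headers, and curl -I does the same job from the command line.

<?php
// Fetch the response headers for the error page and print the status line.
$headers = get_headers('http://www.example.com/404e.php');
echo $headers[0]; // e.g. "HTTP/1.0 404 Page not found"
?>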

Google follows the 302 to the custom error page and then checks the header of that error page. This is really what you'd expect Google to do, as it's the only way to deal with two or more redirects in a chain.

This means that wikis and sites optimised to have SEO friendly URL structures can still take part in the rather handy sitemap XML project.
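
If you've not seen one, a bare-bones sitemap file is tiny - something like this (I'm quoting the 0.84 schema namespace from memory, so do check Google's documentation for the current version):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>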

Now if Google would just set up the vanity URL http://sitemap.google.com for the lazybones among us - I'd be happy.

Thursday, November 24, 2005

Google

It's all Google, isn't it?

Google's just about noticed this new domain. The home page URL and the main blog index URL are in the index but the pages themselves have not been spidered.

Google Analytics
I was lucky enough to have a play with Google Analytics before it was released to the public. I thought at the time they had under-estimated how huge the impact would be. Google have just killed the cream of a previously cash-rich industry. Who will pay for unsupported web analytics now?

Google Analytics' weakness is that it's entirely off the shelf and there's no support at all. If you're the type of webmaster or marketing manager who needs to have tracking set up "just so" then, even though it's free, Google Analytics may not be for you.

Google Analytics doesn't offer any bid management either. The system can take a peek at your AdWords costs and help calculate ROI, but that's it.

Although the likes of Web Trends and Hitbox / WebSideStory are in big trouble, "best of breed" Coremetrics is fine, and bidding gurus like HitDynamics will be safe if they can push their product on.

Google Base
I like the idea of Google Base and I see its potential. The carrot being waved at us is that what goes into Google Base might end up in Froogle or Google's main index. That's a pretty cheap and transparent carrot.

What Google Base lacks is the sense of community. I remember when Ciao and Dooyoo added the community features for the first time to their sites; it all took off then. I don't mean the advanced community features; just the ability to comment on other people's reviews.

I suspect we'll see more updates for Google Base. The URL http://base.google.co.uk is currently a blank page (rather than an error) so I suspect we'll see something there too. The original guidelines in Google Base insisted that any products added to the database must be for sale in the States and priced in dollars. That's just insane. That's anti-Google Base, in fact. We have Froogle US for that.

Click to Call
I think this is the most exciting PPC twist in a while (including the new contextual bidding rules). I pity those SEO agencies who are contracted into long-term deals with the minnows of pay-per-call technology, as they'll be fairly hamstrung now.

Google takes on the cost of the call. This could be interesting. This is also why, I guess, Google's been so interested in buying up dark fibre.

Monday, November 14, 2005

Jagger Summary

I've not seen anyone provide much of a summary of the Jagger changes. I'll cut all the interesting bits out of a summary I produced for a client and offer up bare bullet points here.

Jagger1

  • Anti-spam improvements: Google’s sought to reduce the number of spammy sites in search results by formulating more thorough and more accurate automated means of spotting hidden text (text hidden in layers was particularly targeted). Google also improved its ability to spot doorway pages (pages full of keywords which serve no other purpose than to link to another page, often on a different domain).
  • PageRank update: The PageRank meter on the Google toolbar was updated.
  • Links update: The number of backlinks shown by the “link:” search command was updated.
  • Changes to Chinese and European search results: Google tweaked its geographical filtering. Most of the sites we work with and which are hosted in Europe saw fairly substantial gains on Google’s .com results. This suggests that the weight against sites hosted outside the US was reduced. Google now has more faith in other algorithmic means to determine a web site’s target audience.
  • In this period we saw high traffic sites which have been online for four or more years improve their search results relative to younger or lower traffic sites.

Jagger2

  • Ranking algorithm change – deep links increase in importance: Links from external web sites to pages beyond your home page are now important. In this time our new monitoring stations gained PageRank 1 or 2 on their home pages, and those we test with deep links earned PageRank 4 or 5 on their popular deep-linked pages.
  • Ranking algorithm change – keywords in domain names: Having a keyword rich domain name became more beneficial as the weight against keywords in domains was reduced.
  • Ranking algorithm change – hyphens in domain names: The weight against multiple hyphens in domain names increased.
  • Ranking algorithm change – measured traffic: Google has an idea of the traffic a web site has by monitoring click throughs from Google’s own search results and through users with the Google toolbar and other Google software. Pages (not sites) with particularly low internal traffic (from users already on the site) dropped in rankings in favour of pages which users visit and find more often.
  • The Jagger2 update included the results of the first sweep of spam reports which followed Jagger1.

Jagger3

  • Index change – Canonical corrections: Google improved its understanding of which is the most appropriate URL for any given web page and when two or more URLs may actually refer to a single site (for example, http://example.com or http://www.example.com).
  • Ranking algorithm change – Freshness: Google changed the importance of a page being either fresh or stale. Certain keyword searches suit fresh pages more than stale pages (based on user behaviour) while other searches favour stale (not recently changed and old) pages over fresher (recently changed or added) pages.

Tuesday, November 08, 2005

Yahoo Vs Google Vs Your Tastes


Let there be no doubt that Microsoft is not the only company determined to take on Google. Yahoo certainly is too. This plaque for the Yahoo Mail team is controversial not only because it makes quite explicit who Yahoo has in its sights and how serious it is about winning, but also because it draws a parallel between Google and the Nazis. I don't think this was intentional. In fact, as a Brit I'm mildly surprised that the Americans used a British success as an inspirational model. I do think the photograph of the plaque shows just how "dangerous" search engines can be. Information takes on a very different meaning when it leaves a private space, loses its tone of voice and becomes public domain.

A lot of the debate the plaque has kicked up centres on the actual mail offerings from the two providers. Which is better? Right now I use Gmail because I can procmail most of my mail to it and have Google spam filter it for me before popping it back into Outlook. Bliss. Even with POPFile (and a well trained orange octopus at that) my mail sorting took ages. I've a paid-for Hotmail account too. Mind you, I was among the first (thousand) onto Hotmail and will keep the account (despite numerous spam backlashes). By many accounts the new Yahoo mail will be a strong contender too - but how many email addresses do I need?
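
For the curious, the procmail side of that is a one-rule affair - the address below is obviously a placeholder:

:0
! someone@gmail.com

Everything matched gets bounced along to Gmail, which does the spam filtering before POP brings it back down into Outlook.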

(And there's a Google tracking code in my link to POPFile because I was lazy: I searched for popfile and copied the link location straight from a right-click. You just have to assume that Google filters unexpected referrer information out of its analysis.)

One of the wins I believe the search engines can gain from offering great (therefore popular) mail systems is that of language and response analysis. It might be evil for Google to learn about you from the contents of your email. I suspect they'll not be that direct. However, what I believe Google will do is learn which email messages you mark as spam. That's a good way of finding rogue IP addresses or spam-advertised URLs. Google must measure the clickthrough and keyword matches for the AdWords which line the side of Gmail's web interface. Google may not read my email to learn that I spend my free time roleplaying, but it is sure to notice that my Gmail account generates clicks for roleplaying-related keywords.

I welcome personalisation. I really don't mind if Google's machines (Gmachines™) scan my email or watch my surfing habits. I don't do anything with the RPG programming language, for example, nor do I have an interest in Rocket Propelled Grenades. I want my RPG searches to turn up Role-Playing Games.

I suspect too many SEOrs are avoiding personalisation issues because it could be a pain for them. I believe personalisation is another offering for a good SEM firm to make. I do think the larger search marketing firms are at an advantage here. We can do the proper demographic research (and we do here). Crude reporting tools like WebPosition become pointless (or, at least, much less effective). I would expect more conversation on the forums and in the community about it, but I wonder if the "sandbox" of 2006 will be the one in which many SEOrs bury their heads and hide from personalisation. It's easier to talk about funky new email systems and PageRank updates.

Thursday, November 03, 2005

Google Trivia #439



I have an SEO blog, ergo I must post Google trivia. It's the law.

Google has expanded the range of options for the customised homepage (a carrot by which to lure people into Google Accounts and personalised searches). Finally those of us outside the US have a little more to play with.

I added Edinburgh's weather. Yeah. I must be a masochist. What's the difference between Edinburgh's weather in the UK and Edinburgh's weather in the US? The units change. We also seem to get an extra day of forecast in the UK.

There are some slight oddnesses too. The UK uses Celsius for temperature, true, but we're a mixed up lot and talk about miles per hour, not km per hour. Google gives wind speed in km/h. You have a degree of customisation by picking either Google.com or Google.co.uk as your home page (or Google.com.au as the case may be) but minute customisation isn't there. Yet.

Tuesday, November 01, 2005

Googling the Future

It was during a Samhain party that I bumped into a world expert on machine translation and machine learning. It'll come as little surprise to know that he's off Googlewards.

I don't feel I should blog all the insights I gleaned from this teacher-of-PhDs. As we discussed at the time, information will be key to everything. Information is currency. I was cautioned against doing all those five-minute surveys for 50p and the like, as all this information can be, will be, or may already be being carefully analysed - and if things go wrong it could be used against me.

Analysing lots of information is something computers already do well - especially Google's. Google has access to such a wealth of information that it could perform a trend analysis on virtually everything. People were worrying about AI long before we realised that so much information could be processed and weighed at once.

This particular soon-to-be-Googler had another challenge and another point of interest. He was interested in being able to pick up a seismically important piece of news at the first hint of its happening. By the time a piece of news is in the headlines it's old news. If Google News has stories saying Tesco is in trouble (fat chance) then it's too late to sell your shares. The key is being able to spot the information the very first time it appears on the Web (or, say, in Gmail or Google Talk) and become aware of the likely consequences. A plunge in the stockmarkets is one such example (combine world-beating stock exchange technology with Google's $6bn war chest), but the list goes on all the way up to that political rumour which sparks a war. My "Google contact" (not yet, must apply more beer) has done machine translation work for the military before. Translating from Chinese or Arabic pays well, but the real treasure is being able to translate on the fly and then sound the alarm when one call mentions "we'll cut the grass tonight" or some other obscure code phrase which the computer calculates is likely to be hugely significant.