Tuesday, May 09, 2006

Google Patents, Google Caches and Tree Obscured Woods

Sometimes you can't see the wood for the trees. I was in the middle of a presentation, one where I stress the need for a page to stay on topic and build up a good history in Google, when I was reminded of a recent Matt Cutts post on Google's proxy unit.

If Blogger's crap image upload works and re-sizes this, the following image is a picture from what's widely known as Google's temporal patent application.



Notice how the History Unit can sometimes sit between the crawler (document locator) and the web (corpus), and how it also sits between the web front and the index.

Compare that then to Matt's own diagram of the cache.



Here the Cache has almost the very same position.

As a matter of fact, it would make a lot of sense to combine the role of the Google Cache and the role of the Google History unit. You would keep several cached copies showing how the web has evolved. The advantage of this approach is that you're keeping a copy of how the web page actually was, and as you improve your algorithm you'll be able to review how the page would have scored. In other words, you can apply the new algorithm to the old site.
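To make the idea concrete, here is a minimal sketch of a history store that keeps full copies of a page and lets you re-score them later. Everything in it (the class names, the `rescore` method, the two toy scoring functions) is my own invention for illustration; nothing here comes from the patent application or from Google.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Snapshot:
    fetched: str  # date the copy was taken, e.g. "2006-05-09"
    html: str     # the page exactly as it was on that date


@dataclass
class SnapshotHistory:
    """Hypothetical history unit that doubles as a cache of old copies."""
    pages: Dict[str, List[Snapshot]] = field(default_factory=dict)

    def record(self, url: str, fetched: str, html: str) -> None:
        # Append a new copy; older copies are never overwritten.
        self.pages.setdefault(url, []).append(Snapshot(fetched, html))

    def rescore(self, url: str, scorer: Callable[[str], float]) -> List[Tuple[str, float]]:
        # Apply any scoring function, old or new, to every stored copy.
        return [(s.fetched, scorer(s.html)) for s in self.pages.get(url, [])]


def old_scorer(html: str) -> float:
    # Toy "old algorithm": rewards every repetition of the keyword.
    return float(html.lower().count("widgets"))


def new_scorer(html: str) -> float:
    # Toy "new algorithm": stops rewarding repetition after three mentions.
    return float(min(html.lower().count("widgets"), 3))


history = SnapshotHistory()
history.record("example.com", "2005-01-01", "widgets widgets widgets widgets widgets")
history.record("example.com", "2006-01-01", "Widgets for sale")

old_scores = history.rescore("example.com", old_scorer)
new_scores = history.rescore("example.com", new_scorer)
```

Because the raw copies are kept, the improved scorer can be run over the 2005 version of the page even though it was written long after that copy was taken.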

The alternative is to use the History Unit as the place to store the algorithm/indexer's interpretation of the web page, i.e. record which keywords it does well on and its structural elements. The advantage here is that you'll use a lot less space.
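A rough sketch of this second approach, again with made-up names (`PageSummary`, `summarise`) and a deliberately crude idea of what "keywords" and "structural elements" mean: the page is boiled down to a few counts and the HTML itself is thrown away.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PageSummary:
    fetched: str                         # date the page was crawled
    top_keywords: List[Tuple[str, int]]  # most frequent words and their counts
    structure: Dict[str, int]            # counts of a few structural elements


def summarise(fetched: str, html: str, top_n: int = 3) -> PageSummary:
    # Strip tags crudely, then count the remaining words.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    words = re.findall(r"[a-z]+", text)
    # Record a couple of structural features (opening tags only).
    structure = {
        "h1": len(re.findall(r"<h1\b", html, flags=re.I)),
        "links": len(re.findall(r"<a\b", html, flags=re.I)),
    }
    return PageSummary(fetched, Counter(words).most_common(top_n), structure)


page = "<h1>Blue widgets</h1><p>Cheap widgets and <a href='/x'>more widgets</a></p>"
summary = summarise("2006-05-09", page, top_n=1)
```

The trade-off is the one described above: the summary is much smaller than the page, but once the HTML is gone a new algorithm can only work with whatever the old indexer chose to record.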

Google could even do both as I suspect space is not a concern (despite a recent Register diatribe which mulls otherwise).
