Thursday, March 20, 2008

SES New York: Andrew Tomkins, Chief Scientist at Yahoo

I’m sitting next to Lisa Barone and was going to try a live blogging race against her. Sadly, the wi-fi is out and so we’ve agreed to draw. At least, I suggested that while Feeder blasted out from the speakers, The Lisa nodded, and so I’m claiming the draw! Woot!

Instead I’ll bash this into Word and copy’n’paste over.

Andrew begins by offering us a detailed walkthrough of Yahoo’s vision of the next generation of search.

  • Internet firmly moved from a curiosity to a substrate for life
  • Content growing, changing, diversifying, fragmenting
  • Search evolving in response
  • Value migrating to ecosystem
  • Semantics of content unlocking the value in the ecosystem

I don’t even know what he means by ‘substrate for life’.

No one ever goes online just to search, he says. Wrong, I think, as I do! But Andrew explains that search is a tool that people use to get what they want.

His example begins with someone coming online to book a holiday in Tuscany. They start by searching Google! They hadn’t heard of Yahoo Search yet, he says. Hee. (I said ‘hee’ – I wonder if that’s the Lisa Barone effect). Don’t worry... the searcher winds up at Yahoo Search eventually and develops an addiction to Italian coffee.

Now we’ve got the searcher looking for information on how to make those glorious espressos. Oh no! The searcher really can’t make a decision. He has no price confidence but finally, after checking a price aggregator, makes a decision and a purchase. Oh no! Now he has a limescale problem... and starts searching again.

Boy! I’m never going on holiday with this sample searcher. What a nervous ninny!

Yahoo thinks today we’re seeing
  • Increasing migration of content online
  • New forms of media available online
  • Something I was too slow to write down

Things to notice

  • Long-running user goals
  • Search as a hub:
    • Start there
    • Return for resource discovery and at task boundaries
    • Traverse the web broadly to complete task
  • Web service integrated into the task

Gosh! I’m also sitting next to Eric from Stone Temple. Wozah!

Yahoo mentions substrate – should look this up. Oh! Perhaps I’ll twitter it.

How much content is produced each and every day?

  • Published Content: 3-4GB
  • Professional web content: ~2GB
  • User generated content: ~8-10GB
  • Private text content: ~3TB
  • Upper bound on typed content: 700TB

Users began to dominate content creation in terms of quantity five years ago.

Private Text content includes things like emails, IMs and intranet content. Upper bound on typed content is all the stuff that people type every day – you know, at work. Yahoo notes that we’re therefore miles away from getting close to that amount online.
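To put "miles away" into numbers, here is a rough back-of-envelope sketch in Python. It assumes the slide's figures are gigabytes and terabytes per day, that 1 TB is 1,000 GB, and that the three public categories can simply be added together. None of that was spelled out in the talk, and the variable and function names are my own.

```python
# Daily content figures as jotted down from the slide, as (low, high) in GB.
# Assumption: the slide's units are gigabytes/terabytes, with 1 TB = 1000 GB.
GB_PER_TB = 1000

public_content_gb = {
    "published": (3, 4),
    "professional_web": (2, 2),
    "user_generated": (8, 10),
}
private_text_gb = 3 * GB_PER_TB
typed_upper_bound_gb = 700 * GB_PER_TB

# Assumption: the public categories do not overlap, so they can be summed.
public_low = sum(low for low, _ in public_content_gb.values())
public_high = sum(high for _, high in public_content_gb.values())


def times_bigger(total_gb, part_gb):
    """How many times larger total_gb is than part_gb."""
    return total_gb / part_gb


if __name__ == "__main__":
    print(f"Public content per day: {public_low}-{public_high} GB")
    print(
        "Typed upper bound is at least "
        f"{times_bigger(typed_upper_bound_gb, public_high):,.0f}x that"
    )
    print(
        "Private text is at least "
        f"{times_bigger(private_text_gb, public_high):,.1f}x that"
    )
```

Even taking the generous end of the public figures, everything typed in a day dwarfs what reaches the public web by several orders of magnitude, which I take to be Andrew's point.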

How much 'meta data' is produced each day?

  • Anchor text: 100M
  • Tags: 40M
  • Pageviews: 180GB
  • Reviews: 10MB

Anchor texts have been the most important signal in search for 10 years. Tags aren’t likely to change the nature of search because Yahoo expect the data amounts to plateau. By tags he means as on YouTube and delicious; not meta data tags.

The big one will be Pageviews – Toolbars are used to collect trails.

  • Content consumption is fragmenting – nobody owns more than 10% of worldwide pageviews (PVs). Yahoo has the most
  • No single place will own all the content.
  • Best of breed processing will operate on the web version (?)
  • Value transitions to ecosystem

Yahoo’s mocking me by showing slides too complex to blog about. He’s talking about content consumption and how it’s fragmented. They’ve... er, scraped(?) LiveJournal interests and matched them against ages. I’m glad I censored all my LiveJournal interests. The over 57s are interested in death, cheese and cats.

Argh! Facebook slide. Andrew has shared his cell phone number with everyone in the MIT group, which is 22,504 users. Woah. I wonder if he gets many calls. Andrew’s point is that we’re not used to this level of access control and as we become more aware of this we’ll see stress and tension on the infrastructure.

We’re used to reading whole web pages but now, with AJAX, we’re used to a more fragmented experience. He draws the parallel with the “choose your own adventure” concept. Woot. I wonder if he’s a fellow gamer.

Now we’re looking at the search interface... and understanding that the number of publishers has increased hugely.

  • Few changes through 2005
  • Entering period of massive change to handle more complex content
  • Rich media, aggregation, simple task analysis, etc
  • Moving beyond the stateless query/response paradigm
  • Personalization theory

Although the web grew hugely to begin with, the paradigm stayed the same... until now. He’s cautious of the term ‘personalization’ as it’s easy to wreck with rogue data.

Andrew shows a Yahoo slide (I remember what Yahoo looks like – I still go there) and the Yahoo Assist layer. He’s showing someone searching for the movie “the game plan”. What does the searcher want? Do they want show times, trailers, reviews or something else? The top of the Yahoo search space is used to aggregate the ambiguous tasks to try and answer those questions. It’s not just Yahoo who’ve been trying these things... he shows a Microsoft example (good move; keep Microsoft sweet) and then shows Google’s flight search.

  • Structured databases power a vast majority of pages on the web
    • Certainly ecommerce catalogs
    • But also user-generated content (e.g. blogs)
  • Content owners open to exposing structure, but don’t see how and why
  • Microformats adoption at an all-time high
  • Yet, it’s produced much more...
  • Waaah... he’s going too fast.

The Killer App

  • Wide-ranging support for semantic web standards
  • Vocabulary to surface structure and semantics
  • Community tools to evolve standards and vocabulary

What is the Killer App? Search

  • Publishers and search engines collaborate
  • Users see richer search experience
  • Accomplish their tasks faster and more effectively

Ha-ah? Want an example; let’s look at the enhanced Yelp results (I wonder if he also eats cheese steaks?). The site is similar in presentation; a mashup of images, links and textual advice. In fact there are loads; New York Times, Gawker, others...

Andrew reckons the LinkedIn example is suggestive of what might happen with ‘people search’ in the future. LinkedIn is a Yahoo partner. Didn’t know that...

  • Microformats
    • hCard, hEvent, hReview, hAtom, XFN
    • More as they get adopted
  • RDFa and eRDF markup
  • OpenSearch
    • +extensions to return structured data
  • Atom/RSS feeds
    • +extensions to embed structured data
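For anyone who hasn't met microformats: hCard, the first on that list, is just ordinary HTML with agreed class names (`vcard`, `fn`, `org`, `tel`, `url`) that a crawler can pick out. Below is a minimal sketch in Python of how such markup could be read, using only the standard library. The sample person, the `HCardParser` class and the `parse_hcards` helper are all invented for illustration; this is not how Yahoo does it, and it only handles a handful of hCard properties.

```python
from html.parser import HTMLParser

# A made-up hCard: plain HTML whose class names follow the hCard microformat.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Search Ltd</span>,
  <span class="tel">+1-555-0100</span>
  <a class="url" href="http://example.com/">home page</a>
</div>
"""

# Only a small subset of hCard properties, chosen for this illustration.
HCARD_PROPERTIES = {"fn", "org", "tel", "url"}
# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardParser(HTMLParser):
    """Collects a few hCard properties from every class="vcard" element."""

    def __init__(self):
        super().__init__()
        self.cards = []          # one dict per vcard found
        self._depth = 0          # nesting depth inside the current vcard
        self._current = None     # property whose text is being collected
        self._current_depth = 0  # depth at which that property element opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        if self._depth == 0:
            # Outside any vcard: only a vcard root is interesting.
            if "vcard" in classes:
                self.cards.append({})
                self._depth = 1
            return
        self._depth += 1
        matched = classes & HCARD_PROPERTIES
        if not matched:
            return
        name = sorted(matched)[0]
        if name == "url" and attr_map.get("href"):
            # For links, the value is the href rather than the link text.
            self.cards[-1]["url"] = attr_map["href"]
        else:
            self._current = name
            self._current_depth = self._depth
            self.cards[-1].setdefault(name, "")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        if self._current and self._depth == self._current_depth:
            self._current = None
        self._depth -= 1

    def handle_data(self, data):
        if self._depth and self._current:
            self.cards[-1][self._current] += data


def parse_hcards(html):
    """Return a list of dicts, one per hCard found in the given HTML."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in card.items()} for card in parser.cards]


if __name__ == "__main__":
    for card in parse_hcards(SAMPLE_HTML):
        print(card)
```

The point of the sketch is that the publisher changes nothing visible on the page; the extra class names are all a search engine needs to lift a name, company, phone number and link out as structured fields.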

Yahoo thinks this is the future; using microformats to present structured meta data about content to search engines. So what do we put in this data set?

  • dataRSS provides a common framework for embedding structured data
    • Use with RDFa, eRDF or OpenSearch
    • Preferred Vocabulary includes
      • Atom, Dublin Core
      • Creative Commons
      • FOAF, GeoRSS...
      • He’s going too fast again...

Yahoo will be announcing a set of tools and wants people to work together to agree on standards (and not let them be dominated by ‘one’ search engine. Don’t know who that could be.)

  • Yahoo! Open search platform does not modify ranking
  • Richer abstracts may provide more information to users and draw higher quality/quantity of clicks
  • We want rich abstracts that give users a better experience
    • We don’t want misleading abstracts

So Yahoo are really announcing a new form of SEO where content owners try and get shiny and attractive abstracts into the SERPs which attract clicks. Yahoo would like to come to an agreement with content owners on what makes a good abstract. When that agreement is in place, they’ll work together to keep the good stuff in and the bad stuff out.

In fact, anyone on the web can make use of a self-service model to upload their abstracts and then anyone interested in this can subscribe to those recommendations. Ah-ah; this sounds like Google’s Subscription Links offering.

Let’s look at the ‘whole story’

  • User needs becoming more complex
  • Content growing, changing, diversifying, fragmenting
  • Search responding by increase in sophistication
  • Value migrating to ecosystem
  • Unlock the value by enabling interoperability – expose semantics

Is the HTTP and HTML system the right model any more? Yahoo thinks... maybe not. Something more complex may be needed. As a result the value is moving to the ‘ecosystem’, and so the ‘quick win’ here is to expose the data that content publishers may have locked away but which could be presented to the search engine.

Wow! What a long write-up. I think that was the best keynote yet. Andrew shared a lot of ideas and I’ve a really good vision of what Yahoo is up to now.