Wednesday, September 24, 2008

Opening up advanced search syndication

[update: 24 Sept @ 12:07 PM : Google Chrome also supports OpenSearch]

[update: 24 Sept @ 11:45 AM : Through the OpenSearch group, I see MS IE 8 Beta supports OpenSearch so perhaps even more sites will realize more complex searches are needed]

As you may well be aware by now, I just launched an Alpha release of weblivz.com. What I tried to do there is (in many cases) write intelligent query sets against sites that provide their results as RSS or Atom feeds. So rather than just pulling in every feed we can find, we actually create a query such as "Europe and technology". It really isn’t easy and requires a lot of work that isn't visible up front. Here are the five issues I ran into:

1. Search, no feed
2. Query syntax
3. RESTful
4. The commercial clause
5. Semantics of response formats

I will provide one example in each section, but there are many others I have come across.

Search, no feed
In many cases you can get RSS or Atom feeds from static pages, but as soon as it comes to searching and gathering the results as a feed, you’re in trouble. One example is MeetUp.com.

I can do a bunch of querying to get certain feeds, but as soon as I want to do something such as "languages and Glasgow" I’m out of luck. In short, you get exactly what you want with most search queries on some of these excellent sites, but they only work when you are ON the site.

This misses the opportunity of syndicating the results to third parties so that they can point at YOUR site. You end up having to use the limited feeds available, and most of the time these aren’t much use to anyone – especially in the era of content overload and the increasing importance of giving users what they want.

Requirement
Ensure all the results from your search query can be syndicated as RSS or Atom. The extra queries against your server will be balanced by a higher profile and extra hits on your site.

Query Syntax
In short, query syntax is all over the place. In some cases you can only search for one term and in other cases you can’t use AND or OR. There is a real lack of support for doing interesting things – if you really want to customize the feeds, in most cases you are seriously limited by what you can achieve. One example is http://eventbrite.com.

If I want to search for Tech events in Glasgow or Edinburgh I really need to do 4 separate queries – "tech glasgow", "technology Glasgow", "technology Edinburgh" and "tech Edinburgh". This is a waste of resources all round for something that is relatively simple to achieve.
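To make the waste concrete, here is a minimal Python sketch of what a client ends up doing today compared with what a single boolean query could look like. The function names are mine, and the "(a OR b) AND (c OR d)" syntax is an assumption about what a site might accept, not something Eventbrite is known to support.

```python
from itertools import product


def expand_queries(topic_terms, place_terms):
    """Return one plain query per (topic, place) pair, as a client must do today."""
    return [f"{topic} {place}" for topic, place in product(topic_terms, place_terms)]


def boolean_query(topic_terms, place_terms):
    """Return the single query we would like to send instead (hypothetical syntax)."""
    topics = " OR ".join(topic_terms)
    places = " OR ".join(place_terms)
    return f"({topics}) AND ({places})"


topics = ["tech", "technology"]
places = ["Glasgow", "Edinburgh"]

queries = expand_queries(topics, places)
print(len(queries), "requests:", queries)
print("1 request:", boolean_query(topics, places))
```

Every extra synonym or city multiplies the number of requests, while the boolean form stays at one.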

In some cases typing "or" doesn’t mean the same as "OR", and in others typing "and" gets interpreted as part of the query rather than ANDing the terms.

Requirement
If we are not going to use http://www.OpenSearch.org then the site needs to at least provide a powerful search interface – perhaps more powerful than any individual is going to use, but in cases such as http://weblivz.com we are actively creating powerful queries against your backend database – saving everyone resources!
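For anyone who has not looked at OpenSearch, here is a rough Python sketch of how a client could read a description document and fill in its URL template. The description document, the example.com URLs and the helper names are made up for illustration; only the namespace and the {searchTerms} / {startPage?} template parameters come from the OpenSearch 1.1 specification.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

OS_NS = {"os": "http://a9.com/-/spec/opensearch/1.1/"}

# Hypothetical description document a site might publish.
DESCRIPTION = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example events</ShortName>
  <Description>Search example events</Description>
  <Url type="text/html" template="http://www.example.com/search?query={searchTerms}"/>
  <Url type="application/atom+xml"
       template="http://www.example.com/search.atom?query={searchTerms}&amp;page={startPage?}"/>
</OpenSearchDescription>"""


def feed_template(description_xml, mime="application/atom+xml"):
    """Return the URL template advertised for the given MIME type, or None."""
    root = ET.fromstring(description_xml)
    for url in root.findall("os:Url", OS_NS):
        if url.get("type") == mime:
            return url.get("template")
    return None


def fill_template(template, **params):
    """Substitute {name} and {name?} placeholders with URL-encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote_plus(str(params[name]))
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(name)

    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


template = feed_template(DESCRIPTION)
print(fill_template(template, searchTerms="tech Glasgow"))
```

The point is that the site describes its search once and any client can build feed URLs from that description without site-specific code.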

Give us a REST
For the majority of queries, a RESTful URL should be enough to get the results. Granted, many sites already support this, but others provide access only through an XML-based API that you have to POST to. That is good for more advanced queries, but sometimes you want to write something simple that returns results in a given format, without passing an extended collection of parameters.

Requirement

If I can’t type (something like):

http://www.yoursite.com/search.atom?query=tech+Glasgow

… then you really need to think about adding this functionality.
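From the client side, that is about as much code as I would like to write. A minimal Python sketch, assuming the hypothetical yoursite.com endpoint above and a "query" parameter:

```python
from urllib.parse import urlencode


def search_feed_url(base, terms, fmt="atom"):
    """Build a search feed URL of the form <base>/search.<fmt>?query=..."""
    path = f"{base.rstrip('/')}/search.{fmt}"
    return f"{path}?{urlencode({'query': ' '.join(terms)})}"


url = search_feed_url("http://www.yoursite.com", ["tech", "Glasgow"])
print(url)  # http://www.yoursite.com/search.atom?query=tech+Glasgow
```

Swapping fmt="rss" would ask for the same results as RSS, if the site chose to offer both.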

The commercial clause
Now this is one I simply DO NOT GET. Many sites say the data can’t be used in a commercial context, without really defining what that means. I have some sympathy for this when it is data you have collated and published – such as a postcode search facility or something... syndicating that would render visiting your site close to pointless.

However, when it comes to user generated content I just cannot understand it. Many sites allow searching and provide feeds, then insert a clause saying you cannot syndicate the data without a license. The point of these feeds, however, is to provide a "teaser" to bring people TO your site… so even in a commercial context, surely allowing your feeds to be displayed can ONLY work for you.

One example is http://www.mystrands.com: it’s got "commercial" all over it (it also doesn’t have RESTful URL access to feeds). So do we write YET ANOTHER MyStrands, or do they just provide intelligent syndicated search feeds we can all use to drive business to both our site and theirs?

Requirement
Remove this kind of clause.

Semantics of response formats
This particular part almost drove me to distraction. We now have RSS and Atom as the key formats of feeds – sure there are some variations in versions but we are pretty close to two general formats.

So where is the problem with this? Well, the problem is twofold. The first part is the differing interpretations of what goes into each field, and the second is the extensions used within the feeds and the variation in how these are semantically interpreted.

The first of these is particularly an issue with "content" and "summary". Some people put in a short description, others put in formatted HTML. Some don’t include a summary and only add content, so you need to parse this somehow if you want to display some kind of a summary.

In addition to this you may find some sites (such as FriendFeed) provide much of the information that should be in atom fields (such as the author) embedded within the content so you would need to parse that to give any kind of standard view.
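As an illustration of the kind of workaround this forces, here is a minimal Python sketch that falls back to the content when the summary is missing, strips any HTML and truncates the result. The function names and the 80-character limit are my own choices for the example, not anything from the RSS or Atom specifications.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


def entry_summary(summary, content, limit=80):
    """Prefer the summary, fall back to the content, and cut at a word boundary."""
    text = strip_html(summary or content or "")
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0] + "..."


print(entry_summary(None, "<p>Tech <b>meetup</b> in Glasgow</p>"))
```

Pulling an author out of the content, as needed with some feeds, is much harder because there is no markup convention to rely on.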

Now, the extensions are altogether more of an issue. Just try combining some of the feeds using the xmlns:media (http://search.yahoo.com/mrss/ ) namespace. Sometimes the link is in the atom elements, sometimes it’s in the media player element; sometimes the author is in the media credit and sometimes it’s in the atom author field.

You need to parse some of these to death just to get a standard output – in fields where the output should really be a specific extension of the core RSS or Atom specifications. This is a nightmare when applied to video, photos, music and so on, and it makes intelligent search and syndication very difficult.
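Here is a small Python sketch of the sort of normalizing code this leads to. The sample feed is invented and the fallback order (atom fields first, then media:credit and media:player) is simply an assumption that worked for illustration; real feeds vary in more ways than this handles.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "media": "http://search.yahoo.com/mrss/",
}

# Invented sample: one entry uses atom fields, the other only media elements.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <title>Clip one</title>
    <author><name>Alice</name></author>
    <link href="http://www.example.com/clips/1"/>
  </entry>
  <entry>
    <title>Clip two</title>
    <media:group>
      <media:credit>Bob</media:credit>
      <media:player url="http://www.example.com/player/2"/>
    </media:group>
  </entry>
</feed>"""


def normalize_entry(entry):
    """Return a dict with author and link, wherever the feed chose to put them."""
    author = entry.findtext("atom:author/atom:name", namespaces=NS)
    if not author:
        author = entry.findtext(".//media:credit", namespaces=NS)

    link = None
    link_el = entry.find("atom:link", NS)
    if link_el is not None:
        link = link_el.get("href")
    if not link:
        player = entry.find(".//media:player", NS)
        if player is not None:
            link = player.get("url")

    return {"author": author.strip() if author else None, "link": link}


entries = [normalize_entry(e) for e in ET.fromstring(SAMPLE).findall("atom:entry", NS)]
for item in entries:
    print(item)
```

Multiply this by every extension and every site's habits and you can see why a shared agreement on where things go would save everyone work.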

Call to action
We really need to change some of this. It’s not like we need any scientific breakthrough to make this work – we just need to come to some kind of agreement on the points I outlined above – all the difficult technical stuff has already been done. It doesn’t require ripping out software – just extending it.

If you provide an Atom feed you may not want to change that but adding a version parameter in an API is easy. That way you can provide the "new" improved feed. Really the best option is to look at http://opensearch.org but there are any number of options I would happily accept.

We also need to generally improve search and syndication and realize it is not something that takes people away from your site, but rather drives them to your site. The better your search API, the easier it will be for sites like http://weblivz.com to integrate your feed with specialist and content-sensitive queries. Users will like that and they will come to you through all sorts of gateways!

Please take away the commercial stuff, or at least tone it down. In most cases people will come to your site if they see a teaser of what they want no matter what site they are on.

Feedback
Thanks for reading. This was based on my experiences creating the weblivz website. I’d love to hear feedback – good or bad. If you have pointers, corrections or additions, please suggest them and I will update the article.
