Monday, September 29, 2008

Happy Belated Anniversary Google




I kinda wish i had spent more time on this the other day but i was busy doing other stuff. Thanks to a link from DeWitt Clinton (who now works at Google) on Google jobs back in 1999, i played around with archive.org for the first time in ages.

People may already have found this, but i had to blog anyway about some gems i found that blew me away.

1. The Google Beta Search Engine in 1998 - link
















2. The "Google Friends" mailing list which i suspect was the first public message about Google. It was written by Larry Page himself on 28 April 1998 at 10.28 PM (so must have been a late one!). I love message 5 "Google gets funding" !!

















The "Group Info" suggests 1 post a month. I like that either Larry or Sergey categorized the group as "Culture & Lifestyle : Gender : Women, Girls, Mothers, Daughters".




























I did actually send a "Happy Birthday" message to "google-friends-subscribe@makelist.com" - it's nice to be nice. We made a big deal of my son's 5th birthday and i reckon i see Google as much as him these days! It hasn't bounced so maybe i'll get a reply ;)
























3. This is the first public email (mentioned above) i am aware of by Larry Page himself about Google.
























4. There is also the first email from Larry Page and another in July 1998 where he talks about the new features. For anyone stressing out over their servers, here is surely one of the historic paragraphs of the Internet:






Combined "our server" (some suggest there are now > 64,000 servers) and "try back in a minute or two". Come on guys - how did you get funding with that line ;) If this kind of thing doesn't inspire you as an entrepreneur you're maybe in the wrong job!

Look around and let me know if you find other similar gems.

This reminds us these were two guys who were like every other entrepreneur at the start - they had no idea where they would be in 10 years.

Hey, they'll likely never read this, but Well Done - i consider myself inspired !!

Think i'm starting to really think about writing a book on this kind of thing. Amazing!

PS. I forgot to mention the Google Stickers... check this.

Thursday, September 25, 2008

The art of a search query

One thing that i don't see much coverage of is how non-technical users get the power afforded by advanced querying. Something that drove me to create something LIKE http://weblivz.com/ was my experience with querying in a project which involved many hundreds of millions of distributed records.

In that project we had a team of over 20 experts writing specific queries that were semantically relevant to the area they were in. The output of one search was actually a fairly complex backend query, most of the time consisting of the union of multiple backend queries all reformatted for a specific output.

Today many sites - and emerging distributed query sites - are focused on simple queries, but this requires that you KNOW semantically what to look for and that you want to type it in all the time. In addition it assumes you know how to construct fairly complex (it's all relative) queries.

So, YES, we need all these cool sites that do the keyword searching. But we ALSO need something a bit higher level. We need to hide the complexity of searching from the user and also make it easier for them to remember the kinds of searches they either constructed or used before.

Wednesday, September 24, 2008

A syndication formatting cache

I'm really thinking about this stuff just now, so this note is as much use for me as anyone else.

We have a ton of sources all working with Atom/RSS formats but being semantically different and in cases extending the same concepts in different ways (e.g. Digg has its own namespace in its Atom feeds for authors).

Imagine a service that indexed and transformed these sources to normalized formats. So you could basically do XPATH style queries (the interface wouldn't be so complex of course) on the RSS/Atom sources and not only get the data in a given element, but be semantically accurate on what you are getting.

In addition, extension namespaces could also be queried, so you could ask for media items from youtube, flickr, mefeedia and so on and get an accurate result.
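To make the idea a bit more concrete, here is a rough sketch of that kind of normalization. It is only an illustration, not how any real service does it: the source names ("siteA", "siteB"), the field names (including "ext:submitter") and the sample entries are all made up, and the entries are assumed to be already parsed into plain objects.

```javascript
// Hypothetical per-source rules: each function knows where one site keeps
// the author and the media link, and returns them under common names.
var normalizers = {
  siteA: function (entry) {
    return { title: entry.title, author: entry.author.name, media: entry.link };
  },
  siteB: function (entry) {
    // This made-up site uses its own extension elements instead.
    return {
      title: entry.title,
      author: entry['ext:submitter'],
      media: entry['media:content'].url
    };
  }
};

// Convert one parsed entry into the shared shape, tagged with its source.
function normalizeEntry(source, entry) {
  var normalize = normalizers[source];
  if (!normalize) {
    throw new Error('No normalizer registered for source: ' + source);
  }
  var result = normalize(entry);
  result.source = source;
  return result;
}

var normalized = [
  normalizeEntry('siteA', {
    title: 'Clip one',
    author: { name: 'steven' },
    link: 'http://example.com/a/1'
  }),
  normalizeEntry('siteB', {
    title: 'Clip two',
    'ext:submitter': 'xavier',
    'media:content': { url: 'http://example.com/b/2' }
  })
];

// Every entry can now be queried the same way, whatever its origin.
var authors = normalized.map(function (e) { return e.author; });
console.log(authors);
```

Once everything is in one shape, asking for "the author" or "the media link" means the same thing for every source, which is the whole point.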

This service may even be useful if we were all using the same format, but at a time where joining feeds is near impossible when you are thinking about the user, it may be useful to have a service helping out.

I've already written a bunch of transforms to do this and to be honest had to write an Xslt for every single feed i got (i think delicious was ok), so i know the headache as others look for more advanced syndication feeds!

Long tail of atom extended formats

I picked up on a post by Andrew Turner on OGC Geospatial Search Summit
“Of course, a format can expand upon this and offer more complex formats that
conform to more complex specs. But by at least providing a common baseline means
that almost any service can easily interconnect with another service.”

I can see why we need to stop at some point! The issue is that in the long tail, these extended formats are quite prevalent and I’d like to see extended communities supporting people who want to extend. The reason I say that is that even in the rich media space I have numerous Xslt’s, function calls and so on to normalize what is essentially the same data. GeoRSS is an example of a specific community that does it well!

My thinking is that if (at the extreme) two companies in the world extended for a very specific topic, we could at least get some normalized view of the data for everyone as a response from an OpenSearch query.

Opening up advanced search syndication

[update: 24 Sept @ 12:07 PM : Google Chrome also supports OpenSearch]

[update: 24 Sept @ 11:45 AM : Through the OpenSearch group, I see MS IE 8 Beta supports OpenSearch so perhaps even more sites will realize more complex searches are needed]

As you may well now be aware, I just launched an Alpha release of weblivz.com and what I tried to do there is (in many cases) write intelligent query sets against sites that provide the results as RSS or Atom feeds. So rather than just pulling in every feed we can find, we actually create a query such as "Europe and technology" and so on. It really isn’t easy and requires a lot more work that isn't visible up front. Here are the five issues:

1. Search, no feed
2. Query syntax
3. RESTful
4. The commercial clause
5. Semantics of response formats

I will provide one example in each section, but there are many others I have come across.

Search, no feed
In many cases you can get RSS or Atom feeds from static pages, but as soon as it comes to searching and gathering the results as a feed, you’re in trouble. One example is MeetUp.com.

I can do a bunch of querying to get certain feeds but as soon as I want to do something such as "languages and Glasgow" I’m out of luck. In short, you get exactly what you want with most search queries on some of these excellent sites, but they only work when you are ON the site.

This misses the opportunity of syndicating the results to third parties to allow them to point at YOUR site. You end up with having to use the limited feeds available and most of the time this isn’t much use to anyone – especially in the era of content overload and the increasing importance of providing the user with what they want.

Requirement
Ensure all the results from your search query can be syndicated as RSS or Atom. The extra queries against your server will be balanced against higher profile or extra hits on your site.

Query Syntax
In short, query syntax is all over the place. In some cases you can only search for one term and in other cases you can’t use AND or OR. There is a real lack of support for doing interesting things – if you really want to customize the feeds, in most cases you are seriously limited by what you can achieve. One example is http://eventbrite.com.

If I want to search for Tech events in Glasgow or Edinburgh I really need to do 4 separate queries – "tech glasgow", "technology Glasgow", "technology Edinburgh" and "tech Edinburgh". This is a waste of resources all round for something that is relatively simple to achieve.

In some cases typing "or" doesn’t mean the same as "OR" and in others typing "and" gets interpreted as part of the querying rather than ANDing the terms.
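Until a site supports OR itself, the workaround is to expand the query on the client. A minimal sketch of that expansion follows; the function name expandQuery is my own invention and the term groups are just the example above.

```javascript
// Build every combination of one term from each group, so that
// (tech OR technology) AND (Glasgow OR Edinburgh) becomes four plain queries.
function expandQuery(groups) {
  return groups.reduce(function (queries, alternatives) {
    var next = [];
    queries.forEach(function (prefix) {
      alternatives.forEach(function (term) {
        next.push(prefix ? prefix + ' ' + term : term);
      });
    });
    return next;
  }, ['']);
}

var queries = expandQuery([['tech', 'technology'], ['Glasgow', 'Edinburgh']]);
console.log(queries);
```

Each extra group of alternatives multiplies the number of requests, which is exactly the waste of resources a proper OR operator on the server would avoid.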

Requirement
If we are not going to use http://www.OpenSearch.org then the site needs to at least provide a powerful search interface – perhaps more powerful than any individual is going to use, but in cases such as http://weblivz.com we are actively creating powerful queries against your backend database – saving everyone resources!

Give us a REST
For the majority of queries, a RESTful URL should be enough to get the results. Granted many sites already support this but there are others that provide access only through an XML-based POST API. This is good for more advanced queries but sometimes you want to write something simple that returns results in a given format, without passing an extended collection of parameters.

Requirement

If I can’t type (something like):

http://www.yoursite.com/search.atom?query=tech+Glasgow

… then you really need to think about adding this functionality.
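From the consuming side, that is all a client should need to do. A small sketch, assuming the placeholder host, the "search.atom" path and the "query" parameter name from the example URL (none of these are a real API):

```javascript
// Build a search URL of the kind shown above. The host, path and
// parameter name are placeholders taken from the example, not a real API.
function buildSearchUrl(baseUrl, terms) {
  // URLSearchParams takes care of encoding; spaces become "+".
  var params = new URLSearchParams({ query: terms.join(' ') });
  return baseUrl + '?' + params.toString();
}

var url = buildSearchUrl('http://www.yoursite.com/search.atom', ['tech', 'Glasgow']);
console.log(url);
```

No XML request body and no parameter collection to assemble, just a URL that can be bookmarked, shared or dropped into a feed reader.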

The commercial clause
Now this is one I simply DO NOT GET. Many sites say the data can’t be used in a commercial context, without really defining what that means. I have some sympathy for this when it is data you have collated and published – such as a postcode search facility or something... syndicating that would render visiting your site close to pointless.

However, when it comes to user generated content I just cannot understand it. Many sites allow searching, feeds and then insert a clause saying you cannot syndicate the data without a license. The point of these feeds however is to provide a "teaser" to bring the people TO your site… so even in a commercial context, surely allowing your feeds to be displayed can ONLY work for you.

One example is http://www.mystrands.com, which has got "commercial" all over it (it also doesn’t have RESTful URL access to feeds). So do we write YET ANOTHER MyStrands or do they just provide intelligent syndicated search feeds we can all use and drive business to our and their site?

Requirement
Remove this kind of clause.

Semantics of response formats
This particular part almost drove me to distraction. We now have RSS and Atom as the key formats of feeds – sure there are some variations in versions but we are pretty close to two general formats.

So where is the problem with this? Well, the problem is twofold. The first is different interpretation of what goes into each field and the other is the extensions used within the feeds and the variation on how these are semantically interpreted.

The first of these is particularly an issue with "content" and "summary". Some people put in a short description, others put in formatted html. Some don’t put a summary and only add content so you need to parse this somehow if you want to display some kind of a summary.

In addition to this you may find some sites (such as FriendFeed) provide much of the information that should be in atom fields (such as the author) embedded within the content so you would need to parse that to give any kind of standard view.
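The summary problem usually ends up being handled with a fallback along these lines. This is a crude sketch under my own assumptions: the entry is already parsed into an object with optional "summary" and "content" strings, and the regex tag stripping is for illustration only (real HTML needs a proper parser).

```javascript
// Return a plain-text summary for an entry: use the summary when present,
// otherwise fall back to the content. Tags are stripped either way and the
// text is cut to maxLength characters.
function summarize(entry, maxLength) {
  var source = entry.summary || entry.content || '';
  // Crude tag removal for illustration only.
  var text = source.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
  if (text.length <= maxLength) {
    return text;
  }
  return text.slice(0, maxLength).trim() + '...';
}

var withSummary = summarize({ summary: 'Short description' }, 40);
var contentOnly = summarize({ content: '<p>Some <b>formatted</b> html</p>' }, 40);
console.log(withSummary);
console.log(contentOnly);
```

It works, but every consumer of the feed has to write something like it, which is the complaint.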

Now, the extensions are altogether more of an issue. Just try combining some of the feeds using the xmlns:media (http://search.yahoo.com/mrss/ ) namespace. Sometimes the link is in the atom elements, others it’s with the media player element, sometimes the author is in the media credit and others it’s in the atom author field.

You need to parse some of these to death just to get a standard output – in fields where the output should really be a specific extension of the core RSS or Atom specifications. This is a nightmare when applied to video, photos, music and so on and makes intelligent search and syndication very difficult.

Call to action
We really need to change some of this. It’s not like we need any scientific breakthrough to make this work – we just need to come to some kind of agreement on the points I outlined above – all the difficult technical stuff has already been done. It doesn’t require ripping out software – just extending it.

If you provide an Atom feed you may not want to change that but adding a version parameter in an API is easy. That way you can provide the "new" improved feed. Really the best option is to look at http://opensearch.org but there are any number of options I would happily accept.

We also need to generally improve search and syndication and realize it is not something that takes people away from your site, but rather drives them to your site. The better your search API the easier it will be for sites like http://weblivz.com to integrate your feed with specialist and content sensitive queries. Users will like that and they will come to you through all sorts of gateways!

Please take away the commercial stuff, or at least tone it down. In most cases people will come to your site if they see a teaser of what they want no matter what site they are on.

Feedback
Thanks for reading. This was based on my experiences creating the weblivz website. I’d love to hear feedback – good or bad. If you have pointers or want to point out anything right/wrong or additions, please suggest and I will update the article.

Wednesday, September 10, 2008

Gazopa - it's fun too

While Gazopa looked useful and i tried a few searches which i had mixed results with, what REALLY caught my imagination was its ability to allow YOU to draw and look for similar results.









Now, you need to remember I have the artistic ability of a drunk one handed (pawd?) fox wearing a blindfold so it really was a challenge.



Well what do you know, it actually brought something back for my, erm "spider" - and it wasn't too bad - mainly coz it DID actually show a spider (yes, i was as shocked as you are reading).

Look at that last result on the right - a frickin spider. And the others were not too bad - but 20 points for anyone who can tell me why a LAMP is in there?!

So, being excited i decided to try searching for a more creative drawing. I thought, who will i search for that might come back in their database? Well, @scobleizer came to mind and i've watched enough of his videos to think i could have a good attempt at a portrait. As my guide i used a fairly unusual picture of him (at least i'd imagine it's unusual)...
























So below you can see my, erm, portrait (1000 apologies) and the results (using a facial search coz the results without it were just ridiculous).


























It's perhaps fair to say there is still some work when it gets a little more advanced - but that's ok, coz it was the first thing ever to recognize something i drew!

Friday, September 5, 2008

Javascript object instance methods

Some other styles i've seen in the creation of instance methods on Javascript objects are as follows:

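var NS = {}; // namespace object assumed by the examples below

// Style 1: the method is attached to the prototype inside the constructor
// (it works, but the method is re-assigned on every construction)
NS.test = function() {
    NS.test.prototype.doSomething = function()
    {
        // ...
    };
};


// Style 2: "new function" creates a single object straight away
NS.test2 = new function() {
    this.doSomething = function() {
        // ...
    };
};


// Style 3: build the type inside an anonymous function, then export it
(function() {
    var test3 = function() {};

    test3.prototype = {
        doSomething : function()
        {
            // ...
        }
    };

    // static method: attached to the constructor itself, which must
    // exist before the method can be added to it
    test3.doSomethingStatic = function() {
        // ...
    };

    NS.test3 = test3;

})();

var test1 = new NS.test();
test1.doSomething();

NS.test2.doSomething();

var test3 = new NS.test3();
test3.doSomething();

NS.test3.doSomethingStatic();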

Variable scope in Javascript

I'm from an old skool of JavaScript where it was a nice add-on and everything was done on the server. Sure, i've used all the JQuery, Ajaxy, Prototypey frameworks out there but i never went back to look at the core of the language - well at the moment i'm working on something and writing a lot of script.

So i learned something new. A lot of frameworks were using the following syntax in their libraries..

(function() {
...

})();


It's fair to say i was going mad trying to figure out why. I mean you put an alert in there and it pops up when the script loads. So why bother with the function? Why not just write your code as we used to - directly in the page?

Well, thanks to this paper, I have discovered why and this note is as much for my future reading as yours :)

Turns out that you can locally scope variables within these anonymous javascript functions which more importantly means you don't screw up global vars that are being used elsewhere. So you can confidently write the following ...
var person = 'steven';
(function() {

var person = 'xavier';

alert("Person 1 " + person);

})();

alert("Person 2 " + person);



... and you will get "Person 1 xavier" and "Person 2 steven" - notably the global variable "person" was redeclared within the scope of the anonymous function and given a new value, but as it was declared with "var" it does not overwrite the global value. If you did NOT use "var" in the function, you would change the value globally. So it works like every other programming language - just a wee bit different in terms of syntax :)
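For comparison, here is the same example without the "var" inside the function. This is my own variation on the snippet above, using console.log and an array in place of alert so it also runs outside a browser.

```javascript
var person = 'steven';
var seen = [];

(function () {
  // No "var" here, so this assigns to the outer "person"
  // instead of creating a new local variable.
  person = 'xavier';
  seen.push('Person 1 ' + person);
})();

seen.push('Person 2 ' + person);
console.log(seen);
```

This time both messages say "xavier", because the function overwrote the outer variable.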

Wednesday, September 3, 2008

V8, Chrome and the DOM

V8 is the Javascript engine for Chrome, the new Google web browser. It is very fast.

V8 does not provide DOM or XML DOM support. This was confirmed to me by a member of the team at the V8 google group.

This leaves me with a question i hope someone can answer.

Most of the scripts i write either use the DOM (getElementById) or the Xml Dom (childNodes[0].nodeValue) and so on. That is a LARGE part of my scripts.

So if the engine does not provide support for this, what happens to my scripts? Do they go really fast until they need to query the DOM and then slow down? The net effect in my case could be a lot of slowing down. In my experience, querying the DOM is the slowest part of your scripts (unless you write non-typical scripts).

Haven't had time (and won't today) to run tests so if you know the answer please let me know.

Tuesday, September 2, 2008

"Google Suggests ..." - don't use Chrome :)

I had to put this one up. I was in Chrome and tried to change my pic in the Groups and here's what i got (changing pic works fine in friendfeed, for example).


my first chrome weird thing

This may just be chance, but the only site i can't get to is webkit.org - which Chrome runs on top of :)

Not only that, it thinks it is "null" (other sites that can't actually be found give a DNS error).

Anyone else get to http://webkit.org ?


Chrome Auto-complete Ideas

Whilst some of what i have seen from Chrome is what i would expect of a browser of this generation (much of it is in IE 8 too), there are some things such as the V8 Javascript compiler which looks to be very cool. I am only halfway through the book, but wanted to write a post on auto-complete.

Google say Chrome will mean you don't need to bookmark pages any more and will provide an improved auto-complete. Well, here's my thoughts...

"Auto-complete" should be much more expansive. I rarely remember URLs and even things i have found and bookmarked can be a pain to find again. In fact i tend to bookmark things and never look at them again. So i may be searching for something i have already found. Auto-complete needs to add techniques to discover information you want to keep track of. Tags in a delicious style are an obvious way. Add to that some kind of personal strength and it may be very useful. So i type in "Javascript Cross Domain problem" and get a view of a set of links i previously found useful - coz let's face it, you'll end up going to google, and in the process of a single search modify the query, view certain pages, like some, discard others and repeat. In that *session* I should be able to get back to all the pages i found useful. I don't mean caching the pages, i mean the semantic link to those pages.

Can you imagine a little package of links you found useful related to a search concept that you can reuse in the future? I would LOVE this. So where Chrome shows a new tab with the top links in it, the auto-complete concept would extend to the search package i discussed above.
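To show what i mean by a search package, here is a toy sketch. Everything in it is hypothetical (the function names, the example.com links and the matching rule are mine, not anything Chrome does): it stores the links kept from a search session under the query that produced them and suggests them again when you type matching words.

```javascript
// Hypothetical "search package": the links kept from one search session,
// stored under the query that produced them.
var packages = [];

function savePackage(query, links) {
  packages.push({ query: query, links: links });
}

// Return links from every saved package whose query contains all typed words.
function suggest(typed) {
  var words = typed.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) {
    return [];
  }
  var matches = [];
  packages.forEach(function (pkg) {
    var haystack = pkg.query.toLowerCase();
    var allFound = words.every(function (word) {
      return haystack.indexOf(word) !== -1;
    });
    if (allFound) {
      matches = matches.concat(pkg.links);
    }
  });
  return matches;
}

savePackage('Javascript Cross Domain problem', [
  'http://example.com/cross-domain-notes',
  'http://example.com/jsonp-intro'
]);
savePackage('Atom feed extensions', ['http://example.com/atom-extensions']);

var found = suggest('cross domain');
console.log(found);
```

A real version would want tags and some kind of personal ranking on top, but even plain word matching against old queries would get me back to pages i already found useful.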

ps. stuff such as gears, prototype and so on should be part of the environment itself... maybe that is later in the comic :)

pps. Is it true that the Joker appears in chapter 5??