RSS is dead; long live RSS!

I was quite perplexed to see this article at ZDNet on Techmeme, arguing that RSS is a failure. Now, I’ve been relying less and less on Google Reader myself as a source of news as well, but that’s not because of a failure in RSS technology but rather the obsolescence of Google Reader in the Twitter age. Marshall Kirkpatrick of RWW has a response, arguing that RSS isn’t dead but just one of many information-delivery mechanisms he relies on; I think this response misses the point, however. The truth is that RSS has become an infrastructure technology, the glue that binds the web together and makes it useful. Yahoo Pipes was a great example of how RSS could be used to manipulate content, and half the functionality of Twitter itself comes from the ability to use RSS to import content to it. FriendFeed also relies on RSS feeds generated from your social sphere, and Facebook has supported importing of RSS feeds for a while. The point here is that RSS is so prevalent it has become invisible. Yes, you can tap the raw stream of RSS content directly, using Google Reader or equivalent, but that’s like drinking from the firehose. The better approach is to let your social graph do the filtering for you and then present the result as a steady stream (the so-called river of news). That stream is content, but the streambed is RSS.

Related: Dave Winer makes much the same point, that “the Internet is layered.” Also see James Robertson’s comments about the closed tech pundit circle.

whither NewsJunk?

One of my mantras is to rely on others to filter my data in the social web, because the key to improving your signal to noise ratio is not to try and filter the noise, but actually to reduce your signal. That’s a lot harder to do than it sounds. But it’s made a lot easier by genuinely smart filterers like Dave Winer’s NewsJunk, which was an invaluable tool during the election season. Winer basically culled the best and most interesting news stories (by hand) and fed them to a dedicated RSS feed, which then fed into Twitter. As a result I often briefed myself on the day’s politics by first checking @newsjunkies rather than wading into my mess of feeds on Google Reader cold. This is why I am genuinely sad to see that Winer is considering pulling the plug on NewsJunk now that the election has ended.

The problem with Web2.0

I intended to write a blog post on this topic, but ended up using PowerPoint to organize my thoughts, and then realized that the resulting slideshow made the post somewhat superfluous. It is a rumination on the problem with web2.0 today (information overload), some solutions, and speculation about where we go from here:

beyond the tag cloud: the tagdex

I think tag clouds are somewhat useless, to be honest. They are a nice way to fill up a bit of space in a sidebar, if you restrict the cloud to the top 25 or so, but unless the writer is imposing a strict taxonomy on themselves, the cloud will ultimately balloon to an unmanageable size. And a tag cloud in a folksonomy makes no sense, because the wide variation in tags is a feature, not a bug. You want the tags to be vast and redundant. It is ok to have a post about Jhumpa Lahiri’s latest novel tagged “book”, “books”, “review”, “Lahiri”, etc. because this increases the points of entry to the content from tag indexing services like Technorati, and also increases the intra-blog, inter-post linkages (assuming you are using some variant of a Related Posts plugin that uses tags for determining what is related).

A far better way to think of tags is to consider them as terms in an index. The same kind of index you find at the end of a piece of non-fiction, to be specific. Consider an excerpt from the Index to the book, The Physics of Star Trek, as an example:

excerpt from second page of index to Physics of Star Trek

It’s easy to see how tags could be recruited to “build” an index of this type. The tags would first need to be sorted in alphabetical order, and then listed as a DL-type HTML list with the “page number” (post number). A range of posts could be indicated by the usual dash (ex. Bosons, 192-194) and a list of separate posts by commas (Black Star, 15, 51).
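The crude version described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the input is taken to be a plain mapping of post number to tag list, and the function names `format_posts` and `build_tagdex` are made up for this example rather than part of any WordPress API.

```python
from collections import defaultdict
import html


def format_posts(post_ids):
    """Collapse post numbers into index notation, e.g. '15, 51' or '192-194'."""
    ids = sorted(set(post_ids))
    if not ids:
        return ""
    runs = []
    start = prev = ids[0]
    for n in ids[1:]:
        if n == prev + 1:
            # Still inside a run of consecutive posts.
            prev = n
            continue
        runs.append((start, prev))
        start = prev = n
    runs.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def build_tagdex(posts):
    """Build a DL-type HTML list from a {post_id: [tags]} mapping (assumed shape)."""
    by_tag = defaultdict(list)
    for post_id, tags in posts.items():
        for tag in tags:
            by_tag[tag].append(post_id)
    lines = ["<dl>"]
    # Alphabetical order, ignoring case, as in a printed index.
    for tag in sorted(by_tag, key=str.casefold):
        lines.append(
            f"  <dt>{html.escape(tag)}</dt><dd>{format_posts(by_tag[tag])}</dd>"
        )
    lines.append("</dl>")
    return "\n".join(lines)
```

Feeding it posts 15 and 51 tagged “Black Star” and posts 192 through 194 tagged “Bosons” yields the two entries from the example above. A real plugin would presumably link each post number to its permalink, which this sketch leaves out.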

That would be the crudest implementation, but quite effective. However you could go further than this. For example, what about the “see also” link? You could simulate this by looking for tags whose usage is highly correlated, like “Lahiri” and “books”. You could literally calculate Pearson’s correlation coefficient between all pairs of tags in the database and store that in a lookup table, which would be updated whenever a post is published. Then any pair of tags whose correlation coefficient is above some threshold (say, > 0.50) would get the “See also” treatment on both tags’ entries.
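Here is a rough sketch of that “see also” calculation, assuming each tag is represented as a 0/1 vector over all posts (1 if the post carries the tag). The data shape and the names `pearson` and `see_also` are assumptions for illustration; a real plugin would read from the blog’s database and cache the table.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A tag used on every post (or none) has no defined correlation.
        return 0.0
    return cov / sqrt(var_x * var_y)


def see_also(posts, threshold=0.5):
    """Map each tag to the tags whose usage correlates above the threshold.

    `posts` is assumed to be a {post_id: [tags]} mapping.
    """
    post_ids = sorted(posts)
    tags = sorted({t for tag_list in posts.values() for t in tag_list})
    # One 0/1 usage vector per tag, aligned on the same post order.
    vectors = {t: [1 if t in posts[p] else 0 for p in post_ids] for t in tags}
    related = {t: [] for t in tags}
    for a, b in combinations(tags, 2):
        if pearson(vectors[a], vectors[b]) > threshold:
            # The link is symmetric: both entries get a "See also".
            related[a].append(b)
            related[b].append(a)
    return related
```

Computing every pair is quadratic in the number of tags, so on a large folksonomy it would make sense to recompute only the pairs touched by the newly published post.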

You could even draft categories in WordPress to contribute, by using them as “tags” in their own right and lumping them into the regular index build (after all, as implemented in WordPress, tags and categories are just redundant taxonomic systems). However, you also might look for correlations between tags and categories, and use the categories as Index parent terms. An example from my own geekblog would be something like

Anime
    Ranma
    Makoto Shinkai
    Someday’s Dreamers
    (…)
Geek Service
    Asus EEE PC
    HDTV
    Space
    (…)

I had to manually generate the above but it would be far simpler to do it via correlation analysis instead. At any rate, the basic idea is to assign categories as index headings and tags as their dependents, since presumably categories are more formally taxonomic, and more importantly, fewer. In fact you could do both, treating categories as tags and also giving them higher status as above. You would just need to put a logical test in to exclude a category from appearing as its own parent/child!
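As a sketch of the heading idea, the snippet below files each tag under the category it most often shares a post with. Note that it swaps in a simple co-occurrence count for a full correlation analysis, and that the post structure (dicts with "categories" and "tags" keys) and the function name are assumptions made for the example.

```python
from collections import Counter, defaultdict


def group_tags_by_category(posts):
    """Return {category: [tags]} with each tag under its most frequent category."""
    cooccur = defaultdict(Counter)  # tag -> how often it appears with each category
    for post in posts:
        for tag in post["tags"]:
            for cat in post["categories"]:
                # The logical test: a category never becomes its own child.
                if tag.casefold() != cat.casefold():
                    cooccur[tag][cat] += 1
    headings = defaultdict(list)
    for tag, counts in cooccur.items():
        # Ties are broken alphabetically so the output is deterministic.
        best = max(sorted(counts), key=counts.get)
        headings[best].append(tag)
    return {cat: sorted(tags) for cat, tags in sorted(headings.items())}
```

A tag that only ever appears alongside a category of the same name simply drops out of the child lists, which is the exclusion described above.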

Obviously a tag-driven index as above wouldn’t fit in a sidebar. A useful place for it would be its own page, but you might also imagine it embedded on the 404 page. As a standalone, though, it would be a very useful node for search engine optimization, enough so that perhaps it should be called a “tagdex” instead of an index to better distinguish it.

Though useful to any blogger using tags on WordPress, a tagdex would be far more effective on a site whose tags were a genuine folksonomy rather than a taxonomy, since the tag diversity would be greater. However, folksonomy is not a feature of WordPress, unless you use Scott’s awesome WP-Folksonomy plugin (which he wrote in response to my earlier rant about taxonomies and folksonomies). If an ecosystem of WordPress-based folksonomies can be encouraged to thrive (using Scott’s plugin, or equivalent), that will be a significant step towards the Semantic Web. A tagdex represents a coherent snapshot of all the tag metadata in that site’s folksonomy (or taxonomy). As such, it is something that could be parsed and aggregated by the hypothetical Semantic Search Engine of the future.

Semantic authoring

RWW argues that for the Semantic Web to really take off, content-management systems need to incorporate semantic markup. They argue,

Allowing authors or readers to add tags to articles or posts allows a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way – a la Spock, twine and Powerset – but it is currently limited in scope and needs a lot of computing power; in addition, if we could put the proper tools in the authors’ hands in the first place, extracting the semantic meaning would be so much easier.

For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records – say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.

Ideally, the authors would create the content as meaningful XML text, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways and there would be SEO benefits as well, if search engines could understand it more easily. But tools that create such XML, and yet are natural and easy for authors to use, don’t appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.

The problem with XML authoring, as the author notes, is that it’s too time-consuming from a user perspective. You’re basically requiring that the user fill out a detailed, unique form on every post or content node.

What’s really needed is a way for the CMS to prefill semantic data for the user, and then let the user tweak it. The prefill would have to come from contextual information (post title keywords, word frequency analysis, link text) and metadata (category, tags). In a way you have a mini-search engine index running against your own post, and giving you “search results” to let you “rank” the sub-content into a structured form. And even then, the CMS should take pains to hide the XML-ness; instead of showing the user a pile of confusing <blah>blahblah</blah> XML markup, it should provide a cleaner view like

drink: triple latte
cost: $4.50
opinion: sucks, overpriced

where of course the labels (drink, cost, opinion) are mapped to the actual XML containers <drink></drink> etc. The user can edit the list easily, insert or delete labels as they choose, and then hit publish.
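The mapping from that label view to XML containers is simple enough to sketch with Python’s standard library. The `labels_to_xml` name and the `entry` wrapper element are assumptions for illustration, and the sketch assumes each label is already a valid XML element name (apart from spaces, which it turns into underscores).

```python
import xml.etree.ElementTree as ET


def labels_to_xml(text, root_tag="entry"):
    """Turn 'label: value' lines into <label>value</label> children of one root."""
    root = ET.Element(root_tag)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the edited list
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'label: value', got {line!r}")
        child = ET.SubElement(root, label.strip().replace(" ", "_"))
        child.text = value.strip()
    return ET.tostring(root, encoding="unicode")


latte = """drink: triple latte
cost: $4.50
opinion: sucks, overpriced"""
print(labels_to_xml(latte))
```

The user only ever edits the plain list; the conversion runs when they hit publish, so inserting or deleting a label just adds or drops a container.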

To achieve this, you need good metadata. By good, I mean “rich” – it should be noted that tagging alone is actually pretty poor as far as metadata goes because it’s usually only a taxonomy imposed by the author, not a true folksonomy. The advantage of the latter is that the metadata is more variable, giving any semantic algorithm more room to play with. Note that tagging as implemented in WordPress is not a true folksonomy, though a plugin now exists to rectify that. Semantic algorithms will starve on taxonomies alone.

wordpress folksonomy progress

The experiment of adding Scott’s WP_Folksonomy plugin to my blog has been a success so far. My blog, haibane.info, is by no means a giant traffic draw but it does have enough that the userbase has been adding some tags of their own. I have at least one user (Scott himself?) who reliably adds tags to most posts, and there have been others drive-by tagging as well. It’s also encouraging to see that there was a thread at the WordPress support forum asking about folksonomy; I directed them to the plugin asap. Now, a search for the term “folksonomy” will lead people to the same tool, and thus the seeds are sown for more people to use it. Let’s hope that many more blogs, preferably far larger than mine, embrace and adopt folksonomy this year.

del.icio.us bundle linkrolls

The grandfather of social bookmarking sites is del.icio.us, which basically brought “tagging” mainstream (along with Technorati). Most people I know who use the service end up with unwieldy tag clouds, however, because it’s often hard to enforce self-discipline on what tags you assign. I’ve spent a lot of time manually pruning my tags but there are still plenty on my tag list that are redundant or obsolete.

There is an option to “bundle” your tags – essentially, tagging a group of tags, to help you organize things better. However, bundles at present are only visible to the user, and do not have a dedicated URL or RSS feed like individual tags do. Using the “+” operator to search for multiple tags, i.e. http://del.icio.us/azizhp/Iraq+Hillary, functions as an AND operator, whereas to simulate a bundle you’d need an OR equivalent that del.icio.us does not support. As a result, if you want to add a linkroll to your site that only shows tags from a single bundle, you’re out of luck.

However, there is a workaround, albeit a clumsy one: create “container” tags. Then you must manually tag all items in the bundle with the container tag. After doing this, you will be able to access your bundle using the container tag, and can create customized linkrolls accordingly. For example, I created the “2008” container tag for all my tags related to the Presidential candidates.

One caveat: try to avoid naming your container tags identically to the bundle. You can prefix the container tags with the “@” symbol to keep them distinct, or name them entirely differently. This is so that if/when in the near future del.icio.us improves support for bundles there won’t be any namespace collisions between your tags and your bundles. Once that day comes you can simply delete all the container tags if you so wish.
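To make the AND/OR distinction and the container-tag workaround concrete, here is a small local model in Python. It does not talk to del.icio.us at all; the bookmarks dictionary, the example URLs, and the function names are all made up for illustration.

```python
def match_all(bookmarks, tags):
    """AND search: bookmarks carrying every requested tag (what '+' does)."""
    wanted = set(tags)
    return [url for url, t in bookmarks.items() if wanted <= set(t)]


def match_any(bookmarks, tags):
    """OR search: bookmarks carrying at least one requested tag (a bundle)."""
    wanted = set(tags)
    return [url for url, t in bookmarks.items() if wanted & set(t)]


def add_container_tag(bookmarks, bundle_tags, container):
    """Return a copy where every bookmark in the bundle also has the container tag."""
    bundle = set(bundle_tags)
    return {
        url: (sorted(set(t) | {container}) if bundle & set(t) else list(t))
        for url, t in bookmarks.items()
    }


# Hypothetical bookmarks: URL -> tags.
bookmarks = {
    "http://example.com/a": ["Iraq", "Hillary"],
    "http://example.com/b": ["Hillary"],
    "http://example.com/c": ["cooking"],
}
bundle = ["Iraq", "Hillary"]
tagged = add_container_tag(bookmarks, bundle, "@2008")
```

Once every bundled item carries the container tag, a plain single-tag lookup on it returns the same items an OR search over the bundle would, which is why the workaround gives you a usable linkroll.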

Alas, there still is no way to create a tag cloud from a single bundle, so that still awaits the del.icio.us team’s attention.