beyond the tag cloud: the tagdex

I think tag clouds are somewhat useless, to be honest. They are a nice way to fill up a bit of space in a sidebar if you restrict the cloud to the top 25 or so tags, but unless the writer imposes a strict taxonomy on themselves, the cloud will ultimately balloon to an unmanageable size. And a tag cloud in a folksonomy makes no sense, because the wide variation in tags is a feature, not a bug. You want the tags to be vast and redundant. It is OK to have a post about Jhumpa Lahiri’s latest novel tagged “book”, “books”, “review”, “Lahiri”, etc., because this increases the points of entry to the content from tag indexing services like Technorati, and also increases the intra-blog, inter-post linkages (assuming you are using some variant of a Related Posts plugin that uses tags to determine what is related).

A far better way to think of tags is to consider them as terms in an index. The same kind of index you find at the end of a piece of non-fiction, to be specific. Consider an excerpt from the Index to the book, The Physics of Star Trek, as an example:

[Image: excerpt from the second page of the index to The Physics of Star Trek]

It’s easy to see how tags could be recruited to “build” an index of this type. The tags would first need to be sorted in alphabetical order, and then listed as a DL-type HTML list, each with its “page numbers” (post numbers). A range of posts could be indicated by the usual dash (e.g. Bosons, 192-194) and a list of separate posts by commas (Black Star, 15, 51).
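As a rough illustration of that crude implementation, here is a minimal Python sketch (not WordPress code). It assumes the posts are available as a mapping of post number to tags; the function names build_tagdex, format_post_numbers and render_dl are hypothetical, invented for this example.

```python
from collections import defaultdict
from html import escape


def format_post_numbers(numbers):
    """Collapse post numbers into index notation: runs become 'a-b', the rest are comma-separated."""
    nums = sorted(set(numbers))
    if not nums:
        return ""
    parts = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            # Still inside a consecutive run of posts.
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)


def build_tagdex(posts):
    """posts: mapping of post number -> iterable of tags (an assumed input shape).

    Returns (tag, references) pairs sorted alphabetically, ignoring case.
    """
    index = defaultdict(list)
    for post_id, tags in posts.items():
        for tag in tags:
            index[tag].append(post_id)
    ordered = sorted(index.items(), key=lambda item: item[0].lower())
    return [(tag, format_post_numbers(ids)) for tag, ids in ordered]


def render_dl(entries):
    """Render the entries as a DL-type HTML list."""
    lines = ["<dl>"]
    for tag, refs in entries:
        lines.append(f"  <dt>{escape(tag)}</dt><dd>{escape(refs)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)
```

Calling render_dl(build_tagdex(posts)) on a small dictionary of posts yields one DT/DD pair per tag, with consecutive post numbers collapsed into ranges.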

That would be the crudest implementation, but quite effective. However, you could go further than this. For example, what about the “see also” link? You could simulate this by looking for tags whose usage is highly correlated, like “Lahiri” and “books”. You could literally calculate Pearson’s correlation coefficient between all pairs of tags in the database and store that in a lookup table, which would be updated whenever a post is published. Then any pair of tags whose correlation coefficient is above some threshold (say, > 0.50) would get the “See also” treatment on both tags’ entries.
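To make the “see also” idea concrete, here is a hedged Python sketch. It treats each tag as a 0/1 vector over the posts (1 if the post carries the tag) and computes Pearson’s correlation coefficient between every pair of vectors. The input shape and the names pearson and see_also are assumptions for illustration; a real plugin would read from the database and cache the results in the lookup table.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A tag used on every post (or on none) has no variance; treat it as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def see_also(posts, threshold=0.5):
    """posts: mapping of post number -> collection of tags (an assumed input shape).

    Returns a mapping of tag -> list of tags whose correlation with it exceeds the threshold.
    """
    post_ids = sorted(posts)
    tags = sorted({tag for post_tags in posts.values() for tag in post_tags})
    # One 0/1 usage vector per tag, aligned on the sorted post numbers.
    vectors = {t: [1 if t in posts[p] else 0 for p in post_ids] for t in tags}
    related = {t: [] for t in tags}
    for a, b in combinations(tags, 2):
        if pearson(vectors[a], vectors[b]) > threshold:
            # The relationship is symmetric, so both entries get the cross-reference.
            related[a].append(b)
            related[b].append(a)
    return related
```

Comparing all pairs is quadratic in the number of tags, which is one reason to precompute the table at publish time rather than on every page view.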

You could even draft categories in WordPress to contribute, by using them as “tags” in their own right and lumping them into the regular index build (after all, as implemented in WordPress, tags and categories are just redundant taxonomic systems). However, you also might look for correlations between tags and categories, and use the categories as Index parent terms. An example from my own geekblog would be something like

Anime
    Ranma
    Makoto Shinkai
    Someday’s Dreamers
    (…)
Geek Service
    Asus EEE PC
    HDTV
    Space
    (…)

I had to manually generate the above, but it would be far simpler to do it via correlation analysis instead. At any rate, the basic idea is to assign categories as index headings and tags as their dependents, since presumably categories are more formally taxonomic and, more importantly, fewer. In fact you could do both, treating categories as tags and also giving them higher status as above. You would just need to put a logical test in to exclude a category from appearing as its own parent/child!
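One way the heading assignment might look in Python is sketched below. Note the substitution: instead of a full correlation analysis, this simplified version files each tag under the category it co-occurs with most often, and it includes the logical test that keeps a category from becoming its own child. The input shape and the name group_tags_by_category are hypothetical.

```python
from collections import Counter, defaultdict


def group_tags_by_category(posts):
    """posts: iterable of (categories, tags) pairs, one pair per post (an assumed input shape).

    Returns a mapping of category -> alphabetically sorted list of dependent tags.
    """
    co_occurrence = defaultdict(Counter)  # tag -> how often it appears under each category
    for categories, tags in posts:
        for tag in tags:
            for category in categories:
                # Logical test: a category never appears as its own parent/child.
                if tag.lower() != category.lower():
                    co_occurrence[tag][category] += 1

    headings = defaultdict(list)
    for tag, counts in co_occurrence.items():
        # Pick the most frequent category; break ties alphabetically for stable output.
        parent = min(counts, key=lambda c: (-counts[c], c))
        headings[parent].append(tag)

    return {cat: sorted(children, key=str.lower) for cat, children in sorted(headings.items())}
```

Raw co-occurrence counts favor categories that are used on many posts, so a correlation-based score, as proposed above, would likely give better groupings on a real blog.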

Obviously a tag-driven index as above wouldn’t fit in a sidebar. A useful place for it would be its own page, but you might also imagine it embedded on the 404 page. As a standalone, though, it would be a very useful node for search engine optimization, enough so that perhaps it should be called a “tagdex” instead of an index to better distinguish it.

Though useful to any blogger using tags on WordPress, a tagdex would be far more effective on a site whose tags were a genuine folksonomy rather than a taxonomy, since the tag diversity would be greater. However, folksonomy is not a feature of WordPress, unless you use Scott’s awesome WP-Folksonomy plugin (which he wrote in response to my earlier rant about taxonomies and folksonomies). If an ecosystem of WordPress-based folksonomies can be encouraged to thrive (using Scott’s plugin, or equivalent), that will be a significant step towards the Semantic Web. A tagdex represents a coherent snapshot of all the tag metadata in that site’s folksonomy (or taxonomy). As such, it is something that could be parsed and aggregated by the hypothetical Semantic Search Engine of the future.

wavatars updated

Shamus announces an update to his wavatars plugin. However, as noted earlier, the pending release of WordPress 2.5 will likely break most avatar plugins due to its built-in avatar support. I think it makes more sense to wait for the post-upgrade version of wavatars for the time being; I still would like to see a way to define avatar libraries so that instead of two plugins, I could just select from a drop-down of avatar styles (wavatars, monsters, etc).

close the barn doors

I thought I was done with this, but it seems that WordPress v2.3.3 did not fix the injection spam loophole; I was just hit by another injection spam attack on my previous post (now cleaned up). I’ve closed user registration on the blog for now, though of course you needn’t register to comment thanks to the captcha plugins I have installed. I suggest that all WP bloggers do the same and keep an eye out for injection spam by monitoring your RSS feed.

WP 2.3.3 does not close injection spam loophole

Over a month ago, I’d upgraded to WordPress v2.3.3, which addressed a security hole that was permitting spammers to “inject” spammy links directly into posts via xmlrpc.php, and thereby avoid the “nofollow” attribute that is automatically applied to links in comments (to deprive comment spammers of the PageRank mojo they seek). The spam was surrounded by “noscript” HTML tags, which meant that it was invisible in a JavaScript-enabled browser, thus hiding the links from detection and removal. However, subscribers to the blog feed can see the spam, since RSS readers do not execute JavaScript and therefore display the contents of noscript tags.
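For anyone who wants to check their own posts for this pattern, a crude Python sketch follows. It scans a post’s HTML for links wrapped in noscript tags using regular expressions, which is not a robust way to parse HTML but is enough to flag this particular kind of injection. The function name find_hidden_links is hypothetical, and how you fetch the post HTML (feed, database export) is left as an assumption.

```python
import re

# Matches the body of each noscript element, case-insensitively and across line breaks.
NOSCRIPT_BLOCK = re.compile(r"<noscript\b[^>]*>(.*?)</noscript>", re.IGNORECASE | re.DOTALL)
# Matches the quoted value of an href attribute.
HREF_VALUE = re.compile(r"href\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)


def find_hidden_links(post_html):
    """Return every link target that appears inside a noscript element of the given HTML."""
    links = []
    for block in NOSCRIPT_BLOCK.findall(post_html):
        links.extend(HREF_VALUE.findall(block))
    return links
```

Any post for which this returns a non-empty list deserves a manual look, since legitimate posts rarely hide links inside noscript.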

On my latest post at my geekblog, though, I was hit by the injection spam again. I have sent the following email to WordPress security (security @ wordpress.org):

Hello,

I have a WordPress blog at domain http://haibane.info which was upgraded to 2.3.3 as soon as the security release came out last month. I had experienced the injection spam attack detailed here:

http://wordpress.org/support/topic/151368

and upgraded to 2.3.3, but on my most recent post I have seen the same spam attack occur. The post is here:

Google 42

and I have already removed the injection spam, but am reprinting it below :

<noscript><a href="http://www.casinomejor. es/casino-online- basico.html">casino online</a> mirar sus oponentes hábitos.</noscript>

<noscript>Il <a href="http://www.qualitapoker .com/neteller-game-poker.html">http://www.qualitapoker .com/neteller-game- poker.html</a> è un gioco di carte.</noscript>

(there were two separate injections into the same post)

I am disabling user registration as a precautionary measure but it is clear that the 2.3.3 release did not solve the problem.

I recommend closing user registration on all WP blogs for the time being. Peter’s captcha plugins make user registration obsolete for commenting, anyway.

why did MT lose and WP win?

ma.tt responds to Anil Dash by pointing out that WordPress is fully open source:

WordPress is 100% open source, GPL.

All plugins in the official directory are GPL or compatible, 100% open source.

bbPress is 100% GPL.

WordPress MU is 100% open source, GPL, and if you wanted you could take it and build your own hosted platform like WordPress.com, like edublogs.org has with over 100,000 blogs.

There is more GPL stuff on the way, as well. 🙂

Could you build Typepad or Vox with Movable Type? Probably not, especially since people with more than a few blogs or posts say it grinds to a halt, as Metblogs found before they switched to WordPress.

Automattic (and other people) can provide full support for GPL software, which is the single license everything we support is under. Movable Type has 8 different licenses and the “open source” one doesn’t allow any support. The community around WordPress is amazing and most people find it more than adequate for their support needs.

Movable Type, which is Six Apart’s only Open Source product line now that they’ve dumped Livejournal, doesn’t even have a public bug tracker, even though they announced it going OS over 9 months ago!

I think that this gets to the heart of why WP is so successful. WP vs MT is almost a case study of the Cathedral vs the Bazaar. Were Six Apart to fully embrace the open source model, as WP has done, they would of course lose the revenue stream from licensing, but the absence of that stream hasn’t exactly inhibited Automattic ($29.5 million in the latest round…). Matt alludes to the MT3 debacle, which really was a betrayal of MT’s until-then loyal userbase. It came down simply to money; in an era where the best things in (computing) life are free, Six Apart seems determined to charge. And that’s been the thing holding them back. Technology alone isn’t enough; you have to address the user model. That is what MT has failed to do, and seems to continue to fail to do.

blog CMS infrastructure

Movable Type is making a play for WordPress users to “upgrade”, with Anil Dash firing a broadside at Automattic. Dash makes some good points but fails to articulate a compelling reason to switch, primarily because his basic premise is flawed: that WordPress is hard to upgrade and that its architecture is an impediment to ordinary users who seek to extend its functionality or implement their own style and design.

Probably the single biggest reason for WP’s success is the one-click install and one-click upgrade offered by Dreamhost and other web host companies. I can literally set up a WP blog for anyone in less than 3 minutes, and most of that time is post-install customization. The plugin ecosystem is far more vibrant on the WP side than on the MT side, and the proliferation of styles and themes means that the end user need only choose from a bounty of available options if they don’t want to tinker on their own. Tinkering is also very, very easy, though, since the various files can be edited directly from within the online administration pages.

Where MT should focus its poaching efforts is as a competitor to WordPress MU. Thus far, WP-MU remains complex and daunting to install, and maintaining it is not simple. However, MU is still attractive, especially because of the new BuddyPress functionality that will turn all MU users on a given install into an instant social network. To grow, MT should not try to convince end users who already have their own WP blogs; it should create a full-fledged blog ecosystem like WordPress.com and attract users to its platform there. Typepad, built on the previous iteration of MT3, is simply inadequate as a competitor to WordPress.com-hosted free blogs. By providing a new umbrella site for free blogs, MT can build the user base to the critical mass required for increased power user adoption. As things stand, I simply have no incentive to try MT4, and Anil’s PR attempt falls flat since frankly he’s attacking a straw man of WordPress rather than the reality which I deal with every day.

In a few days, I will log into my Dreamhost panel and upgrade my blogs to WordPress 2.5. WP is a moving target. MT4 needs to catch up and then stay abreast. Until it’s as easy for me to install and upgrade MT as it is WP, they aren’t even close.

WP 2.5 has built-in gravatar support

Seems that WordPress v2.5 (which will be out this month) will include support for Gravatars by default:

[Image: default avatar]

Theme Authors: Adding Gravatars to Your Theme

The function to add Gravatars to your theme is called: get_avatar. The function returns a complete tag of the Avatar.

The function get_avatar is setup as follows:

function get_avatar( $id_or_email, $size = '64', $default = '' )

* id_or_email: The author’s User ID (an integer or string) or an E-mail Address (a string)
* size: The size of the Avatar to display (max is 80).
* default: The absolute location of the default Avatar.

That’s the default avatar icon up there. Ugh. I am really not interested in gravatars, I am a fan of Monster ID and Wavatars. I hope Scott and Shamus can update their plugins to hook into the native 2.5 functionality as that would be a lot simpler. Adding a dropdown to the Admin panel to let you select between different icon sets is probably the best approach.

UPDATE: Ryan Boren says that any avatar service can be invoked, not just Gravatar:

Gravatar is the service used by default. get_avatar() is completely pluggable, however, so any service can be used. get_avatar() is built-in so that themes will have some fixed API on which they can rely, regardless of whatever avatar service is being used behind-the-scenes.

wordpress folksonomy progress

The experiment of adding Scott’s WP-Folksonomy plugin to my blog has been a success so far. My blog, haibane.info, is by no means a giant traffic draw, but it does have enough traffic that the userbase has been adding some tags of their own. I have at least one user (Scott himself?) who reliably adds tags to most posts, and there have been others drive-by tagging as well. It’s also encouraging to see that there was a thread at the WordPress support forum asking about folksonomy; I directed them to the plugin ASAP. Now, a search for the term “folksonomy” will lead people to the same tool, and thus the seeds are sown for more people to use it. Let’s hope that many more blogs, preferably far larger than mine, embrace and adopt folksonomy this year.

injection spam

I’ve upgraded to v2.3.3, which closes a security hole that was permitting spammers to “inject” spammy links directly into posts via xmlrpc.php, and thereby avoid the “nofollow” attribute that is automatically applied to links in comments (i.e., the usual mechanism to deprive comment spammers of the PageRank mojo they seek). The spam was surrounded by “noscript” HTML tags, which meant that it was invisible in a JavaScript-enabled browser, thus hiding the links from detection and removal. However, since RSS feedreaders do not interpret JavaScript, the spam was revealed, and I am grateful to Dave and to Gothmog for alerting me to the problem.

If you have a WP blog you should upgrade ASAP to the latest version. FYI to all the otaku blogs I link to on my blogroll here, I have not noticed any spam links via your feeds, though I am a bit behind on my reading. You all should upgrade asap.