It all started with a simple post from Jason Haley while I was on vacation with my family at Emerald Isle.  The post was about creating a link blog aggregator to show the day's top picks and diamonds in the rough without having to sift through everyone's mention of X.  Like I said, I was on vacation and trying really hard to stay disconnected.  I managed to slip in a few minutes here and there while my youngest daughter took her naps throughout the day, and played a bit with Hpricot trying to parse some feeds.

I put things on hold, finished vacation and got caught up with a couple of other things.  Finally I came back and decided to crank out a 1.0 version (no betas here).  The result is hushchamber.com.

It's another Ruby on Rails project for me, with some interesting challenges.  I'm using backgroundrb to do all my asynchronous and background processing (mainly fetching new feeds from RSS and parsing those feeds for new links).  I'm using god to monitor backgroundrb and restart it if one of the workers goes down.  That normally happens for me on a timeout with open-uri (I think; I can't catch the error in a rescue for some reason).
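One possible explanation for the rescue that never fires (an assumption, since the worker code isn't shown here): in Ruby 1.8, Timeout::Error was a subclass of Interrupt rather than StandardError, so a bare rescue would not catch it.  Naming the class explicitly works on old and new Rubies alike.  A minimal sketch, where fetch_with_timeout is a hypothetical helper and the block stands in for the open-uri call:

```ruby
require 'timeout'

# Hypothetical helper: runs the given block (for example an open-uri fetch)
# and returns nil instead of raising when it takes longer than `seconds`.
def fetch_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error => e
  # Named explicitly: a bare `rescue` only catches StandardError.
  warn "fetch timed out: #{e.message}"
  nil
end
```

With something like this, a slow feed would make the worker skip that fetch rather than go down and wait for god to restart it.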

So what were the challenges?

Parsing was a big one.  Not that it's difficult with Hpricot, but some blog clients produce better markup than others.  I'm trying to get only link mentions and not links to other stuff (like the link blogger's own blog, or job search sites, or events that are repeated for up to two weeks).  In some instances I had to strip out some content before I could successfully parse it.
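The filtering step can be sketched independently of the parser.  This is only an illustration under assumptions: link_mentions and the skipped_hosts list are hypothetical names, and the hrefs are taken to be whatever the parser pulled out of the anchor tags.

```ruby
require 'uri'

# Returns only the hrefs that look like link mentions: absolute http(s) URLs
# that point neither at the linkblogger's own host nor at a skipped host.
# `skipped_hosts` is a hypothetical blocklist (job sites, event listings).
def link_mentions(hrefs, own_host, skipped_hosts = [])
  hrefs.select do |href|
    uri = begin
      URI.parse(href)
    rescue URI::InvalidURIError
      nil # malformed markup: ignore the link
    end
    # Relative links, mailto: links and unparsable hrefs are dropped here.
    next false unless uri.is_a?(URI::HTTP) && uri.host
    host = uri.host.downcase
    host != own_host.downcase && !skipped_hosts.include?(host)
  end
end
```

Filtering by host is the simplest rule that covers the cases above; repeated events would still need some kind of seen-before check on top of it.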

Links, surprisingly, were another one.  Depending on how the linkblogger linked to the content (sometimes through a feedburner feed, sometimes directly to the resource on the web, sometimes through a URL shortening service), I'd end up with different URLs.  The challenge here was getting all three links to the same resource to behave like the same resource.  I ended up having to "resolve" each URL to its original web form and use that as the unique identifier for the link.
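One way to do that kind of resolving is to follow HTTP redirects until a URL stops redirecting.  This is a sketch under that assumption, not necessarily how hushchamber does it; head_location and resolve_url are hypothetical names, and the lookup can be swapped out with a block so the logic can be exercised without touching the network.

```ruby
require 'net/http'
require 'uri'

# Default lookup: issues a HEAD request and returns the Location header
# when the response is a redirect, or nil otherwise.
def head_location(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
end

# Follows redirects (feed proxies, URL shorteners) until a URL answers
# without redirecting, and returns that final URL.  Gives up after `limit`
# hops and returns the last URL seen.
def resolve_url(url, limit = 5, &lookup)
  lookup ||= method(:head_location)
  limit.times do
    target = lookup.call(url)
    return url if target.nil?
    url = URI.join(url, target).to_s # Location may be relative
  end
  url
end
```

The hop limit keeps a redirect loop from hanging a worker.  Some servers answer HEAD differently from GET, so a real version might need to fall back to GET.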

Titles were another issue, titles for links that is.  I started by using the inner HTML of the anchor tag as the text/title for the link itself.  So in the following example I'd grab "Jeff's Blog" and set that as the text to use for the link.

  <a href="http://thequeue.net/blog">Jeff's Blog</a>

The problem is, some titles have <font> and <strong> tags in them, and I really didn't want that.  Other times it's just the link blogger's description of the resource, which may or may not make sense out of the context of the blog entry itself.

I played with the idea of going out to the resource directly and using Hpricot again to grab the <title> element on the page.  I actually did that, but 90% of the sites I tried it on have the real title plus some other information about the site as well.  It's really bad when it's a Code Project link: the title++ portion is something like an extra dozen words.  I ended up going back to just using the linkblogger's inner HTML, and it's first in wins.
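Both halves of that decision fit in a few lines.  A minimal sketch, with clean_title and record_title as hypothetical names and a plain Hash keyed by resolved URL standing in for wherever the titles are really stored:

```ruby
# Reduces an anchor's inner HTML to plain text: drops tags such as <font>
# and <strong>, then collapses runs of whitespace.
def clean_title(inner_html)
  inner_html.gsub(/<[^>]*>/, '').gsub(/\s+/, ' ').strip
end

# First in wins: stores a title only if the URL has none yet, and returns
# whichever title is on record afterwards.
def record_title(titles, url, inner_html)
  titles[url] ||= clean_title(inner_html)
end

titles = {}
record_title(titles, "http://thequeue.net/blog", "<strong>Jeff's</strong> Blog")
record_title(titles, "http://thequeue.net/blog", "A post worth reading")
# titles still maps the URL to the first title recorded
```

A regex is a crude way to strip tags, but anchor text is short enough that it tends to hold up.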

Top links and diamonds in the rough

Surprisingly, there weren't as many links mentioned by a bunch of linkbloggers at once as I thought there would be.  Even on jQuery Monday there was a lot of talk about jQuery and Visual Studio and ASP.NET MVC, but the links all pointed to different posts.

Which brings up another interesting suggestion, or thought, to display links grouped by tag.  I've played around with going out to del.icio.us (this is still faster than their new URL for me to type) and grabbing the top tag for a given link and grouping by that tag.  We're left at the mercy of the initial taggers of a resource but most of the time they're close.  Though it doesn't help for a completely unrelated bunch of links.
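The grouping itself is the easy part once each link has a tag.  A small sketch, assuming a hypothetical top_tags Hash of URL to tag that has already been filled in by whatever lookup is used; links with no known tag land in an "untagged" group:

```ruby
# Groups URLs by their top tag.  `top_tags` is a hypothetical Hash of
# URL => tag; URLs missing from it are grouped under "untagged".
def group_by_top_tag(urls, top_tags)
  urls.group_by { |url| top_tags.fetch(url, 'untagged') }
end
```

The "untagged" bucket is where that completely unrelated bunch of links would end up.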

So, check it out: http://hushchamber.com

There's also an RSS feed for it that you can subscribe to and get the last 30 days or so worth of links.  If old posts come in, you do get updates in the RSS.  Let me know what you think by using the feedback button on the site.