Data Store Shards in Google App Engine?

The picture on the right shows the unique id numbers of some of the shortened URLs in the ur.ly database, sorted by created date. Those unique ids are automatically generated by Google App Engine's data store. Surprised by what you see?

Most databases make it easy to generate auto-incrementing id numbers as keys for database access. At first glance, it's surprising that the ids generated by GAE are not in order. They aren't random either; there are some interesting patterns. On reflection, this shouldn't surprise us - what we're seeing is one way that the data store makes scaling possible.

It looks like the data store is partitioned or sharded so that different groups or sets of items live in different databases. Ids 28-33 live in one place while 14-18 live in another. Each shard is responsible for generating its own unique ids, and the range of ids a given shard can generate is somehow limited so ids from different servers won't collide (see the auto_increment_increment and auto_increment_offset variables in MySQL for something similar). I also assume that ids are distributed (think memcached's consistent hashing) so finding the correct shard for an id is quick. If ur.ly ever gets really busy, it'll be interesting to look for evidence of a larger number of shards, perhaps dynamically allocated in response to need.
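To make the guess above concrete, here is a minimal Python sketch of one way non-colliding ids could be handed out: each shard reserves a block of ids and issues them locally. This is only an illustration of the idea, not how GAE's data store actually works; the names BlockAllocator and Shard and the block size of 5 are all made up for the example.

```python
class BlockAllocator:
    """Hypothetical coordinator that hands out non-overlapping id blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._next_start = 1

    def reserve(self):
        # Each call returns a fresh range that no other shard will receive.
        start = self._next_start
        self._next_start += self.block_size
        return range(start, start + self.block_size)


class Shard:
    """Issues ids from its own block and reserves a new block when empty."""

    def __init__(self, name, allocator):
        self.name = name
        self._allocator = allocator
        self._block = iter(())  # no block reserved until the first write

    def next_id(self):
        new_id = next(self._block, None)
        if new_id is None:
            self._block = iter(self._allocator.reserve())
            new_id = next(self._block)
        return new_id


allocator = BlockAllocator(block_size=5)
shards = [Shard("a", allocator), Shard("b", allocator)]

# Writes arrive in time order but alternate between the two shards.
ids_by_creation = [shards[i % 2].next_id() for i in range(8)]
print(ids_by_creation)
```

Sorted by creation time, the ids come out interleaved rather than increasing, yet they never collide. With bursts of writes landing on the same shard, the runs of consecutive ids would look more like the 28-33 and 14-18 groups in the picture.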

ur.ly - Dang Short Urls Powered by Google App Engine

Google App Engine lets you build web apps that run on Google's infrastructure. What's the best way to get familiar with a new framework like this? Build something -- preferably something simple and useful, and that's what I set out to do.

I've played around in the URL-shortening space pioneered by TinyURL and understand the problem domain well. I've built a couple of these in PHP and Ruby (merb) and registered the ur.ly domain in September 2007 but never released anything. When GAE launched, memories of Steve Rubel's Could a Billion TinyURL's Go 404? post (hat tip to Dave Winer) echoed in my brain. Why not build a URL-shortener on GAE and let Google worry about scalability?

So I built ur.ly, a simple, super-scalable (thanks GAE), fast (yeah memcached) URL-shortening application. There's an API to make it easy to use and it's open source so you can play with the code or run your own.

Feedback anyone?

2008 Pasadena Marathon Course Posted

The Pasadena Marathon posted its course map today. After running the Long Beach, Big Sur and Los Angeles marathons, there is something really cool about running a course that is never more than 5 or 6 miles away from my house.

Volunteer Pumpkin

I'm a composting geek. We compost all of our non-meat table scraps, coffee grounds, and some yard waste and then dig it into our garden soil each year as a natural fertilizer. It's always fun to see what "volunteers" pop up from that compost. After Halloween we composted our jack-o-lanterns, and this year our favorite volunteer is a pumpkin vine, complete with one little pumpkin.

I wish coffee plants would volunteer. Hmmm.

Triple Buttons with Firefox 3.0 Beta 5


Just upgraded to Firefox 3.0 beta 5... do you think I have enough back/forward buttons? Revert!

Find Similar Links on LinkRiver

I've been noodling on this feature for a while -- how can I find "more links like this one" in LinkRiver? Putting on my machine learning hat, I contemplated link-to-link co-visitation schemes, semantic indexing, various clustering algorithms... but all approaches were too data-heavy, at least for now. There had to be an easier way...

LinkRiver has allowed full-text searching of links (by river and stream) for a while now. The link title and host (e.g. www.techcrunch.com) are both a part of the index. Could the full-text search engine help out here? Let's try it out.

One popular link today was a story on news.com about the possibility of eBay selling Skype to Google. What if I send the link host and title to the search engine? Are the results relevant?

Try it yourself: Click to see similar links

In most cases this works really well...

Twobile-Twitter for Windows Mobile
FriendFeed Has Search

But sometimes, the results are not so great:

TechMeme Leaderboard: Six Months In

Options - one thing I may do, depending on feedback, is stop including the link host as a part of the search query. Play around (click similar, then re-run the search after removing the link host from the search box) and let me know what you think.
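For the curious, the idea of reusing the search index for similarity can be sketched in a few lines of Python. This is a toy stand-in, not LinkRiver's actual search engine: it scores candidates by how many words they share with the query, and the function names, the include_host flag and the sample links are all invented for illustration.

```python
import re


def index_terms(link, include_host=True):
    """Lower-cased title words, plus the host as a single term if requested."""
    terms = set(re.findall(r"[a-z0-9]+", link["title"].lower()))
    if include_host:
        terms.add(link["host"].lower())
    return terms


def similar_links(link, candidates, include_host=True):
    """Rank candidates by the number of terms they share with the query."""
    query = index_terms(link, include_host)
    scored = []
    for other in candidates:
        if other is link:
            continue
        # Title and host are both part of the index, as described above.
        score = len(query & index_terms(other))
        if score:
            scored.append((score, other["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Made-up sample data, only to exercise the sketch.
links = [
    {"title": "eBay may sell Skype to Google", "host": "news.com"},
    {"title": "Google talks with eBay about Skype", "host": "www.techcrunch.com"},
    {"title": "Site redesign notes", "host": "news.com"},
    {"title": "Marathon training tips", "host": "example.org"},
]

with_host = similar_links(links[0], links, include_host=True)
without_host = similar_links(links[0], links, include_host=False)
print(with_host)
print(without_host)
```

In this toy data, including the host pulls in an unrelated story from the same site, while dropping it leaves only the title match. That is the trade-off I'm weighing.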

What Powers the Aggregators?

All lifestream and link-sharing aggregators use an RSS/Atom parser to help power their service.

I built LinkRiver using Ruby on Rails and would have preferred to use a parser built in Ruby. However, Mark Pilgrim's Universal Feed Parser is rock-solid and very well tested, so I use UFP for feed parsing. LinkRiver controls UFP via a memcached-based message queue. Some UFP-Python glue posts new shared links via a simple HTTP API.
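Roughly, the glue has the shape sketched below. This is not the real LinkRiver code: the queue is a plain deque standing in for the memcached-based queue, post_link is a hypothetical stand-in for the HTTP API call, and fake_parse stands in for the parser so the sketch runs without a network. With UFP installed, feedparser.parse could be passed as the parse argument, since its result exposes an entries list whose items carry link and title.

```python
from collections import deque


def process_queue(queue, parse, post_link, seen):
    """Drain feed URLs from the queue and post links not seen before."""
    posted = 0
    while queue:
        feed_url = queue.popleft()
        parsed = parse(feed_url)
        for entry in parsed["entries"]:
            link = entry.get("link")
            if not link or link in seen:
                continue  # skip entries without a link and repeats
            seen.add(link)
            post_link({
                "feed": feed_url,
                "url": link,
                "title": entry.get("title", ""),
            })
            posted += 1
    return posted


def fake_parse(feed_url):
    """Stand-in parser returning the same three entries for any feed."""
    return {"entries": [
        {"link": "http://example.com/a", "title": "First"},
        {"link": "http://example.com/b", "title": "Second"},
        {"title": "Entry without a link"},
    ]}


outbox = []  # collects what would be sent to the HTTP API
seen = set()
queue = deque(["http://example.com/feed", "http://example.com/feed"])
count = process_queue(queue, fake_parse, outbox.append, seen)
print(count, [item["url"] for item in outbox])
```

The same feed is queued twice on purpose: the second pass posts nothing, because both links are already in the seen set.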

A while back RSSMeme's Benjamin Golub tweeted that he also uses UFP, so I thought I'd ask around to see what some of the other aggregators are using.

Bret Taylor from FriendFeed told me they use UFP as a fall-back but rely primarily on a custom parser that uses much less memory.

ReadBurner developer Alexander Marktl replied to say that he uses MagicParser, a commercial parser for PHP.

After testing a bunch of options and finding none that worked, Tumblr's Marco Arment wrote his own parser for PHP "with regular DOM functions".

Google's Chris Wetherell has blogged about the history of Google Reader and mentioned that UFP was involved, at least in the early stages.

Any others?

Updated: See comments -- Gabe Rivera from Techmeme built his own in Perl.

Favorite Firefox Extension - Tabs Open Relative

One feature I've missed since abandoning NetCaptor for Firefox a few years ago was the ability to open new tabs next to the current tab instead of at the end of my tab stack. I spent an hour white-boarding this with Firefox dev Ben Goodger, and I gave up trying to do this myself after finding the Firefox tab-ordering code to be a spaghetti-mess of independent arrays.

I don't remember how I stumbled on Tabs Open Relative... but all is well in my tabbed browsing world again -- as if some annoying background music is gone. Ahhhhh.

Why is this feature so important? Context. When you open new tabs, they tend to be related to the current tab. If I'm searching Google for digital camera reviews and open the top five links as separate tabs, those tabs should be close to the "starter" tab, not lost at the end.

Save Links for Later on LinkRiver

This happens to me all the time. I'm in super-productive mode and I run across an article or blog post that is interesting but entirely outside the context of what I'm doing. I need to stay on task - no tangents allowed.

I've tried a few things... a 'To Read' folder in my browser's bookmarks or tagging links 'toread' on del.icio.us, but these methods were either too disruptive or difficult to manage.

I tried out InstaPaper the other day and loved it - one-click and a link is saved for later. It worked great, but it didn't help me if I found something to 'later' when in Google Reader. Still too much friction.

Inspired by InstaPaper, I added a 'Save for Later' feature to LinkRiver.

Links you mark 'Later' show up under your 'Later' tab in LinkRiver. These links are private and not shared with your followers unless you choose that explicitly.

Bookmarklets

Three Ways to Save Links for Later

There are three ways to add links to your 'Later' stream.

First - there is a new one-click bookmarklet you can add to your browser toolbar. One-click -- boom -- you've saved the link for later without leaving the page you are on. Look for these in your sidebar after logging in to LR.

Later Link

Second - links inside LinkRiver now have a 'later' option in addition to the 'share' option that's been there for a while. Again - one click and it's saved for later.

Third - this one is probably the most powerful of them all - you can import an external feed into your 'Later' stream.

I set up LinkRiver to import my Google Reader shared items into my main stream and my starred items into my Later stream. This works beautifully, especially when using Google Reader on my iPhone. Just click 'share' in GR to share on LinkRiver, or 'star' to save it for later. Sweet GTD goodness!

Nice to be Noticed

Google Reader creator Chris Wetherell is writing a great series on the birth of Google Reader. In the latest, Chris mentions LinkRiver and others when he talks about "services aggregating shared items".

Gotta say I'm honored. That's kind of like UCLA basketball coach Ben Howland mentioning me, a church-league pee-wee basketball coach, in a post-game news conference.
