Recent Entries

Rss Feed

Related Entries

How to make a good database index

Monday, December 13th, 2004

Warning: The contents of this entry are extremely nerdy and are not for the faint of heart.

So, I spent a good 3-4 hours making a journal search feature. Oooooh, big deal, right? Well, it was more a proof of concept. I’m working on some advanced real-time databases at work now, and I’ve learned about some things that even though are memory-intense in their raw form, can be used effectively to greatly increase the speed of things.

One of the most popular things to do this is an index. For instance, I can take every blog, find all of the unique words in that blog and then put them in a table with which blog they appeared in. So, I started there.

1, me
1, is
1, cool
2, me
2, is
2, brent
…and so on 36000 times!!!

36000. Yes. Wow. That’s a lot of rows. Each row consists of a 11 byte journal ID (11 bytes is the default int size in MySQL) and the first 20 characters of the words. At 31 bytes a record, it’s no surprise that the table was about a megabyte big. Soon, I realized that over half of the rows in the table were words that appeared in 45 blogs or more. Those words are:

”, ‘2′, ‘3′, ‘:’, ‘a’, ‘about’, ‘actually’, ‘after’, ‘again’, ‘all’, ‘also’
, ‘am’, ‘an’, ‘and’, ‘another’, ‘anyway’, ‘are’, ‘around’, ‘as’, ‘at’, ‘back’
, ‘be’, ‘because’, ‘been’, ‘before’, ‘bethany’, ‘bit’, ‘blog’, ‘brent’, ‘but’, ‘by’, ‘can’, ‘cool’, ‘couple’, ‘day’, ‘days’, ‘did’, ‘didn’, ‘do’, ‘don’, ‘done’
, ‘down’, ‘even’, ‘few’, ‘finally’, ‘first’, ‘for’, ‘from’, ‘fun’, ‘get’, ‘getting’, ‘go’, ‘going’, ‘good’, ‘got’, ‘great’, ‘had’, ‘has’, ‘have’, ‘he’, ‘here’, ‘home’, ‘hours’, ‘how’, ‘i’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’, ‘just’
, ‘know’, ‘last’, ‘later’, ‘like’, ‘little’, ‘ll’, ‘long’, ‘lot’, ‘m’, ‘made’
, ‘me’, ‘more’, ‘morning’, ‘much’, ‘my’, ‘need’, ‘new’, ‘next’, ‘night’, ‘not’
, ‘now’, ‘of’, ‘off’, ‘oh’, ‘ok’, ‘on’, ‘one’, ‘only’, ‘or’, ‘other’
, ‘our’, ‘out’, ‘over’, ‘people’, ‘pictures’, ‘pretty’, ‘re’, ‘really’, ‘right’, ’s’, ’same’, ’say’, ’school’, ’see’, ’she’, ’should’, ’since’, ’site’, ’so’, ’some’, ’still’, ’stuff’, ‘t’, ‘take’, ‘than’, ‘that’, ‘the’, ‘their’, ‘them’, ‘then’, ‘there’, ‘they’, ‘think’, ‘this’, ‘those’, ‘though’, ‘time’, ‘to’, ‘today’, ‘tomorrow’, ‘too’, ‘took’, ‘two’, ‘until’, ‘up’, ‘us’, ‘ve’, ‘very’, ‘was’, ‘way’, ‘we’, ‘week’, ‘weekend’, ‘well’, ‘went’, ‘were’, ‘what’, ‘when’, ‘where’, ‘which’, ‘while’, ‘who’, ‘will’, ‘with’, ‘work’, ‘working’, ‘would’, ‘write’, ‘yesterday’, ‘you’

I decided to make these words BANNED words because searching for them would yield way too many results. I probably could have made the cut-off at 20 blogs or less per keyword, but I found 45 to be the point of diminishing savings. Removing these words from the index made my table size shrink to 19000 records, a savings of just under a half. Very wrigley!

The next step was to come up with a fast way to search through these records. Luckily, MySQL and every database out there supports a thing called indexing. That is, they store the records in a tree structure that makes searching through N items with B items at each level of the tree take LOG base B of N branches. For a binary tree of 19000 items, that means on average, about 15 choices as to where to go (left or right down the tree). That is much faster than walking through the entire table looking for keyword matches, which would take 19000 looks :). Plus, Mysql is doing all of the work of searching (it’s written in C, yay :thumbsup: ).

To find out how fast my searches were, I used the MySQL programmer’s most handy tool: EXPLAIN. I worked it out so that to discover N blogs that matched my keyword search, I only needed to look at N records in the table of 19000 entries. Holy shit!!! Searching the index takes no time. Here are the queries that I came up with for searching for two keywords. They can be applied to N keywords easily. Say I want to look for the words dog and friday:

ALL keywords must be present
SELECT bsq_journal.journalID, bsq_journal.entrybody FROM bsq_journal_crossref as cref
INNER JOIN bsq_journal on cref.journalID=bsq_journal.journalID

INNER JOIN bsq_journal_crossref as cref1 ON (cref.journalID=cref1.journalID)
WHERE cref.theword=’dog’ AND cref1.theword=’friday’

ANY keywords can be present
SELECT bsq_journal.journalID, bsq_journal.entrybody FROM bsq_journal_crossref as cref
INNER JOIN bsq_journal on cref.journalID=bsq_journal.journalID
WHERE cref.theword=’dog’ OR cref.theword=’friday’

So, now all I have to do is code the interfaces to the queries above in PHP. Each time someone writes in a blog or updates one, it is indexed on the fly in less than .25 seconds. Pretty cool for web programming, eh? :koolaid:

Leave a Reply