A full-text search feature has been added to the Magnatune web site.
Now, on the home page, is a search field:
I really admire Amazon's search engine, where the results are organized by type of object (books separated from CDs, etc), but that's fairly hard to do with most full text search engines. Amazon wrote their own.
Also, most full text search engines use the HTML page title to display the search hit, but that's often not helpful. For example, if you search for a song name, and you get a page that says "Artist: albumname", you may wonder why there's a hit there. I think that the search engine result should display titles relevant to the type of object, e.g. "matched a song named X from the album Y by artist Z".
I also like search to be really fast. Lots of sites make you wait 10 seconds or more for results. On tile.net, a million-hits-a-day site I used to run (and sold to an Internet.com-funded dot-com company), fully 30% of the page hits were searches, and I wrote a search engine for that one (at that time, all in C, using dbm files and AOLserver). I really liked the relevant, organized results that I could provide.
So, I wrote my own search engine for Magnatune. It searches several types of data, and treats each type differently.
In order of importance, the Magnatune search engine looks at (and groups results by):
1) band name
2) album name (and band name, so a search for a band gives their albums too)
3) collection names (from http://magnatune.com/collections/) and the album names that are in each collection, so if you search for "Artemis" you'll see that they're in 4 different collections (Woman Singing Electro Pop, Chillout, etc...)
4) Song names (with the search hit linking to the album page, which displays the song names, and also linking to the artist's page). This will help people find artists and albums when all they remember is a fragment of a song name.
5) Artist bios (a search for "Artemis" shows you what other bands their musicians play in)
For an example, search for "Artemis"
and you'll get these results:
I plan on adding the documents from the "/info/" section into the search index: currently only information about our artists and their music is in there. That's going to take a bit of work, to clean up the HTML titles so that they're consistently meaningful, as I want to make sure I don't return useless search results.
I'm currently showing a maximum of 5 hits from each category, and displaying a "show all" link when not all the results are given. However, the total number of hits that were found is always given (e.g. "violin" hits on 58 songs and 21 artist bios).
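To make the grouping concrete, here is a minimal sketch in Python (not the Tcl the site actually uses). The category keys, the function name and the shape of the hit records are all invented for illustration; only the category order, the cap of 5, the "show all" flag and the per-category totals come from the description above.

```python
from collections import defaultdict

# Category order follows the priority list in this post.
# The key names themselves are hypothetical.
CATEGORY_ORDER = ["band", "album", "collection", "song", "bio"]
MAX_SHOWN = 5


def group_hits(hits, max_shown=MAX_SHOWN):
    """Group (category, title) pairs by category, in priority order.

    Each group reports the total number of hits, the first `max_shown`
    titles, and whether a "show all" link is needed.
    """
    by_category = defaultdict(list)
    for category, title in hits:
        by_category[category].append(title)

    grouped = []
    for category in CATEGORY_ORDER:
        titles = by_category.get(category, [])
        if not titles:
            continue  # skip categories with no hits at all
        grouped.append({
            "category": category,
            "total": len(titles),
            "shown": titles[:max_shown],
            "show_all": len(titles) > max_shown,
        })
    return grouped
```

The total is computed before the list is truncated, so the page can still say how many songs or bios matched even when only five of them are displayed.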
For the geeks out there who want the gory details, here they are:
1) I'm using Berkeley DB as the underlying database. They have a *great* Tcl (my programming language of choice) toolkit that almost exactly matches their C API.
2) Benchmarking: I'm getting 760 full text searches per second on a very slow Mac mini, from a for() loop just exercising the search. In the real world, web server and other overhead brings that down to 98 full text searches per second. However, that's plenty fast. I even leave noise words like "the" and "and" in, since they're handled quickly and because a search for "the kokoon" can then find artist names containing both words ("the" is not noise in an album or artist name).
3) The underlying web application server for the searching is Tclhttpd; the Magnatune web site itself runs on Apache. I may switch to AOLserver in the future, as my benchmarks showed 10X faster speeds with mixed HTML/code pages (with Tcl).
4) A separate job creates the indexes, which take about 3 minutes to build. The web searches work against a no-semaphore-lock, read-only version of the index files (Berkeley DB really recommends you avoid locking for best performance). Berkeley DB does the caching: I don't cache any results in memory, so that I can handle any size corpus.
5) A simple "inverted index" is used, i.e. "what pages is this word on? Look up the word, quickly get the pages".
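For anyone who hasn't met an inverted index before, here is a minimal sketch of the idea in Python. It is not the Magnatune code: a plain in-memory dict stands in for the Berkeley DB files, and the function names, the tokenizing rule and the document ids are all assumptions made for the example. Treating a multi-word query as "every word must match" is also just one reasonable reading of the "the kokoon" case above.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that isn't a letter or digit.
    # Noise words such as "the" and "and" are deliberately kept.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids it appears in.

    `documents` is a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's documents, then narrow with each other word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A query is then just one dictionary lookup per word plus a set intersection, which is why this kind of lookup stays fast even when common words are left in the index.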
I'm definitely open to feedback and feature requests on this thing. Because it's homemade, I've got a lot of flexibility with what I can do with it.