A full-text search feature has been added to the Magnatune web site.
Now, on the home page, is a search field:
I really admire Amazon's search engine, where the results are organized by type of object (books separated from CDs, etc), but that's fairly hard to do with most full text search engines. Amazon wrote their own.
Also, most full-text search engines use the HTML page title to display the search hit, but that's often not helpful. For example, if you search for a song name and you get a page that says "Artist: albumname", you may wonder why there's a hit there. I think that the search engine result should display titles relevant to the type of object, i.e. "matched a song named X from the album Y by artist Z".
I also like search to be really fast. Lots of sites make you wait 10 seconds or more for results. On tile.net, a million-hits-a-day site I used to run (and sold to an Internet.com-funded dot-com company), fully 30% of the page hits were searches, and I wrote a search engine for that one (at that time, all in C, using dbm files and AOLserver). I really liked the relevant, organized results that I could provide.
So, I wrote my own search engine for Magnatune. It searches several types of data, and thinks about them differently.
In order of importance, the Magnatune search engine looks at (and groups results by):
1) band name
2) album name (and band name, so a search for a band gives their albums too)
3) collection names (from http://magnatune.com/collections/) and the album names that are in each collection, so if you search for "Artemis" you'll see that they're in 4 different collections (Woman Singing Electro Pop, Chillout, etc...)
4) Song names (with the search hit linking to the album page, which displays the song names, and also linking to the artist's page). This will help people find artists and albums when they only remember a fragment of a song name.
5) Artist bios (a search for "Artemis" shows you what other bands their musicians play in)
For an example, search for "Artemis"
http://my.magnatune.com/search?w=artemis
and you'll get these results:
I plan on adding the documents from the "/info/" section into the search index: currently only information about our artists and their music is in there. That's going to take a bit of work, to clean up the HTML titles so that they're consistently meaningful, as I want to make sure I don't return useless search results.
I'm currently showing a maximum of 5 hits from each category, and displaying a "show all" link when not all the results are given. However, the total number of hits that were found is given (ie, "violin" hits on 58 songs and 21 artist bios)
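For readers who want to see the grouping idea in code, here is a minimal sketch in Python (the real engine is written in Tcl, and this is not its code). The category names, the `group_hits` function and the shape of the hit tuples are all hypothetical, made up for illustration. It only shows the behavior described above: results grouped by category in order of importance, at most 5 shown per category, the full count still reported, and a "show all" flag when some hits are hidden.

```python
# Hypothetical category keys, in the "order of importance" listed in the post.
CATEGORIES = ["band", "album", "collection", "song", "bio"]
MAX_HITS = 5  # the post shows at most 5 hits per category


def group_hits(hits, max_hits=MAX_HITS):
    """Group (category, title) pairs by category, capping what is displayed.

    Returns a list of dicts, one per non-empty category, in priority order.
    """
    grouped = {cat: [] for cat in CATEGORIES}
    for category, title in hits:
        grouped[category].append(title)

    results = []
    for cat in CATEGORIES:
        titles = grouped[cat]
        if not titles:
            continue  # categories with no hits are not displayed
        results.append({
            "category": cat,
            "total": len(titles),                # full hit count, always reported
            "shown": titles[:max_hits],          # only the first few are displayed
            "show_all": len(titles) > max_hits,  # whether a "show all" link is needed
        })
    return results
```

With 58 song hits and 21 bio hits for "violin", this would report totals of 58 and 21 while listing only 5 titles under each heading.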
For the geeks out there who want the gory details, here they are:
1) I'm using Berkeley DB as the underlying database. They have a *great* Tcl (my programming language of choice) toolkit that almost exactly matches their C API.
2) Benchmarking: I'm getting 760 full text searches per second, on a very slow Mac mini, from a for() loop just exercising the search. In the real world, web server and other overhead brings that down to 98 full text searches per second. However, that's plenty fast. I even leave noise words like "the" and "and" in, since they're handled quickly and because a search for "the kokoon" can then find artist names containing both words ("the" is not noise in an album or artist name).
3) The underlying web application server for the searching is Tclhttpd (the Magnatune web site itself runs on Apache). I may switch to AOLserver in the future, as my benchmarks showed 10X faster speeds with mixed HTML/code pages (with Tcl).
4) A separate job creates the indexes, which takes about 3 minutes. The web searches work against a no-semaphore-lock, read-only version of the index files (Berkeley DB really recommends you avoid locking for best performance). Berkeley DB does the caching: I don't cache any results in memory, so that I can handle any size corpus.
5) a simple "inverted index" is used, ie "what pages is this word on? Look up the word, quickly get the pages"
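To make the "inverted index" idea in point 5 concrete, here is a minimal sketch in Python rather than Tcl. It is an illustration under stated assumptions, not the actual Magnatune code: the index lives in an in-memory dict instead of Berkeley DB files, the function names and page ids are hypothetical, and a multi-word query is assumed to require every word to be present (the post doesn't say how words are combined). Noise words like "the" are kept, as described in point 2.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that isn't a letter or digit.
    # Noise words such as "the" and "and" are deliberately kept.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build the inverted index: word -> set of page ids containing it.

    `pages` maps a page id to its text. A dict stands in for the on-disk
    database used by the real engine (assumption for this sketch).
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Assumption: AND semantics, i.e. intersect the page sets of all words.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Each lookup is one dictionary access per query word plus a set intersection, which is why this kind of index answers "what pages is this word on?" quickly, no matter how many pages were indexed.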
I'm definitely open to feedback and feature requests on this thing. Because it's homemade, I've got a lot of flexibility with what I can do with it.
That's pretty rockin' awesome John.
Posted by: Ryan Sawhill | March 01, 2006 at 04:21 AM
Man, I've been wanting a search engine *forever*. Much appreciated.
Posted by: Topher | March 01, 2006 at 06:08 AM
Kemper Crabb is one of my favorite artists on your site, and I've never been able to find him in a genre. I always click to Metal, find Atomic Opera, and click his name to go to his page.
So I was very happy that now I can search to find him, but I still can't. :(
Searching for "Kemper", "crabb" or "Kemper crabb" all fail.
Posted by: Topher | March 01, 2006 at 06:14 AM
"Kemper Crabb is one of my favorite artists on your site, and I've never been able to find him in a genre."
His solo albums used to be on magnatune, but we removed them at his request (for personal reasons, not due to any problem he had with Magnatune). The only thing we have from Kemper now is as a member of Atomic Opera, as you already noticed.
Posted by: John Buckman | March 01, 2006 at 06:52 AM
Ah, that's very sad, I've suggested those albums to quite a few people who bought them. :( Ah well.
Posted by: Topher | March 01, 2006 at 10:45 AM
The /info/ section is now covered by the magnatune search engine. This has the added benefit of pointing out our numerous press mentions in many search results
ie, a search for 'licensing' now returns useful results:
http://my.magnatune.com/search?w=licensing
There are a few duplicate page titles that we need to fix; you may see them in the results.
Posted by: John Buckman | March 02, 2006 at 08:28 AM
AOLServer is OK. I had it running in test a few years ago when comparing but our languages had better apache links.
Downside is exactly that - you will be restricting yourself to the kinds of addins available by switching off apache IMO.
Extra performance is good, but is the site suffering at the moment? From the server POV or the user one? If not, then are you doing like me and looking at a technical solution to a nonexistent problem?
Posted by: Matthew Bowden | March 02, 2006 at 08:30 AM
"Extra performance is good, but is the site suffering at the moment? From the server POV or the user one? If not, then are you doing like me and looking at a technical solution to a nonexistent problem?"
Web site speed is nice now. It's just that when I roll out a brand new feature like search, one that uses a different infrastructure (Tcl and Berkeley DB), I like to review my options. I've used Tclhttpd a lot in the past (it powers the musician's report pages), so I decided to just stick with that. I'll just have to see what happens if we get slashdotted.
However, if I'd written the full text search in MySQL, it would be slooooowww... I've done that before, and it's about 1/100th the performance of Berkeley DB, which works out to something like 2-3 seconds per search when not busy.
Posted by: John Buckman | March 02, 2006 at 08:38 AM
Yes, I can see running the search code within MySQL would kill performance, so Berkeley DB is sensible. 98 per second even with load is nice and fast.
btw. Recommend Racine, the other side of Harrods from Harvey Nichols, for eating (saw Jan's blog). Good French bistro food at reasonable prices (for London). It's about 5 mins walk from Harrods on the same side of the A4 Cromwell Road.
Posted by: Matthew Bowden | March 02, 2006 at 10:18 AM
DBSight is in Java, which uses Lucene to index any JDBC supported databases.
The demo is here, http://www.dbsight.com
Posted by: Chris Lu | March 10, 2006 at 01:49 PM