Building Web site search for large sites. Part 1 – Sphinx

So, you have a large site and you want to build search functionality for it.
I’m going to test some of the open-source solutions that are out there. Of course, I will test all of them on TFM/GNU Linux.

The first one is Sphinx.
They have a nice website here: http://sphinxsearch.com/

Installing it is pretty easy: rpm -Uvh sphinx-2.0.6-1tfm.i686.rpm

Now it’s time to configure it:

I’m going to index and search a database table that has the following fields (among others):


| Field | Type | Null | Key | Default | Extra |
| article_id | bigint(20) | NO | PRI | NULL | auto_increment |
| article_title | varchar(255) | NO | | NULL | |
| article_headline | text | YES | | NULL | |
| article_content | text | YES | | NULL | |
| publish_date | datetime | NO | MUL | NULL | |

Configuring Sphinx is pretty easy: just edit /etc/sphinx.conf and add your options there.

I ended up with a config file like this:

source src1
{
type = mysql

sql_host = 127.0.0.1
sql_user = protected
sql_pass = protected
sql_db = my_site
sql_port = 3306 # optional, default is 3306

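# the first column must be the unique document ID; a multi-line value needs trailing backslashes
sql_query = \
SELECT article_id, category_id, UNIX_TIMESTAMP(publish_date) AS date_added, article_title, article_headline, article_content \
FROM articles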

sql_attr_uint = category_id
sql_attr_timestamp = date_added
sql_query_info = SELECT * FROM articles WHERE article_id=$id
}

index test1
{
source = src1
path = /var/data/test1
docinfo = extern
charset_type = sbcs

}

My database table has 25955 rows in it, so it’s time to index it.
First, I ran indexer to build the index:
bash-4.0# indexer --all
Sphinx 2.0.6-release (r3473)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx.conf'...
indexing index 'test1'...
collected 25955 docs, 84.4 MB
sorted 13.1 Mhits, 100.0% done
total 25955 docs, 84364412 bytes
total 10.252 sec, 8228890 bytes/sec, 2531.64 docs/sec
total 13 reads, 0.048 sec, 3504.2 kb/call avg, 3.7 msec/call avg
total 100 writes, 0.099 sec, 909.9 kb/call avg, 0.9 msec/call avg

According to the documentation, that should be all, and I should be able to search.
This should be done by running: search test. Unfortunately, this doesn’t work in this version. After a quick search on Google, I fired up searchd and started using the PHP API to perform searches:
php test1.php test


bash-4.0# php test.php test
Query 'test ' retrieved 214 of 214 matches in 0.000 sec.
Query stats:
'test' found 289 times in 214 documents

Matches:
1. doc_id=40682, weight=1, category_id=163, date_added=2012-11-06 17:01:00
2. doc_id=40408, weight=1, category_id=164, date_added=2012-10-25 11:24:00
3. doc_id=40354, weight=101, category_id=106, date_added=2012-10-23 12:59:00
4. doc_id=40329, weight=1, category_id=106, date_added=2012-10-22 11:00:00
5. doc_id=40269, weight=1, category_id=141, date_added=2012-10-18 19:19:00
6. doc_id=39784, weight=1, category_id=106, date_added=2012-09-29 08:24:00
7. doc_id=39719, weight=1, category_id=164, date_added=2012-09-26 19:01:00
8. doc_id=39696, weight=1, category_id=167, date_added=2012-09-25 19:25:00
9. doc_id=39651, weight=100, category_id=1, date_added=2012-09-24 10:42:55
10. doc_id=39489, weight=1, category_id=164, date_added=2012-09-16 15:21:00
11. doc_id=39473, weight=1, category_id=164, date_added=2012-09-15 19:15:00
12. doc_id=39182, weight=1, category_id=106, date_added=2012-09-03 09:43:00
13. doc_id=39089, weight=1, category_id=106, date_added=2012-08-30 08:30:00
14. doc_id=38970, weight=101, category_id=106, date_added=2012-08-23 15:47:00
15. doc_id=38946, weight=100, category_id=159, date_added=2012-08-22 14:25:00
16. doc_id=38826, weight=1, category_id=106, date_added=2012-08-17 09:03:00
17. doc_id=38794, weight=102, category_id=159, date_added=2012-08-16 08:02:00
18. doc_id=38728, weight=1, category_id=164, date_added=2012-08-14 06:10:00
19. doc_id=38587, weight=1, category_id=159, date_added=2012-08-07 14:34:56
20. doc_id=38368, weight=1, category_id=159, date_added=2012-07-28 17:00:00
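
The Sphinx source tree also ships a Python client (api/sphinxapi.py) next to the PHP one, so the same kind of query can be sketched in Python. This is a minimal sketch rather than the script used above. It assumes the result layout that bundled client returns (a dict with `total`, `total_found` and a `matches` list whose entries carry `id`, `weight` and `attrs`). The helper name `summarize` is made up for illustration. The 2.0.x client was written for Python 2, so it may need porting before it runs on a current interpreter.

```python
def summarize(client, query, index="test1"):
    """Run one query and return (header, rows), loosely mirroring test.php.

    `client` is assumed to behave like SphinxClient from the bundled
    api/sphinxapi.py: Query() returns a dict, or None on failure, and
    GetLastError() explains the failure.
    """
    result = client.Query(query, index)
    if result is None:
        raise RuntimeError("query failed: %s" % client.GetLastError())

    header = "Query '%s' retrieved %d of %d matches" % (
        query, result["total"], result["total_found"])

    rows = []
    for position, match in enumerate(result["matches"], start=1):
        # Only the document ID, the weight and the declared attributes come
        # back; the indexed text fields themselves are not part of the result.
        attrs = ", ".join("%s=%s" % (name, value)
                          for name, value in sorted(match["attrs"].items()))
        rows.append("%d. doc_id=%s, weight=%d, %s"
                    % (position, match["id"], match["weight"], attrs))
    return header, rows
```

Against a running searchd, the client would be created with `sphinxapi.SphinxClient()`, pointed at the daemon with `SetServer("127.0.0.1", 9312)` and then passed in as `summarize(client, "test")`. The host and port here are assumptions, not values taken from my setup.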

So, my tests for Sphinx ended here.
Pros:
– PHP API
– fast
Cons:
– doesn’t return the fields I need
– needs some work
– having searchd running blocks full reindexing (there are some real-time options that seem to do the trick, but I’ll test them another time)
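
A common workaround for the missing fields is to take the document IDs that Sphinx returns and fetch the full rows from MySQL in a second step. Below is a minimal sketch of that step, using the articles table from above. The helper name `build_fetch_query` and the choice of columns are illustrative assumptions, and the `%s` placeholders assume a MySQL driver that uses the format paramstyle (PyMySQL, for example).

```python
def build_fetch_query(doc_ids, columns=("article_id", "article_title", "article_headline")):
    """Build a parameterized SELECT that loads rows for Sphinx match IDs.

    ORDER BY FIELD(...) keeps the rows in the order Sphinx ranked them,
    since a plain IN (...) gives no guarantee about row order.
    """
    ids = [int(doc_id) for doc_id in doc_ids]  # reject anything non-numeric
    if not ids:
        raise ValueError("no document IDs to fetch")

    placeholders = ", ".join(["%s"] * len(ids))
    sql = (
        "SELECT " + ", ".join(columns) + " FROM articles "
        "WHERE article_id IN (" + placeholders + ") "
        "ORDER BY FIELD(article_id, " + placeholders + ")"
    )
    # The IDs are bound twice: once for IN (...), once for FIELD(...).
    return sql, ids + ids
```

The returned pair is meant to be handed to the driver as `cursor.execute(sql, params)`, so the IDs are never pasted into the SQL string directly.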
