http://www.monkeyatlarge.com/projects/drillable-stacked-time-series/
Tsearch, the full text search engine in PostgreSql, is great at rapidly searching for keywords (and combinations of keywords) in large bodies of text. It does not, however, excel at matching multi-word phrases. There are some techniques to work around that to let your application leverage tsearch to find phrases.
Before I go on, I’ll credit Paul Sephton’s Understanding Full Text Search for opening my eyes to some of the possibilities to enable phrase search on top of tsearch’s existing capabilities.
Tsearch operates on tsvectors and tsqueries. Tsvectors are a bag of words like structure – a list of the unique words appearing in a piece of text, along with their positions in the text. Searches are performed constructing a tsquery, which is boolean expression combining words with AND(&), OR(|), and NOT(!) operators, then comparing the tsquery against candidate tsvectors with the @@ operator.
select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub';
will match articles where the the body contains the word meatball and the word sub. If there’s an index on to_tsvector(‘english’,articles.body), this query is a very efficient index lookup.
Single Phrase Search
Now how do we match articles with the phrase “meatball sub”, anywhere in the article’s body? Doing the naive query
select * from articles where body like '%meatball sub%'
will work, but it will be slow because the leading wildcard kills any chance of using an index on that column. What we can do to make this go fast is the following:
select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub' AND body like '%meatball sub%'
This will use the full text index to find the set of articles where the body has both words, then that (presumably) smaller set of articles can be scanned for the words together.
Multi Phrase Search
It’s simple to extend the above query to match two phrases:
select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like '%meatball sub%' AND body like '%ham sandwich%';
That query can be tightened up using postgres’s support for arrays:
select * from articles where to_tsvector('english',articles.body) @@ 'meatball & sub & ham & sandwich' AND body like ALL('{"%meatball sub%","%ham sandwich%"}')
Stepping back a bit, let’s define create a table called “concepts” to allow users of an application to store searches on lists of phrases, and let’s also allow the user to specify that all phrases must match, or just one of them.
CREATE TABLE concepts ( id serial, match_all boolean, phrases character varying[], query tsquery )
Now we can specify and execute that previous search this way:
insert into concepts(match_all,phrases,query) VALUES(TRUE,'{"%meatball sub%","%ham sandwich%"}','meatball & sub & ham & sandwich'); select articles.*, join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases)));
Where this approach really shines compared with an external text search tools is aggregate queries like counting up matching articles by date.
select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(phrases)) OR (not match_all AND body like ANY(phrases))) group by articles.date
The logic to combine lists of phrases into the appropriate query based on the desire to match any or all of the phrases is easy to write at the application layer. It’s desirable not to have to include the wildcards into the phrase array, and it’s easy to write a function to do that at runtime.
CREATE OR REPLACE FUNCTION wildcard_wrapper(list varchar[]) RETURNS varchar[] AS $$ DECLARE return_val varchar[]; BEGIN for idx in 1 .. array_upper(list, 1) loop return_val[idx] := '%' || list[idx] || '%'; end loop; return return_val; END; $$ LANGUAGE plpgsql;
With that function good to go we can make that long query just a little longer:
select count(distinct articles.id), articles.date from articles join concepts on (concepts.query @@ to_tsvector(body)) AND ((match_all AND body like ALL(wildcard_wrapper(phrases))) OR (not match_all AND body like ANY(wildcard_wrapper(phrases)))) group by articles.date
It’s straightforward to collapse most, if not all of the sql on clause into a plpgsql function call without adversely affecting the query plan – it’s important that the tsvector index be involved in the query for adequate performance.
Further Work
This approach works well for lists of phrases. To support boolean logic on phrases, one approach might be to compile the request down to a tsquery as above, along with a regular expression to winnow down the matches to those containing the phrases.
Makes sense – nice post.
Really cool post.
Respect.