Thursday, December 25, 2008

PostgreSQL: An ultimate strategy for full text search

Full text search is one of the most powerful features in PostgreSQL. In this blog entry, we'll start with a typical text search problem in its simplest form, and see how to implement its query under PostgreSQL. Then we'll evolve the problem bit by bit and see how we can modify our implementation accordingly, until we come up with an ultimate strategy for the text search problem in its most generic form.

In its simplest form, our text search problem would be selecting all users, whose name matches "Andy Williams":

WHERE to_tsvector( @@ to_tsquery('andy | williams')

That was pretty easy. Now, what if we're searching for a query in more than one column, from one or more tables. e.g. Select users whose, or profiles.full_name matches "Andy Williams". In such a case we'll have to use the concatenation operator ('|') to concatenate all the columns we'll search in. Notice using 'coalesce' to pick null values, because concatenating anything with null returns null:

SELECT * FROM users LEFT JOIN profiles
ON = profiles.user_id
WHERE to_tsvector(coalesce(, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')

Both of the previous queries will return some records with no specific order, which is not the case in a typical text search problem. A common requirement would be ordering the results by relevance. i.e. How relevant is a record to the given search query. PostgreSQL offers a function, ts_rank_cd, which evaluates how relevant a vector is relevant to a query.

SELECT users.*, profiles.*, ts_rank_cd(to_tsvector(coalesce(, '') | coalesce(profiles.full_name, '')), to_tsquery('andy | williams')) as rank
FROM users LEFT JOIN profiles
ON = profiles.user_id
WHERE to_tsvector(coalesce(, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')

Now, what if we are looking for "Andy Williams" in both users and their dependants (one to many relation), where joining will yield repeated records? In the simple case, where no relevance order is needed, we'll just eliminate repititions using DISTINCT. It's that simple because we don't care how many times a single record occured in the results, as we're not interested in its relevance:

ON = dependants.user_id
WHERE to_tsvector(coalesce(, '') | coalesce(, '')) @@ to_tsquery('andy | williams')

Now let's look at the more complex, more realistic, most generic case. We need to return users records whose names or dependats' names match "Andy Williams", returning each users result only once, with the most relevant record first. Simple distinction in this case is semantically wrong because a user with two "andy williams" dependants is more relevant than a user with only one.

An approach to solve this problem is to select a new aggregated column with the concatenation of dependants' names, including it in the search, and grouping results by user. The problem about it is that there is no pre-defined string-concatenation aggregate function. PostgreSQL offers a way to define custom aggregate functions, so we can define our own concatenation function. However, we'll also face the group-by limitation of PostgreSQL. that is, all selected columns must appear in the group-by clause, which pretty much complicates the query.

This brings us to the ultimate strategy, which serves all the mentioned requirements, and eliminates the mentioned problems. The solution suggests that all searchable columns are automatically aggregated in a new system-maintained column, by a database trigger or an application-level callback. Then our task would be as simple as searching that extra column for our query. In our example, An application callback could be used to watch over changes done to user (insert, update) and dependants (insert, update, delete). The callback's task is to re-calculate the 'textsearch' column that contains the concatenation of the user name and the names of all his dependants. Then the text-search query will be as simple as follows:

SELECT users.*, ts_rank_cd(to_tsvector(users.textsearch), to_tsquery('andy | williams')) as rank
FROM users
WHERE to_tsvector(users.textsearch) @@ to_tsquery('andy | williams')

Notice that an aggregation overhead is added to the update operations, while the search operations are now releaved from any joins. This is what makes this approach preferable, because select operations are much more frequent than update operations in typical applications. This results in an overall performance boost.

No comments: