Thursday, December 25, 2008

PostgreSQL: An ultimate strategy for full text search

Full text search is one of the most powerful features in PostgreSQL. In this blog entry, we'll start with a typical text search problem in its simplest form, and see how to implement its query under PostgreSQL. Then we'll evolve the problem bit by bit and see how we can modify our implementation accordingly, until we come up with an ultimate strategy for the text search problem in its most generic form.

In its simplest form, our text search problem would be selecting all users, whose name matches "Andy Williams":

SELECT * FROM users
WHERE to_tsvector(users.name) @@ to_tsquery('andy | williams')


That was pretty easy. Now, what if we're searching for a query in more than one column, from one or more tables. e.g. Select users whose users.name, or profiles.full_name matches "Andy Williams". In such a case we'll have to use the concatenation operator ('|') to concatenate all the columns we'll search in. Notice using 'coalesce' to pick null values, because concatenating anything with null returns null:

SELECT * FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')


Both of the previous queries will return some records with no specific order, which is not the case in a typical text search problem. A common requirement would be ordering the results by relevance. i.e. How relevant is a record to the given search query. PostgreSQL offers a function, ts_rank_cd, which evaluates how relevant a vector is relevant to a query.

SELECT users.*, profiles.*, ts_rank_cd(to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')), to_tsquery('andy | williams')) as rank
FROM users LEFT JOIN profiles
ON users.id = profiles.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(profiles.full_name, '')) @@ to_tsquery('andy | williams')
ORDER BY rank DESC


Now, what if we are looking for "Andy Williams" in both users and their dependants (one to many relation), where joining will yield repeated records? In the simple case, where no relevance order is needed, we'll just eliminate repititions using DISTINCT. It's that simple because we don't care how many times a single record occured in the results, as we're not interested in its relevance:

SELECT DISTINCT * FROM users LEFT JOIN dependants
ON users.id = dependants.user_id
WHERE to_tsvector(coalesce(users.name, '') | coalesce(dependandts.name, '')) @@ to_tsquery('andy | williams')


Now let's look at the more complex, more realistic, most generic case. We need to return users records whose names or dependats' names match "Andy Williams", returning each users result only once, with the most relevant record first. Simple distinction in this case is semantically wrong because a user with two "andy williams" dependants is more relevant than a user with only one.

An approach to solve this problem is to select a new aggregated column with the concatenation of dependants' names, including it in the search, and grouping results by user. The problem about it is that there is no pre-defined string-concatenation aggregate function. PostgreSQL offers a way to define custom aggregate functions, so we can define our own concatenation function. However, we'll also face the group-by limitation of PostgreSQL. that is, all selected columns must appear in the group-by clause, which pretty much complicates the query.

This brings us to the ultimate strategy, which serves all the mentioned requirements, and eliminates the mentioned problems. The solution suggests that all searchable columns are automatically aggregated in a new system-maintained column, by a database trigger or an application-level callback. Then our task would be as simple as searching that extra column for our query. In our example, An application callback could be used to watch over changes done to user (insert, update) and dependants (insert, update, delete). The callback's task is to re-calculate the 'textsearch' column that contains the concatenation of the user name and the names of all his dependants. Then the text-search query will be as simple as follows:

SELECT users.*, ts_rank_cd(to_tsvector(users.textsearch), to_tsquery('andy | williams')) as rank
FROM users
WHERE to_tsvector(users.textsearch) @@ to_tsquery('andy | williams')
ORDER BY rank DESC

Notice that an aggregation overhead is added to the update operations, while the search operations are now releaved from any joins. This is what makes this approach preferable, because select operations are much more frequent than update operations in typical applications. This results in an overall performance boost.

Wednesday, December 17, 2008

HTTP Basic Authentication and Realms

One of the most well known features of HTTP is Basic Authentication. You most probably know how to implement a basic authentication scheme in HTTP if you spent enough time in web development. You usually add a response status code of 401 (unauthorized) and let the web browser prompt the user for authentication. Then the browser will resubmit the authentication header with every subsequent request under the same domain name.

However, using this simple scheme, one must assume that all requests under the same domain name are accessible to the same people (the concept of roles). Suppose that you want to allow user_1 to access a part of the site with his credentials, and user_2 to access another part with OTHER credentials. Using this simple scheme, it can't be done because the whole site is assumed by the browser to be one unit. A user is either authenticated to access the whole site or not at all.

The solution to this problem is using the 'WWW-Authenticate' response header and the 'realm' keyword. This keyword simply tells the client that authentication is needed for a certain realm (or part) of the website.
WWW-Authenticate: Basic realm="site"

If the browser already has an authentication header for that realm, it will resubmit it. otherwise, it won't just submit any authentication header just because it belongs to that domain name. If it doesn't have the authentication header specific to that realm, it will re-prompt the user again for authentication. i.e. If a subsequent response has a header like this:
WWW-Authenticate: Basic realm="administration"

The browser won't resubmit the authentication header of "site" realm. It will re-prompt the user for "administration" realm authentication.

MySQL Vs PostgreSQL: key differences in queries

Coming from a MySQL background, and trying PostgreSQL for the first time, I am experiencing some key differences in query syntax. Some queries that used to work fine under MySQL now produce errors under PostgreSQL. Generally, PostgreSQL's query syntax is tighter, more strict to ANSI SQL. For anyone moving from MySQL to PostgreSQL, these differences will be helpful to know. So I'll post them in groups as I go deeper in PostgreSQL with time.

The first difference I encountered is that aliasing in PostgreSQL requires an explicit 'as'. This means that the following query, that used to work under MySQL, won't work under PostgreSQL:
SELECT count(id) count FROM users;

Instead, an explicit 'as' is needed before the alias:
SELECT count(id) AS count FROM users;


Another difference is the 'group by' issue. In MySQL queries were allowed to group the results by a subset of, not necessarily all, the selected columns. for example, the following query works under MySQL:
SELECT users.name, users.id, count(telephones.id)
FROM users LEFT JOIN telephones
ON telephones.user_id = users.id
GROUP BY users.id;

Such a query doesn't work under PostgreSQL. When grouping, all selected columns (except aggregated ones) must appear in the group-by clause:
SELECT users.name, users.id, count(telephones.id)
FROM users LEFT JOIN telephones
ON telephones.user_id = users.id
GROUP BY users.id, users.name;

This limitation makes it hard to build a query to return results that are distinct based on specific columns. i.e. This way I can't use 'group by' to return distinct results based on users.id only.
However, PostgreSQL comes with a good feature that will help you return distinct results based on some of, not all, the selected columns; 'distinct on' can be used as follows:
SELECT DISTINCT ON (users.id) users.name, users.id
FROM users LEFT JOIN telephones
ON telephones.user_id = users.id;

Nevertheless, It is worth a note that distinct-on clauses can't be used with order-by clauses in case the columns list in both clauses are different. i.e. The following query won't work because the distinct-on columns list is different than the order-by colmns list:
SELECT DISTINCT ON (users.id) users.name, users.id
FROM users LEFT JOIN telephones
ON telephones.user_id = users.id
ORDER BY users.name;


Wednesday, December 3, 2008

HTTP: Forcing download

There are some content types, other than HTML, that web browsers can render, like images, XML, and pdf. When an HTTP response is received, having a content-type from those, the default behaviour of most browsers is to render it, not to download it. To make a browser understand that the response is to be downloaded, not rendered, you must be explicit about it.

Now, how do you do it? There is a standard HTTP response header that many people just don't know about. It's called 'content-disposition'. It's default value for most browsers is 'inline', which causes the browser to render instead of download. To force download, you must specify that your content-disposition' is 'attachment'. You can also add a file name to the downloaded file:
Content-disposition: attachment; filename=name.ext

This way, the browser will understand that the response body is to be downloaded instead of rendered, even if it's a known type that can be rendered.

Rails: has_many documentation fault

I was going through an implementation case when I wanted to add a has_many :through relation from a 'User' model to itself, through a 'Friendship' model. According to the documentation of has_many:
:through Specifies a Join Model through which to perform the query. Options for :class_name and :foreign_key are ignored, as the association uses the source reflection.
The relation should infer the class name and the foreign key from the source option. i.e. If the source is :post, then the class is Post and the foreign key is post_id.

The problem was that I had a special case, where a self many-to-many relation is needed. There was a User-to-User join model called Friendship. Its table has two relevant columns: user_id and friend_id. The association using :source would look like that:
has_many :friends, :through => :friendships, :source => :user

This wouldn't be right because the relation will use 'user_id' to do the join instead of 'friend_id'.

To my surprise, the documentation was not right in this case. Adding the :class_name and :foreign_key options instead of :source worked like a charm.

has_many :friends, :through => :friendships, :foreign_key => 'friend_id', :class_name => 'user'