Page 1 of 1

Ignoring diacritics when searching in PostgreSQL

PostPosted: Sun Apr 24, 2016 1:09 pm
by ROCLASI
At ServoyCamp 2016 I did a talk about database indexing and someone had a question about being able to search for records that may have diacritics in them.
I couldn't answer that question so I looked a little deeper into this issue and found the following:

PostgreSQL does come with an function that is part of an extension to remove diacritics based on a dictionary.
To use this you simply install the extension into your database using the following SQL:
Code: Select all
create extension unaccent

After this you can test it out like so:
Code: Select all
select 'Hôtël', unaccent('Hôtël'), ('Hotel' = unaccent('Hôtël'))

Which results in
Code: Select all
Hôtël   Hotel   true

The only downside is that the function is not immutable and therefor can't be using in a functional index. This is due to the fact it uses an external dictionary which you can alter and therefor the function may not produce the same result over time.
So using this function in a WHERE clause will always result in a table scan.

There are 2 other options though.
1) the most simple solution is to create an extra column in your table that stores a copy of your accented data but without the diacritics and place a regular index on this.
So when you insert the data you also store the result of the unaccent() function in you extra column. When searching you use the unaccented column but return the accented data.
This is an easy solution for when you have a single column that needs this treatment. But it's not ideal when you have many of them.

2) Another option would be to write your own (immutable) function to strip all the diacritics and use that in a functional index.
I found the name2uri function that does a little to much for our purpose so I altered it a bit.
Execute the following SQL in your database to add the function:
Code: Select all
CREATE OR REPLACE FUNCTION zappdiacritics(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
SELECT translate(
    $1,
    'áàâãäåāăąèééêëēĕėęěìíîïìĩīĭḩóôõöōŏőùúûüũūŭůäàáâãåæçćĉčöòóôõøüùúûßéèêëýñîìíïş',
    'aaaaaaaaaeeeeeeeeeeiiiiiiiihooooooouuuuuuuuaaaaaaeccccoooooouuuuseeeeyniiiis'
);
$$;

So when you use the same type of query:
Code: Select all
select 'Hôtël', zappdiacritics('Hôtël'), ('Hotel' = zappdiacritics('Hôtël'))

You get the same result as with unaccent().

Now you can use it in a index like so:
Code: Select all
create index myColymn_unaccented_idx on myTable (zappdiacritics(myColumn))


So the following SQL will be able to use the index.
Code: Select all
SELECT * FROM myTable WHERE zappdiacritics(myColumn)=?


Hope this helps.

Re: Ignoring diacritics when searching in PostgreSQL

PostPosted: Mon Apr 25, 2016 10:15 am
by Ruben79
This is awesome and very helpful!

Thanks Robert!

Re: Ignoring diacritics when searching in PostgreSQL

PostPosted: Mon Apr 25, 2016 7:12 pm
by Roberto Blasco
Thanks a lot Robert !!!!!

That's just what I was looking for :-)

Re: Ignoring diacritics when searching in PostgreSQL

PostPosted: Mon Apr 25, 2016 9:58 pm
by achiary
Thanks Robert ! It is very helpful in spanish !

Re: Ignoring diacritics when searching in PostgreSQL

PostPosted: Wed Mar 06, 2019 3:20 pm
by edwin.boon
Thanks for the solution. This solved our challenge with the postgresql unaccent function :D