Ignoring diacritics when searching in PostgreSQL

Questions and answers regarding general SQL and backend databases

Ignoring diacritics when searching in PostgreSQL

Postby ROCLASI » Sun Apr 24, 2016 1:09 pm

At ServoyCamp 2016 I did a talk about database indexing and someone had a question about being able to search for records that may have diacritics in them.
I couldn't answer that question so I looked a little deeper into this issue and found the following:

PostgreSQL does come with an function that is part of an extension to remove diacritics based on a dictionary.
To use this you simply install the extension into your database using the following SQL:
Code: Select all
create extension unaccent

After this you can test it out like so:
Code: Select all
select 'Hôtël', unaccent('Hôtël'), ('Hotel' = unaccent('Hôtël'))

Which results in
Code: Select all
Hôtël   Hotel   true

The only downside is that the function is not immutable and therefor can't be using in a functional index. This is due to the fact it uses an external dictionary which you can alter and therefor the function may not produce the same result over time.
So using this function in a WHERE clause will always result in a table scan.

There are 2 other options though.
1) the most simple solution is to create an extra column in your table that stores a copy of your accented data but without the diacritics and place a regular index on this.
So when you insert the data you also store the result of the unaccent() function in you extra column. When searching you use the unaccented column but return the accented data.
This is an easy solution for when you have a single column that needs this treatment. But it's not ideal when you have many of them.

2) Another option would be to write your own (immutable) function to strip all the diacritics and use that in a functional index.
I found the name2uri function that does a little to much for our purpose so I altered it a bit.
Execute the following SQL in your database to add the function:
Code: Select all
CREATE OR REPLACE FUNCTION zappdiacritics(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
SELECT translate(
    $1,
    'áàâãäåāăąèééêëēĕėęěìíîïìĩīĭḩóôõöōŏőùúûüũūŭůäàáâãåæçćĉčöòóôõøüùúûßéèêëýñîìíïş',
    'aaaaaaaaaeeeeeeeeeeiiiiiiiihooooooouuuuuuuuaaaaaaeccccoooooouuuuseeeeyniiiis'
);
$$;

So when you use the same type of query:
Code: Select all
select 'Hôtël', zappdiacritics('Hôtël'), ('Hotel' = zappdiacritics('Hôtël'))

You get the same result as with unaccent().

Now you can use it in a index like so:
Code: Select all
create index myColymn_unaccented_idx on myTable (zappdiacritics(myColumn))


So the following SQL will be able to use the index.
Code: Select all
SELECT * FROM myTable WHERE zappdiacritics(myColumn)=?


Hope this helps.
Robert Ivens
SAN Developer / Servoy Valued Professional / Servoy Certified Developer

ROCLASI Software Solutions / JBS Group, Partner
Mastodon: @roclasi
--
ServoyForge - Building Open Source Software.
PostgreSQL - The world's most advanced open source database.
User avatar
ROCLASI
Servoy Expert
 
Posts: 5438
Joined: Thu Oct 02, 2003 9:49 am
Location: Netherlands/Belgium

Re: Ignoring diacritics when searching in PostgreSQL

Postby Ruben79 » Mon Apr 25, 2016 10:15 am

This is awesome and very helpful!

Thanks Robert!
Ruben de Jong
Stb Software Development
SAN Partner

Stb Software Development - http://www.stb.nl
User avatar
Ruben79
 
Posts: 97
Joined: Wed Apr 18, 2007 12:43 pm

Re: Ignoring diacritics when searching in PostgreSQL

Postby Roberto Blasco » Mon Apr 25, 2016 7:12 pm

Thanks a lot Robert !!!!!

That's just what I was looking for :-)
Un saludo. Roberto.

Madrid - Spain
Tfno: (+34) 625653066
E-mail: roberto.blasco.serrano@gmail.com
User avatar
Roberto Blasco
007
 
Posts: 355
Joined: Tue Apr 08, 2008 7:18 pm
Location: Madrid / Spain

Re: Ignoring diacritics when searching in PostgreSQL

Postby achiary » Mon Apr 25, 2016 9:58 pm

Thanks Robert ! It is very helpful in spanish !
achiary
 
Posts: 83
Joined: Wed Nov 14, 2007 3:29 pm
Location: Argentina

Re: Ignoring diacritics when searching in PostgreSQL

Postby edwin.boon » Wed Mar 06, 2019 3:20 pm

Thanks for the solution. This solved our challenge with the postgresql unaccent function :D
edwin.boon
 
Posts: 19
Joined: Fri Jul 29, 2016 11:40 am


Return to SQL Databases

Who is online

Users browsing this forum: No registered users and 2 guests