Text search has become an essential requirement for any modern web application. Often we require advanced search capabilities in our web application which cannot be fulfilled with default search operators in a database and that is when Full-Text Search (FTS) comes to the rescue. It performs linguistic searches against text data and operates on words and phrases based on rules of a specific language. Elasticsearch is a robust and high-performance full text search engine. PostgreSQL also provides full text search facility. FTS features available in PostgreSQL may not be as powerful as Elasticsearch but if you are using PostgreSQL as a primary database and want to use Elasticsearch for FTS in your system then you will have to put in place some kind of synchronization mechanism between the two. In some cases, you may require those powerful features of Elasticsearch but in many others, you simply won’t need them.
The PostgreSQL FTS feature is good enough for what you want to do and since they are easily available with PostgreSQL database, you don't need to put in extra efforts in installing, configuring and maintaining another DB for FTS.
In this blog, I will walk you through full text search functionality provided by PostgreSQL database and will discuss how to use it along with Hibernate.
In text retrieval, full text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. The full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.-- Wikipedia
PostgreSQL full text search (FTS) is an advanced search option where results cannot be retrieved through "=", "LIKE" or "~" operators. PostgreSQL FTS provides following advanced features:
A document is the unit of searching in a full-text search system. In PostgreSQL context, a document is generally a text field within a row of a database table or combination of such fields from several tables obtained dynamically. A document can also be constructed from various parts for indexing and may not be stored anywhere else as a whole. It is not related to table schema but to data and may be a logical entity. A document needs to be formatted into lexemes which are understood by PostgreSQL for full text search support.
“A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token”.PostgreSQL provides two data types to support full-text search, one is tsvector and another is tsquery type.
PostgreSQL provides tsvector data type for storing preprocessed documents. For text search purpose, each document needs to be reduced to preprocessed tsvector format. PostgreSQL performs searching and ranking of results entirely on tsvector representation of a document. A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word look alike. For example converting uppercase letters to lowercase, removing suffixes which allows searches to find variant forms of a word.
For example:
SELECT 'A fat dog jumped over the lazy cat'::tsvector;
tsvector
----------------------------------------------------
'A' 'cat' 'dog' 'fat' 'jumped' 'lazy' 'over' 'the'
Type tsquery represents a text query. A tsquery value stores search terms, which must be already-normalized lexemes and may combine multiple Boolean operators.
For example:tsquery
SELECT 'dog & cat'::tsquery;
tsquery
---------------------------
'dog' & 'cat'
SELECT 'dog & (cat | rat)'::tsquery;
tsquery
---------------------------
'dog' & ( 'cat' | 'rat' )To implement full text search there must be a function to create a tsvector from a document and a tsquery from a user query.
PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type. The to_tsvector parses and normalizes a document string into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat');
would result in
to_tsvector
----------------------
'cat':8 'dog':3 'fat':2 'jump':4 'lazi':7
Here stop-words “A” and “over” were removed, and the words “jumped” and “lazy” were converted to their stems.
Function to_tsquery is used to convert user-written text into a proper tsquery format. The function to_tsquery takes one argument, a string containing the tokens to search for. We can provide an optional argument to this function.
The to_tsquery function does not accept a string of text, it requires a string of tokens separated by operators. It then goes and creates a true tsquery to retrieve the results.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat') @@ to_tsquery( 'jumping') would result in t, meaning that a match is found.
Function plainto_tsquery is used to convert a string of text into a proper tsquery format. plainto_tsquery transforms unformatted text querytext to tsquery. The text is parsed and normalized much as for to_tsvector, then the & (AND) Boolean operator is inserted between surviving words.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat') @@ plainto_tsquery('jumping over'); would result in t.
Observe we have not used operator between search words, now it is a simple search string.
Here, the plainto_tsquery function converts this string to a list of lexemes separated by & (AND) operator. The only limitation with this function is that it can separate words(lexemes) with &(AND) operator only.
PostgreSQL full text search is based on the match operator @@, which returns true if a tsvector matches a tsquery. By default, PostgreSQL uses ‘english’ as text search configuration for the function to_tsvector and it will also ignore English stop words.
PostgreSQL full text search (FTS) with Hibernate:
There are many approaches available to capitalize on PostgreSQL FTS capabilities with Java web application, however, we shall restrict our discussion to use PostgreSQL FTS with Hibernate HQL.
As mentioned earlier, PostgreSQL makes use of @@ operator for full text search, If we want to support this in hibernate or HQL queries then we need to modify our current PostgreSQLDialect to support special character '@@' for full text search.
public class PostgreSQLFTSFunction implements SQLFunction {
@Override
public boolean hasArguments() {
return true;
}
@Override
public boolean hasParenthesesIfNoArguments() {
return false;
}
@Override
public Type getReturnType(final Type firstArgumentType, final Mapping mapping) throws QueryException {
return new BooleanType();
}
@Override
public String render(Type firstArgumentType, List args, SessionFactoryImplementor factory) throws QueryException {
if (args== null || args.size() < 2) {
throw new IllegalArgumentException("The function must be passed at least 2 arguments");
}
String fragment = null;
String ftsConfiguration = null;
String field = null;
String value = null;
if (args.size() == 3) {
ftsConfiguration = (String) args.get(0);
field = (String) args.get(1);
value = (String) args.get(2);
fragment = "to_tsvector(" + ftsConfiguration + ", " + field + ") @@ " + "plainto_tsquery(" + ftsConfiguration + ", " + value + ")";
} else {
field = (String) args.get(0);
value = (String) args.get(1);
fragment = "to_tsvector(" + field + ") @@ " + "plainto_tsquery('" + value + "')";
}
return fragment;
}
}Here we are implementing SQLFunction interface. The crucial method here is overridden render method. This method renders the function call as SQL fragment. It generates a valid SQL statement based on the provided input. In this method, we make '@@' operator available to Hibernate.
We need to create a new class and extend PostgreSQL9Dialect class to register our newly created function.public class CustomPostgreSQLDialect extends PostgreSQL9Dialect {
public CustomPostgreSQLDialect () {
registerFunction("fts", new PostgreSQLFTSFunction());
}
}This is the place where we can register some additional full text search functions required on top of those defined in PostgreSQLDialect.
Update following hibernate property:<property name="hibernate.dialect" value="pkg_name.CustomPostgreSQLDialect " />
Assuming Employee class in an entity:List<Employee> list = em.createQuery("select e from Employee e where fts(pg_catalog.english, e.address, :address) = true")
.setParameter("address", "String to be searched").getResultList();Here, "pg_catalog.english" tells PostgreSQL that we will use English dictionary for search. "Address" is the property of Employee class mapped to the corresponding column in the database. One thing to note here is SQL Functions have to return a value and when used in HQL it must be in an expression form. (fts(address, ‘search string’ ) = true).
The full-text search feature provided by PostgreSQL is excellent and quite fast (enough). In this post, we have gone through basic operators provided by PostgreSQL for full text search and how to use hibernate in order to leverage the power of PostgreSQL FTS. This blog is an overview but it should give enough background information to get you started if you are planning to use hibernate with PostgreSQL for full text search.