Though most databases offer the capability to index the data within tables, PostgreSQL takes it to another level by offering a wide array of index types as well as custom indexes.
What is an Index?
An index is a secondary data structure that provides optimized information on the location of specific data within a table (heap). For example, you may have the following table structure:
CREATE TABLE test_index (id BIGSERIAL PRIMARY KEY, first_name TEXT);
This will create a table with the name test_index containing the columns id and first_name. The id column is automatically generated as a BIGINT using the pseudo-type BIGSERIAL and it is our PRIMARY KEY. The PRIMARY KEY will ensure that each row has a UNIQUE id and that no NULL values are present. When the PRIMARY KEY is specified it also creates a secondary data structure, the PRIMARY KEY, which is also an index.
We can see this by describing the structure of the table:
Table "public.test_index"
[...]
Indexes:
"test_index_pkey" PRIMARY KEY, btree (id)
When reviewing the table structure you can see that PostgreSQL automatically created an index with the name “test_index_pkey.” It is designated as the PRIMARY KEY and is using a b-tree index on the column id.
What good is an index?
The core purpose of an index is to provide faster access to the rows in the table that you are looking for. We can test this by adding data to our table:
INSERT INTO test_index(id,first_name) VALUES (generate_series(1,1000000), 'My First_Name');
Using the INSERT statement we have added one million rows using BIGSERIAL (which will autoincrement) and a text value of ‘My First_name’ for each row. This will provide us with enough data to test the immediate value of an index. Let’s query PostgreSQL and ask for the rows with the id of value 50 - 100:
Index Scan using test_index_pkey on test_index (cost=0.42..9.50 rows=54 width=22) (actual time=0.011..0.021 rows=51 loops=1)
Index Cond: ((id >= 50) AND (id <= 100))
Planning Time: 0.136 ms
Execution Time: 0.039 ms
Using the SELECT with the EXPLAIN command we can see that the query utilized the index that was created with the PRIMARY KEY and that the query took a total execution time of 0.039ms. To show the benefit of the index, let’s run the exact same query with index scans disabled. This will mimic the behavior of not having an index on the table.
Planning Time: 0.160 ms
Execution Time: 51.803 ms
As you can see, even with a fast NVME ssd (my laptop), sequentially scanning through 1 million records to retrieve only 50 is quite a bit slower than utilizing an index.
Indexing options
PostgreSQL offers the widest array of index options available to a database. The index type most used is B-Tree and we will only be discussing it. Other articles may explore types such as GIN.
B-tree
In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children.[2] Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.[1]
A practical advantage of B-Tree is that it is widely used, widely understood, and supports five comparison operators. They are:
- Less than: <
- Less than or equal: <=
- Equal: =
- Greater than or equal: >=
- Greater than: >
You can read more about the PostgreSQL B-tree operator class here.
What about text?
By default a B-Tree index will not provide operators for text based data types including varchar, text and char. This can make searching text using an operator such as LIKE , ~ (REGEX) or = possible.
UPDATE test_index SET first_name = ‘Joshua’ WHERE id = 50;
The UPDATE we just executed will change exactly one row. We are changing the column first_name from ‘My First_name’ to ‘Joshua’ WHERE id = 50. Now let’s SELECT only that single column:
[...]
Planning Time: 0.078 ms
Execution Time: 45.385 ms
Now we add a B-tTree index with text_pattern_ops and run the query again:
CREATE INDEX first_name_idx on test_index(first_name text_pattern_ops);
EXPLAIN ANALYZE SELECT * FROM test_index WHERE first_name = 'Joshua';
Index Scan using first_name_idx on test_index
[...]
Planning Time: 0.255 ms
Execution Time: 0.052 ms
The other two text pattern operators are varchar_pattern_ops and bpchar_pattern_ops for use with the data types varchar and char respectively.
Case insensitive searching
A common problem this can solve is case-insensitive searching. By default PostgreSQL will not use an index if you were to use an operator such as ~* or ILIKE. You can work around this problem by using a functional index.
CREATE INDEX lower_first_name_idx ON test_index(LOWER(first_name));
SELECT * FROM test_index WHERE lower(first_name) = ‘Joshua’;
[...]
Planning Time: 0.079 ms
Execution Time: 0.112 ms
Partial Indexes
By default PostgreSQL will index an entire data set within a column or columns. There may be times when this is not necessary. Consider a situation where you know that 99% of the time you are pulling data from only the first 100 ids in our table. You could define an index like this:
CREATE INDEX fifty_and_under_idx ON test_index (id) WHERE id BETWEEN 1 AND 50;
Index Scan using fifty_and_under_idx on test_index
[...]
Index Cond: (id = 35)
Planning Time: 0.844 ms
Execution Time: 0.066 ms
PostgreSQL chose the fifty_and_under_idx instead of the PRIMARY KEY because the cost of using the fifty_and_under_idx is cheaper. It is a smaller index and therefore uses less memory, less cpu resources and is faster to provide the required data.
Index only scans and Covering Indexes
In older versions of PostgreSQL, a B-tree index only contained pointers to the heap (table) of where a particular set of data could be accessed. This can cause performance issues because we first need to scan the index and then retrieve the data from the table. While Index scans are relatively cheap due to how they store data, table scans can be very expensive because the data can be at any place within the table. Though current enterprise methods of storage have alleviated much of this (SSD/NVME) performance impact, it still exists as we need to search for the data twice.
An Index Only or Covering index allows the storing of specific data within the index. When this type of index is utilized we only need to retrieve data from the index itself and we do not need to retrieve the data from the table. However, specific conditions must be satisfied for these types of indexes to work. From the PostgreSQL docs[2]:
- The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST indexes support index-only scans for some operator classes but not others. Other index types have no support. The underlying requirement is that the index must physically store, or else be able to reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value.
- The query must reference only columns stored in the index. For example, given an index on columns x and y of a table that also has a column z, these queries could use index-only scans:
SELECT x, y FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND y < 42;
but these queries could not:
SELECT x, z FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND z < 42;
Using an example from our table:
CREATE INDEX id_first_name_idx ON test_index (id,first_name); EXPLAIN ANALYZE SELECT * FROM test_index WHERE id < 50 AND first_name = 'Joshua';
Index Only Scan using id_first_name_idx on test_index (cost=0.42..6.70 rows=1 width=22) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((id < 50) AND (first_name = 'Joshua'::text))
Heap Fetches: 0
Planning Time: 0.146 ms
Execution Time: 0.025 ms