By Anders Cornell, Jr. DBA
PostgreSQL is a great piece of software. Its features are well-designed, and they compose elegantly. It’s among the most versatile and reliable software I've ever used and its comprehensive superiority over other relational database products leads me to think of PostgreSQL as the data-store that can do anything. But today I'm here to discuss something that PostgreSQL can't do: handle null characters (also known as zero bytes) in text values.
Conventionally, a zero byte is reserved to mark the end of a text string, so a zero byte inside a string is a contradiction. If a text string were to contain a zero byte, then that string would be truncated by any software that relies on C’s null-terminator convention. Text encodings have evolved since this convention was established, but no widely-used encoding has introduced a new meaning for the zero byte. So, regardless of encoding, if a zero byte is encountered within a string, it is probably safe to assume that it is there by mistake.
In contrast, modern systems treat zero bytes with much more passivity. Strings are no longer null-terminated in the software of this century; instead, every string is stored with an explicit length. This describes almost all software written in C++, Java, Python, Ruby, Go, JavaScript, Clojure, or Rust, for example, as well as most newer C code that does serious text handling. Application software that treats a zero byte as a string terminator is now the exception.
As a result, mistake or not, zero bytes occur in text nowadays. Cosmic rays, buggy UI code, and meddlesome users are all capable of producing a text string containing them.
Ideally, these zero bytes and other splashes of definite meaninglessness in text could be summarily rejected as errors, but human language is horrifically complicated, and in practice text validation must be conservative. Absent higher-level, application-specific validation, the most you can do without stepping on someone’s toes is to verify that the bytes of a string decode to a sequence of valid characters in your chosen character set.
Since it’s 2020, your chosen character set is Unicode, encoded with UTF-8. In UTF-8, a zero byte represents the code point U+0000 (NULL), just as a 0x61 byte represents U+0061 (LATIN SMALL LETTER A). The Unicode Standard does designate some code points “noncharacters” that “should never be interchanged,” but U+0000 is not one of them. In other words, there is no basis for rejecting null characters in the standard. Accordingly, UTF-8 decoders and Unicode normalization routines, as found in modern library code, do not reject null characters.
But PostgreSQL, whose backend codebase is over 30 years old and written in C, does. Try, for example:
postgres=# SELECT e'string with a \0 byte';
ERROR: invalid byte sequence for encoding "UTF8": 0x00
PostgreSQL's UTF-8 text type does not allow zero bytes. In other words, there are valid Unicode text strings that PostgreSQL cannot store as text.
The implications are far-reaching due to the foundational nature of the text type. For example, jsonb represents JSON string values internally as text, which means that, counterintuitively, it is possible to give PostgreSQL some valid JSON and get back an error:
postgres=# SELECT '{"a_json_object": "with_a_\u0000_byte"}'::jsonb;
ERROR: unsupported Unicode escape sequence
LINE 1: SELECT '{"a_json_object": "with_a_\u0000_byte"}'::jsonb;
^
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: {"a_json_object":...
To give another example, it is similarly impossible to use PostgreSQL's text-search features on a document containing null characters, since such a document is not representable as text.
To be fair, for a system that cannot accept text with null characters, PostgreSQL handles text with null characters commendably. It's careful not to let a null character slip in through DML, and throws a descriptive error rather than silently truncating the string. Furthermore, the frontend-backend protocol does not use null-termination, sparing database driver code the responsibility of catching zero bytes. It's hard to imagine how a system that uses null-termination internally could behave better in the face of null characters. Most software written in C does much worse, and PostgreSQL's diligence does a lot to head off potential null-byte injection vulnerabilities.
However, PostgreSQL comes in last among relational databases in null character support. Oracle, MS, and even MariaDB, for all their faults, treat U+0000 like any other character.
If one accepts that text can contain null characters, but still wants to store such text in PostgreSQL, there are two workarounds:
- Strip null characters out, or replace them with a different character (I suggest U+FFFD REPLACEMENT CHARACTER) before passing text values to the database. Otherwise, PostgreSQL will abort the transaction and throw an error. The possibility of null characters must be considered at every occasion where your application hands a string off to the database. Even though null characters are rare and unmeaningful, the need to remove them will present an unexpected burden.
- Abandon text and store the UTF-8 bytes of the string in PostgreSQL as bytea instead. Zero bytes are, of course, permitted in bytea values, so a column of type bytea can store any valid UTF-8 string. (Not to mention any invalid one.) With bytea, INSERTing a valid string will never cause an “invalid byte sequence” error, and when later SELECTed, a string will return from the database with all its characters intact. This transparency comes at the expense of all of PostgreSQL's text-processing features. JSON, full-text search, locale-aware comparison, regular expressions and more are unavailable when using bytea.
One of these approaches must be identified and implemented on an application-by-application basis. As long as PostgreSQL rejects null characters, individual engineering teams will continue to spend time working around the problem. Some teams have been lucky, their systems never encountering a null character, and haven’t had to spend time implementing a workaround. This will become less common as UTF8-encoded Unicode becomes the standard approach to text representation and the null terminator convention dies off--both of these processes are well underway.
There is only one solution. It will be difficult, but it is necessary. PostgreSQL must learn to accept null characters.
Supporting null characters will be a breaking change, but on the bright side, it need not break applications or database drivers--the frontend-backend protocol could be respecified to allow embedded nulls, without changes to client-side code, because text values in the protocol are already length-prefixed, not null-terminated. Text values are also length-prefixed in tuples on disk, so no change is needed in the disk data format either.
Here's my four-step plan for fixing PostgreSQL to support null characters:
- Switch to a length-prefixed string representation for all internal text-processing code: Datum (already length-prefixed) instead of char *. Null characters are still disallowed, but at this point, absent extensions, the database could handle them correctly.
- Deprecate public functions (functions in both the C and fmgr sense) that use the cstring data type. This includes supporting and preferring Datum-based I/O functions for new base types defined in extensions, instead of the cstring-based ones that are currently required.
- Introduce a cluster-wide, off-by-default configuration option for allowing null characters. Turning the option on will break extensions that use the deprecated functions, and applications that assume null characters are illegal.
- Far in the future, when the null-termination convention is but a distant memory, allow null characters by default.
Null-termination is a relic, and beginning to show its age. A previous generation of developers may protest, but in software that handles UTF-8 text, erroring on zero bytes should be considered a bug. To keep its status as a reliable, enterprise-grade, production-ready relational database and worthy core component in modern software stacks, PostgreSQL must make this difficult transition.
How has the null-character bug affected your company? Did you discover this blog post after a single zero byte crashed your entire application? What do you think of my four-step plan? Let's make PostgreSQL better together. Leave a comment below.