Sunday, March 23, 2014

What I think of jsonb

Unsurprisingly, there has been a lot of interest in the jsonb type, which made it into the upcoming 9.4 release of Postgres. I was initially a reviewer of jsonb, although since I spent weeks polishing the code, I was ultimately credited as a co-author.

Jsonb is a new datatype for Postgres. It is distinct from the older json datatype in that its internal representation is binary, and in that it is internally typed. It also makes sophisticated nested predicates within queries on jsonb indexable.  I've occasionally described the internally-typed scalar values as having “shadow types” unknown to the core SQL parser. This has several implications. For example, if you sort two Jsonb values containing only scalar numbers, the implementation invokes the numeric comparator (which the jsonb default B-Tree opclass comparator is defined in terms of). The on-disk representation of jsonb includes the same representation as is used for, say, numerics (as the internal binary representation of JSON primitive numbers, for example). Plus, JSON objects are de-duplicated by key on input, and optimized for cheap binary searches within a single jsonb. Still, like the earlier json type, jsonb in every sense “speaks JSON”. There are some limitations on what can be represented as a jsonb number, but those are exactly the same limitations that apply to the core numeric type (plus some limitations imposed by the JSON RFC, such as not accepting NaN values). I hope it suffices to say that these limitations are virtually irrelevant, and that many implementations have similar or worse limitations. All of these minor implementation-defined restrictions are explicitly anticipated and allowed for by the recent JSON RFC-7159.



Jsonb is emphatically not like the BSON format used by MongoDB. That format accepts input in such a way as to be backwards compatible with JSON, but I believe that BSON isn't really a practical interchange format, because the software development community at large is presumably disinclined to buy into an interchange format that as yet is not described by any RFC, or any communiqué of a recognized standards body. In contrast, jsonb is a datatype that will only ever output valid textual JSON, and will only ever accept valid textual JSON (subject to the aforementioned obscure and practically irrelevant restrictions, and the caveat on automatically normalizing duplicate-keyed pairs within the same object). Jsonb also imposes an internal ordering on object pairs. Again, this is all anticipated and allowed for by the JSON RFC.

It's possible that I'm mistaken, and that BSON or something else will emerge as an actual standard (either de facto or de jure), since I've heard that there is support in the works for database systems other than MongoDB. It's not impossible that pursuing something like BSON might be an interesting future direction for Postgres, since for one thing BSON supports more than the 4 standard JSON primitive types. In any case it's important to note that the protocol or on-disk binary representation of jsonb is an implementation detail; we're not in competition with BSON, and this isn't a new standard. It's just a new Postgres datatype, with new indexing capabilities. I think it's notable that BSON doesn't have a JSON-style universal number type. It has 32-bit and 64-bit integer types, and double precision 64-bit IEEE 754 floating point numbers. It strikes me that this omission tells me all I need to know about binary interchange formats.

To understand how the jsonb type works in more detail, I suggest taking a look at the devel documentation. It's worth taking a close look at containment semantics, since that's the really compelling way of searching through jsonb documents.

Strategic significance


There has been a little back and forth among senior community members about the significance of jsonb. Josh Berkus wrote that he thought it was the most important 9.4 feature. Robert Haas was skeptical, preferring the logical decoding stuff. I've even seen one or two people in the comments section of various news articles grumble about Postgres jumping on the JSON bandwagon.

I have to admit that relatively speaking, jsonb is not in and of itself all that technically complex. While it is a great feature, and while I think it puts Postgres in a very competitive position relative to certain other systems, it would be almost trivial to ship a jsonb extension that works with earlier versions of Postgres. However, without taking a position on what the best 9.4 feature is going to be (I like both jsonb and logical decoding, and contributed in various ways to both), I think that it's possible that Josh Berkus and Robert Haas are both more or less right at the same time, and their apparent disagreement reflects only their individual priorities for Postgres.

It is very much to the credit of the principle jsonb authors, Oleg Bartunov and Teodor Sigaev, that with some help from Andrew Dunstan and myself they managed to define what I think is internally a solid nested, strongly-typed format for us to build on, with a textual output representation that just so happens to be the same one that has emerged as a standard for this kind of thing. But, to me, as a Postgres hacker, their previous work – and the previous work of Alexander Korotkov (who, due to an unfortunate oversight, was not credited in the jsonb commit message) – is the real story here. As the authors of GIN, Oleg and Teodor are perhaps most responsible for the foundation on which jsonb is built, a foundation built over many years. Alexander's excellent recent work on improving the GIN access method (with help in various areas from Heikki Linnakangas), which also made it into 9.4, is probably what will end up making jsonb really shine. This includes compression of GIN posting lists,  speeding up "rare & frequent" type GIN queries, multi-key GIN search skipping, and further optimizations to multi-key searching. Alexander had earlier reported some very impressive improvements in PostgreSQL full-text search performance as a result of all of this, with performance apparently competing with that of external systems like Sphinx and Solr. It is likely that many of the same big performance improvements seen there concomitantly benefit the jsonb GIN opclasses.


jsonb_path_ops


Having said that, Alexander's jsonb_path_ops alternate GIN operator class, which was his contribution to the big jsonb patch deserves an honorable mention. By combining GIN with hashing of either key/value pairs, or array elements, resulting indexes can give great performance for sophisticated “containment” type queries against JSON documents. Indexes are a fraction of the size of the data indexed, index scans are incredibly fast, and yet these GIN indexes make indexable very complex nested “containment” queries. The results are so impressive that at last November's pgConf.EU conference, an EXPLAIN ANALYZE comparative example in a presentation given by Oleg and Teodor was greeted with sustained applause.

I'm really pleased that we worked towards making all of this as beneficial as possible to the largest possible number of people, but even as it puts Postgres in a very competitive position with respect to some non-relational systems, jsonb does not really represent any kind of pivot towards Postgres as a document store – Postgres has always been an object-relational system. Rather, I think it is one particular outcome of a much bigger process that has been underway for many years.

I'll watch the future development of the "VODKA" index access method with interest, because at this early stage it is my understanding that it's intended to make searching nested, heterogeneous structures more flexible and better performing still. It seems likely that there will be a number of other applications for that infrastructure too, since like GiST, GIN, and SP-GiST, it is intended to be an extensible infrastructure that serves many analogous needs in a general way.

6 comments:

  1. I very like your notice, that jsonb is a consequence of our (I and Teodor Sigaev) more than decade development - intarray, tsearch, hstore, ltree, full text search, pg_trgm, GiST, GIN, SP-GiST are all about semi-structured data.

    ReplyDelete
  2. Congrats to the Postgres team. jsonb is an exciting feature and I'm looking forward to it! Thanks for all your hard work!

    ReplyDelete
  3. This is indeed going to be a cool feature. We made tests with a different bson implementation and PostgreSQL and it was much faster than MongoDB, so I expect about the same speed-up. In many use cases, people will be much better off with PostgreSQL then.

    What I don't really understand is the roadmap regarding jsonb and hstore. Is jsonb going to be the ultimate type for document-style data? Will it supersede hstore or will they coexist? As I read in a presentation from Oleg, hstore in (probably) 9.4 has similar features to offer.

    ReplyDelete
  4. hstore is going to continue to be useful as a simple key/value store, but I don't think it's likely to expand in scope, given the discussion on list. jsonb is in a certain sense the successor of hstore, given the technical foundation that they share, but that's about the extent of it.

    Originally, we thought that hstore could be made to support multiple input formats, an hstore compatible one, and a JSON one, but during our discussion it transpired that that probably wasn't worth it. Personally, I'm happy to let hstore be hstore. I think that jsonb is definitely the area that we'll target for future improvements to general handling of semi-structured or unstructured data.

    ReplyDelete
  5. I hope all of those TFS performance improvements from 2012 went into the PostgreSQL 9.4 codebase.

    If they are, they should get a lot more attention. They should at least be mentioned in the PostgreSQL 9.4 release notes.

    ReplyDelete
  6. Perhaps you would know the answer to a jsonb question: TOAST does not seem to do any compression on this format, even for large json docs. Is that by design?

    ReplyDelete