There has been a lot of discussion recently in the XBRL community about the use of XBRL for very large datasets. There are a number of misconceptions about the practicalities of working with large instances, and some confusion about the extent to which different approaches to processing XBRL can improve performance. This article attempts to shine some light on the problem, and propose ways in which performance could be improved when working with large datasets.
When XML was gaining popularity at the turn of the century, there were many people who complained that it was an inherently inefficient way to work with data. For anyone with experience of packing data into binary structures to minimise storage and memory usage, the idea of using XML tags around text representations of data seemed extremely wasteful.
The reality is that whilst XML is inherently inefficient relative to packed binary formats, or even CSV, computing power and memory usage had reached the point where, for most everyday data sets, the performance implications of this inefficiency were negligible, and were outweighed by the benefits of working with self-describing data that could be processed using standard validators and tools.
As computers have continued to evolve, the cut-off for how much data it is reasonable to handle using XML has increased. For example, the core of many XML applications is the Document Object Model (DOM). Memory requirements for DOM are of the order of ten times the size of the XML document. In a world where computers with several gigabytes of RAM are commonplace, processing XML documents that are tens of megabytes in size has become feasible, but documents that are more than a few hundred megabytes in size remain problematic.
For such datasets, there are essentially two options: parse the document into a DOM-like, in-memory representation that offers random access to everything, or process it as a stream and retain only the information you actually need. Both are discussed below.
XBRL, being built on XML, suffers from the same inefficiency of representation, and the same challenges in processing. In fact, in many cases, the problems are more acute as XBRL is not particularly efficient in its use of XML. This is particularly noticeable for heavily dimensional data, where each <context> element is only used by a small number of facts.
As noted above, many processors are built around the Document Object Model (DOM), or some other DOM-like interface such as libxml or Python’s lxml [1]. The key feature of such interfaces is that the XML document is parsed into an in-memory representation allowing random access to all information that was in the XML. A “universal truth” that is often cited by people who know just enough to be dangerous is that “DOM is really inefficient”. Whilst it is true that the memory overhead of the DOM is significant, the question of whether it is an efficient way to solve a particular problem depends on the nature of the problem and what the alternative approaches are.
The standard alternative to a DOM-like approach is a stream-based approach such as SAX. SAX presents an XML document as a series of events, and it is up to the consuming application to extract the useful items of data as the events are received, and typically, store the extracted information in some in-memory representation.
The key to a stream-based approach being more efficient than a DOM-based approach is how much information you store in memory as a result of the parse, and the key to that is whether you can know in advance what subset of information you want from the XML document.
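As a concrete sketch of that idea, the following Python fragment uses the standard library’s SAX interface to pull just the text of selected elements out of a document, discarding everything else as the events stream past. The element names here are invented for illustration:

```python
import xml.sax

# A stream-based parse keeps only the subset of data we care about.
# Here we extract just the text of <fact> elements (a hypothetical
# element name) and let every other event go by unrecorded.
class FactExtractor(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.facts = []
        self._in_fact = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "fact":
            self._in_fact = True
            self._buffer = []

    def characters(self, content):
        # May be called several times per element, so accumulate.
        if self._in_fact:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "fact":
            self.facts.append("".join(self._buffer))
            self._in_fact = False

doc = "<report><fact>100</fact><note>ignored</note><fact>200</fact></report>"
handler = FactExtractor()
xml.sax.parseString(doc.encode("utf-8"), handler)
print(handler.facts)  # ['100', '200']
```

The memory used here is proportional to the extracted subset, not to the size of the document, which is precisely the property that a DOM-based parse cannot offer.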
When working with an XBRL document, you generally don’t need all the information that’s in the XML model. What you want is an XBRL model. You don’t want to work in terms of elements and attributes, you want to work in terms of facts, concepts, labels, dimensions, etc. In an ideal world you could SAX-parse your XBRL document straight into an XBRL model, and there would be no need for a DOM-style, in-memory representation of the XML.
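In that ideal world, streaming an instance straight into a fact model might look something like the following sketch. It uses Python’s `iterparse`, with deliberately simplified namespaces and an invented concept name rather than a real taxonomy, and clears each element once its fact has been captured so that the full XML tree never accumulates in memory:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# A simplified, illustrative XBRL instance (not a real taxonomy).
doc = StringIO(
    '<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance" '
    'xmlns:us="http://example.com/taxonomy">'
    '<xbrli:context id="c1"/>'
    '<us:Revenue contextRef="c1" decimals="0">500</us:Revenue>'
    '</xbrli:xbrl>'
)

facts = []
for event, elem in ET.iterparse(doc, events=("end",)):
    # Anything carrying a contextRef is a fact; the rest is syntax.
    ctx = elem.get("contextRef")
    if ctx is not None:
        facts.append({"concept": elem.tag, "context": ctx, "value": elem.text})
    elem.clear()  # discard the XML once the fact has been extracted

print(facts)
```

A real processor would of course also have to resolve contexts, units and the taxonomy, but the shape of the approach is the same: the XBRL model is built as the stream goes past, and the XML is thrown away.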
Unfortunately, we don’t live in an ideal world, and there are a few ways in which XBRL clings unhelpfully to its underlying XML representation. The heart of the problem is that there is no well-defined information model for XBRL. There have been various efforts to create one, such as the XBRL Infoset, and more recently the Abstract Model, but none have yet come to fruition. The result of this is that there is no common agreement about which parts of an XML document are “significant” from an XBRL perspective, and which parts are irrelevant syntax-level detail.
A good example of where this creates a practical problem is XBRL Formula’s use of XPath as an expression language. Whilst the primary way of selecting facts for use in a formula is to assign them to variables using the fact selection constructs provided by the specification, XBRL Formula allows formula writers to include arbitrary XPath expressions. In other words, they can work not just with the XBRL, but with the underlying XML. Whilst this makes XBRL Formula very powerful, it means that an XBRL Formula processor is obliged to keep a copy of the XML document in memory in order to support the XML-based navigation required by XPath. In other words, if you want to use XBRL Formula, you’re pretty much stuck with the DOM, or something very much like it.
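The following Python sketch, over a deliberately simplified instance, illustrates the constraint: to evaluate an arbitrary path expression, the processor has to keep the parsed tree around, because it cannot know in advance which elements the expression will touch:

```python
import xml.etree.ElementTree as ET

# A simplified fragment of an XBRL instance, for illustration only.
doc = (
    '<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">'
    '<xbrli:context id="c1">'
    '<xbrli:entity>'
    '<xbrli:identifier scheme="http://example.com">ACME</xbrli:identifier>'
    '</xbrli:entity>'
    '</xbrli:context>'
    '</xbrli:xbrl>'
)
tree = ET.fromstring(doc)  # the whole tree must be retained in memory
ns = {"xbrli": "http://www.xbrl.org/2003/instance"}

# A syntax-level query, reaching past the fact model into raw XML detail:
identifiers = [e.text for e in tree.findall(".//xbrli:entity/xbrli:identifier", ns)]
print(identifiers)  # ['ACME']
```

Nothing in a stream-based fact model would have kept the `<xbrli:entity>` element around, so supporting queries like this forces the processor back to a DOM-like representation.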
Another example is in the specification of validation rules. Here at CoreFiling, we’ve got a really nice XBRL model in the form of our True North API, and it makes writing XBRL validation rules really quick and easy. Unfortunately, validation requirements are often specified in terms of XML syntax rather than an XBRL model (this isn’t altogether surprising, given the above-mentioned absence of a commonly agreed XBRL model). A prime example of this is the EDGAR Filer Manual, which defines validation criteria for SEC submissions. A quick read of the manual reveals rules specified in terms of XML artefacts such as elements and attributes, and not just XBRL artefacts like facts and concepts. The net result of this is that in order to implement many of these rules accurately, we need to dive behind our nice XBRL model and delve into a lower-level DOM-like model of the XML.
To summarise, in order to work with XBRL more efficiently and allow scaling to much larger instance documents, we need to work with it as XBRL, not XML. We need to introduce the notion of a “Pure XBRL Processor” which is free to discard irrelevant XML syntax.
In order to do this, we first need to define a commonly agreed XBRL model. We can then be clear about which problems can be solved with an efficient Pure XBRL Processor, and which are dependent on a processor with access to the underlying XML.
We then need to revisit technologies such as XBRL Formula and figure out how we can make them work with a Pure XBRL Processor. One option, of course, is to switch to an entirely different technology such as Sphinx which is already built on top of an XBRL model.
Another option is to restrict the XPath expressions that are allowed in XBRL Formula to a subset that can be implemented on top of a Pure XBRL Processor. In other words, retain the ability to access functions and variables, but remove the ability to do arbitrary navigation of the underlying XML document. This would be no bad thing. I spoke recently to XBRL Formula guru Herm Fischer, and he expressed his concern at the number of Formula rules he’d seen that use XPath expressions to navigate the XML model, rather than treating it as XBRL.
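A naive illustration of such a restriction, in Python: accept expressions built from variables, functions and arithmetic, and reject anything that navigates the document. A real implementation would work on the parsed XPath grammar rather than on strings, so treat this purely as a sketch of the idea:

```python
import re

# Reject any expression containing path navigation: child steps ('/'),
# descendant steps ('//'), explicit axes ('::') or parent steps ('..').
# A string-level check like this is far too crude for production use,
# but it conveys the shape of a "pure XBRL" expression subset.
FORBIDDEN = re.compile(r"//|/|::|\.\.")

def is_pure_xbrl_expression(expr: str) -> bool:
    return not FORBIDDEN.search(expr)

print(is_pure_xbrl_expression("$revenue + $costs"))            # True
print(is_pure_xbrl_expression("ancestor::xbrli:context/@id"))  # False
```

Expressions that pass such a check could be evaluated entirely against the XBRL model, with no XML tree in sight.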
I’ve written previously about the risks of trying to treat XBRL as XML. Restricting XBRL Formula so that it can only work with the XBRL Model should lead to better written, more robust XBRL Formulas, and hopefully will guide rule writers away from concerning themselves with irrelevant syntactic details.
Of course, whilst a pure XBRL approach has the potential to use far less memory than one which must retain an XML model, ultimately any in-memory approach is going to have memory requirements that are proportional to document size, and so will always have an upper limit on the size of document that can reasonably be processed on any given hardware. For extremely large instance documents, more radically different approaches to processing will be necessary. Such approaches may well rule out the possibility of using familiar technologies such as Sphinx and Formula altogether. For such documents, moving to a pure XBRL approach is a necessary first step, but it’s not the whole solution.
I’m sure that these suggestions won’t appeal to everyone, but as XBRL moves into the enterprise, we need to free the information from the syntax used to represent it.
1. From this point on, I use the terms “DOM” and “DOM-like” to refer to any approach that stores an in-memory representation of the full XML model. Whilst it’s certainly possible to create DOM-like implementations that are more efficient than an actual DOM implementation, memory usage is still likely to be some multiple of the original document size and so will still suffer from the same fundamental performance limitations.