The Data Storage Revolution

Tom Swann
Kainos
Feb 8, 2017 · 7 min read


Life used to be relatively simple for the architect who wanted to store data somewhere. The relational database management system (RDBMS) was the world-conquering gold standard. All you really had to do was pick between proprietary and open source, and then choose from the select few vendors who dominated each space.

However (and as my Twitter feed seems intent on reminding me), we live in interesting times.

The RDBMS was a revolution of its own. The fact that it remained the dominant paradigm for storing data for such a long period — really encompassing the rise of the PC and the internet as we know it — shows what a strong and popular model it is. It is primarily characterised by three things:

  • The relational data model (schemas, constraints, normalisation)
  • SQL as a way for people to ask questions of the data
  • ACID as a model for handling transactions
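To make those three characteristics concrete, here is a minimal sketch using Python's built-in sqlite3 module: a schema with a constraint, SQL to query it, and a transaction that either applies completely or not at all. The table and data are invented for illustration.

```python
# A minimal sketch of schema, SQL and ACID with Python's built-in sqlite3 module.
# (Table and column names are made up for illustration.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the with-block is a transaction: all or nothing
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE owner = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE owner = 'bob'")
except sqlite3.IntegrityError:
    print("transfer rejected, no partial update applied")

print(conn.execute("SELECT owner, balance FROM accounts").fetchall())
# -> [('alice', 100), ('bob', 50)]  (the failed transfer left the data untouched)
```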

When you strip away the bullshit, the modern buzzwords NoSQL and Big Data essentially refer to a second revolution in data storage, one which came about as a response to some of these properties of the relational database.

You could boil this down to two things: firstly, the restrictions imposed by the use of schemas, and secondly, the limitations on scalability inherent in the ACID transactional model.

It was never really about SQL. If you have meaningful data, then you will almost certainly need to extract answers from it using common tools which the majority of people understand.

SQL fills that role very nicely. Indeed, it has gradually crept back into many of the new data stores: Hadoop added Hive, Spark added Spark SQL, and even something as heavy-duty as HBase has a SQL layer in the form of Phoenix.
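To show what that return of SQL looks like in practice, here is a minimal Spark SQL sketch in Python. It assumes pyspark is installed, and the table and rows are invented for illustration.

```python
# A minimal PySpark sketch: register a DataFrame as a temporary view and query it
# with plain SQL. (Assumes pyspark is installed; the data is made up.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-strikes-back").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name").show()
spark.stop()
```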

A Brief History

And you may ask yourself
Well…How did I get here?

The genesis of NoSQL and Big Data lies in the mid ’00s, with two of the giants of the web: Google and Amazon.

For Google, the need to store and index the web was the driving force behind the development of the Google File System (GFS), the white paper for which went on to provide the foundations of an open source implementation: Hadoop.

In terms of data storage, Hadoop is a distributed filesystem. As your data grows, instead of buying bigger and ever more astronomically priced specialist kit, the idea is to buy lots of relatively cheap, commodity hardware and spread your data across that collective pool of disk space.

Because it is a filesystem, it is extremely flexible in terms of what data you actually put in it — text, audio, images, video. Again, this deviates significantly from the RDBMS model in which you need to conform to a pre-defined schema up front.
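As a rough sketch of that flexibility, here is what loading a mixed bag of files into HDFS looks like when driven from Python via the standard hdfs dfs command line tool. A running cluster is assumed, and the paths and file names are invented for illustration.

```python
# Hypothetical sketch: pushing very different kinds of files into HDFS with the
# standard `hdfs dfs` CLI. HDFS stores bytes; no schema is enforced on write.
# (Assumes a running cluster and the hadoop client on the PATH; paths are made up.)
import subprocess

for local_file in ["minutes.txt", "call-recording.mp3", "cctv-frame.jpg", "promo.mp4"]:
    subprocess.run(["hdfs", "dfs", "-put", local_file, "/data/raw/"], check=True)

# The files are now split into blocks and replicated across the cluster's DataNodes.
subprocess.run(["hdfs", "dfs", "-ls", "/data/raw/"], check=True)
```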

Amazon's contribution originates in their need to operate a commercial website at massive scale, and in the limitations of the client-server model of which the RDBMS was an invariable part.

When you hit the scale at which you need to split (or ‘shard’) a database across multiple machines, the ability to meet the requirements of the ACID model becomes difficult indeed.
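To make the idea of sharding concrete, here is a toy sketch that routes each key to a machine by hashing it. The node names are invented, and real systems layer replication and rebalancing on top, which is exactly where the ACID guarantees start to strain.

```python
# Toy sharding: pick a machine for each key by hashing it.
# Node names are invented; real systems add replication, virtual nodes and
# rebalancing on top of this basic routing step.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for order_id in ["order-1001", "order-1002", "order-1003", "order-1004"]:
    print(order_id, "->", shard_for(order_id))
```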

It was this engineering challenge that led to Dynamo, Amazon's highly available key/value store, and to the subsequent proliferation of NoSQL databases: solutions which typically traded off aspects of consistency for high availability (see the CAP theorem; the TL;DR is that you can't have your cake and eat it too).

Doug Cutting referred to this period in the late ’00s as the Cambrian Explosion of data technologies: the rapid development of new data storage solutions which followed in the wake of pioneering technologies like Hadoop and early key/value stores like Dynamo.

The State Of The Art

Same as it ever was…

Today, there is an overwhelming choice of data stores, catering to every specific need of physical persistence, data modelling, data processing, scalability and access patterns.

The diagram below should give you a sense of some of the main categories into which modern data stores fall, along with some popular examples of each.

It’s a little bit rough, but it is organised from technologies which are generally “more structured” in the top left to “less structured” in the bottom right. Pretty much everything outside of the relational database box makes some kind of concession on consistency in favour of availability and scalability.

Hadoop is really a whole ecosystem of data processing tools layered on top of the HDFS file system; here I'm emphasising the storage layer itself. Object stores like S3 are also often integrated with frameworks like Spark, which can equally sit atop HDFS or a distributed database like Cassandra.

The “wide column store” is an interesting sub-category of the k/v store. In systems like HBase and Cassandra it is possible for each row to have a different set of columns. They are thus very efficient at storing data which is “wide” and “sparse”. In a traditional RDBMS you might have to define hundreds of columns to cater for this and end up populating most of them with NULL values — a great waste of disk space and difficult to maintain.
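As a sketch of what "each row has its own columns" looks like in practice, here is the idea expressed with the happybase Python client for HBase. It assumes a running HBase Thrift server and an existing table with an "events" column family; the table name, column names and row keys are invented.

```python
# Sketch of sparse, wide rows with the happybase client for HBase.
# (Assumes an HBase Thrift server on localhost and an existing 'user_events'
# table with an 'events' column family; names and values are made up.)
import happybase

connection = happybase.Connection("localhost")
table = connection.table("user_events")

# Each row stores only the columns it actually has -- no NULL padding.
table.put(b"user-1", {b"events:login": b"2017-02-08", b"events:purchase": b"order-42"})
table.put(b"user-2", {b"events:login": b"2017-02-07"})
table.put(b"user-3", {b"events:password_reset": b"2017-02-01"})

for key, data in table.scan():
    print(key, data)
```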

The Impedance Mismatch

Lost in translation

We mentioned that the ability to trade strict consistency for greater scalability was one of the motivating factors for non-relational databases.

The restrictive nature of the ‘schema on write’ model was the other.

In a relational database you insert data into tables, and those tables have a strict definition of what is allowed — the schema. This is a very good thing in a lot of ways, but it does present some limitations.

One problem is that it's difficult to evolve over time. Changing a table's schema has implications for backwards compatibility, and in many RDBMS deployments it can also mean locking tables or taking the database offline while the changes are applied.

If you are iterating fast and your model is constantly changing, then dealing with relational schemas can feel like a lot of hard work.

This is where document databases come in. Data stores like MongoDB address the so-called "impedance mismatch" between how data is represented in application code and how it has to be translated into the normalised tabular structures imposed by the data tier.

If you deploy an Express JS application on top of MongoDB then, as the old "turtles all the way down" joke goes, it's JavaScript all the way down.
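That example stack is JavaScript end to end; for a sketch of the same flexibility from Python, here is the idea with the pymongo driver. It assumes a local mongod, and the database, collection and documents are invented.

```python
# Sketch of the document model with pymongo: what you store is what your code uses,
# and two documents in the same collection can have different shapes.
# (Assumes a local mongod; database, collection and fields are made up.)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client.shop.products

products.insert_one({"name": "t-shirt", "price": 9.99, "sizes": ["S", "M", "L"]})
products.insert_one({"name": "ebook", "price": 4.99, "download_url": "https://example.com/book"})

for doc in products.find({"price": {"$lt": 5}}):
    print(doc["name"])  # -> ebook
```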

Think Graphs

Wanting connections, we found connections — always, everywhere, and between everything. [Foucault’s Pendulum]

Graph databases are about extracting meaning from the relationships between things — where there is more value in the connections than perhaps the entities themselves.

Knuth's observation that the choice of data structure has a huge impact on how that data can be processed is perhaps nowhere more evident than with graphs.

Neo4j is by far the most popular pure graph database. It's interesting to note that general data processing frameworks like Spark also incorporate the ability to process data as graphs (GraphX), although the persistence layer may not be as finely tuned for that specific need.

The classic graph use case is the "six degrees of separation" problem: how many hops does it take to connect someone to Kevin Bacon? Graphs shine when solving problems of connectedness and traversal: route planning, social networks, genetics, identifying cliques and small worlds. Google's PageRank algorithm for ranking search results, which exploits the connectedness between documents on the web, is one of the most famous examples of graph processing.

An RDBMS allows you to relate things through the use of foreign keys and joins; however, you typically need to know the number of joins in your query up front. If you don't know how many connections it might take to establish a link, then graphs are your answer.
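As a toy sketch of that kind of open-ended traversal, here is a breadth-first search over a small in-memory "co-starred with" graph. The graph is hand-rolled for illustration, where a real system would hand the same question to a graph database.

```python
# Toy breadth-first search over a "co-starred with" graph: count the hops between
# two actors without knowing the path length up front -- exactly the query shape
# that is awkward to express as a fixed number of SQL joins.
from collections import deque

costars = {
    "Kevin Bacon": ["Tom Hanks"],
    "Tom Hanks": ["Kevin Bacon", "Meg Ryan"],
    "Meg Ryan": ["Tom Hanks", "Billy Crystal"],
    "Billy Crystal": ["Meg Ryan"],
}

def degrees(start, target):
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        actor, hops = queue.popleft()
        if actor == target:
            return hops
        for other in costars.get(actor, []):
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None

print(degrees("Billy Crystal", "Kevin Bacon"))  # -> 3
```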

Data Streams

Persisting data as a stream of events is not a new concept (the transaction log in every RDBMS is exactly this), but it has found a new home in a variety of distributed messaging systems, of which Kafka is currently the foremost example.

Kafka is the database log at web scale. It looks to the outside world like a pub/sub message queue, but under the hood it is a resilient, high-throughput distributed event log.

Kafka can be used as the backbone for micro-service architectures, as a pipeline for large scale data integration (its original use case at LinkedIn), and for stream processing: a flavour of analytics where the value lies in reducing the time between an event happening and being able to measure it and react (think network intrusion detection or credit card fraud).
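As a sketch of that pub/sub surface, here is a tiny producer and consumer written with the kafka-python client. The broker address, topic name and events are invented, and a running Kafka broker is assumed.

```python
# Sketch of Kafka's pub/sub surface with the kafka-python client.
# (Assumes a broker on localhost:9092; the topic and events are made up.)
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("card-transactions", {"card": "1234", "amount": 9250})
producer.flush()

consumer = KafkaConsumer(
    "card-transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    if event["amount"] > 5000:
        print("possible fraud:", event)  # react while the event is still fresh
```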

Complexity Kills

So what are we to do about all this variety?

For those who are new to the world of non-relational data storage, I usually give the optimistic advice that what this really represents is opportunity: the opportunity to solve lots of different classes of problems that were previously beyond the reach of all but the largest Valley web companies.

Hadoop and all the offshoots of the non-relational database represent a democratisation of the technology that made the modern web possible.

That opportunity is there for anyone with the right kind of data problem — be it a retail company trying to scale with a growing customer base or a group of journalists trying to make sense of the Panama Papers.

As always, it is easy to make a software system work harder than necessary to get a job done — this is the essence of over-engineering. It is also undeniable that distributed systems are inherently harder to understand.

In many ways a data store is the foundation on which the rest of a software solution is built, and it has big implications for how you will be able to scale. And if you DON'T need to scale, then think hard before you look beyond the relational database. There are good reasons why it has been at the top for so long.

The final thought I would leave you with is that you'll save a lot of time and money by thinking about this early. Building things and putting them into production is expensive; designing and thinking about scale up front is relatively cheap.

