Exploring AWS Neptune

Tom Swann
8 min read · Aug 17, 2021

Use cases and considerations for non-functional requirements when using Amazon’s Graph DB as-a-service offering.

What is Neptune?

AWS Neptune is a fully managed graph database service that supports a variety of data models for processing highly connected datasets — a persistence layer upon which you can build applications that extract insight from those connections in a performant and scalable way.

Why Graphs?

A database is traditionally a way of storing information about “things” — customers, orders, videos, cows, biscuits — you name it.

However, sometimes we’re more interested in the relationships that exist between the things than in the things themselves.

And whilst the relational database provides many tools which can tell us this type of information — foreign keys, JOINs — these tools quickly hit limits of performance and syntactic expressiveness when the main problem involves interrogating a large and complex set of relationships.

In a graph database such as Neptune, the main components of the graph model are:

  • Vertices — these represent the “things” and, much like the rows of a table in an RDBMS, they can have attributes with various types and values
  • Edges — representing relationships between vertices. These are the facts which connect things. They too can have a variety of attributes.
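
As a rough illustration before we get to Neptune’s own formats, a property graph is just vertices and edges that each carry arbitrary key/value attributes. A toy sketch in plain Python (the documents and attributes here are invented for the example):

    # A toy property graph: vertices and edges with arbitrary attributes.
    vertices = {
        "doc-1": {"label": "document", "title": "Q3 Report"},
        "doc-2": {"label": "document", "title": "Audit Notes", "author": "T. Swann"},
    }
    edges = [
        # Each edge connects two vertices and can carry its own attributes.
        {"id": "e-1", "from": "doc-1", "to": "doc-2", "label": "MATCHES", "score": 0.97},
    ]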

The types of use case which led our team to consider graphs are actually fairly common and repeated across many different organisations and business domains.

If you have a large quantity of unstructured data, say a big network drive full of documents (who would ever have such a thing, I hear you cry; it’s unheard of), then there may be value to you in knowing whether those documents contain “things” (vertices) that match (the edge) similar things in other documents.

A graph shines in this type of use case because of a few factors:

  • Whether or not a document contains a certain attribute of information is completely unknown beforehand, given the unstructured nature of the data. In a relational model we define schemas up front, but what we want here is more of a schema-less approach, where attributes can be assigned to vertices arbitrarily if they exist. Neptune gives us this ability. We might establish that a relationship exists between two documents only if they share the same attribute and those attribute values match in some meaningful way (in which case we insert an edge into our graph)
  • Connectedness may be unbounded. JOINs in a relational database are limited when it comes to connections that may be arbitrarily deep. You may have heard of the famous “six degrees of separation” (or of Kevin Bacon), whereby anyone in the world can be linked to anyone else in no more than six hops between individuals. Any given document could be linked to any number of other documents by many different combinations of attributes (see the sketch after this list).
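
To make the “arbitrary depth” point concrete, here is a minimal sketch in plain Python: a breadth-first search over the toy graph above, finding everything reachable from a starting document within some number of hops. This is exactly the kind of question that becomes painful to express (and slow to run) as chains of self-JOINs in SQL.

    from collections import deque

    def reachable(start, edges, max_hops=6):
        """Return every vertex within max_hops of start, following MATCHES edges."""
        # Build an adjacency list; MATCHES edges are treated as undirected here.
        adj = {}
        for e in edges:
            adj.setdefault(e["from"], []).append(e["to"])
            adj.setdefault(e["to"], []).append(e["from"])
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            vertex, hops = queue.popleft()
            if hops == max_hops:
                continue
            for neighbour in adj.get(vertex, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
        return seen - {start}

    reachable("doc-1", edges)  # {'doc-2'} for the toy graph above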

Building The Graph

Neptune can represent its graph using a number of different data models. The one we chose to go with is the Gremlin load format as used by Apache TinkerPop. This is very handy, as vertices and edges can be represented as a series of CSV files which can be reliably and cost-effectively stored in S3, even for very large graphs (we had one of those).

It looks something like this:

A Neptune CSV vertex file

Vertices can have a variety of attributes represented by the <name>:<type> column format. This is the aforementioned capability to use a schema-on-read approach, defining zero or more properties for each of our records.
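
For illustration, a vertex file with this header convention might look like the following (the ~-prefixed system columns are explained below; the other columns are invented for the example):

    ~id,~label,title:String,pageCount:Int
    doc-1,document,Q3 Report,42
    doc-2,document,Audit Notes,17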

And for the edges:

A Neptune CSV edge file

Note the ‘~’ character in the header titles. This denotes attributes which are required system fields: each vertex or edge has a unique ~id, often a UUID; for edges, ~from and ~to are the IDs of the vertices on either end of the relationship, and ~label is the name of the relationship, or relationship type.
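
An edge file for the same toy graph might look like this (again, the non-system score column is invented for the example):

    ~id,~from,~to,~label,score:Double
    e-1,doc-1,doc-2,MATCHES,0.97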

The inclusion of the tilde character is an example of an implementation detail which seems inconsequential on the face of it, but in my case was actually quite an annoyance.

RDS for PostgreSQL (the data source in this case) provides a neat extension which allows data to be exported directly to S3 storage. Unfortunately, your options for naming the output headers are limited to valid PostgreSQL column names, which rules out ‘~’, and there is no way to override this on the Neptune API.

So… hand crank the export process. Admittedly it’s not a huge amount of effort, just annoying when there’s an OOTB mechanism that takes you 99% of the way there which you can’t use.
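
For what it’s worth, the hand crank can be fairly small. A minimal sketch (assuming psycopg2 and boto3, with an invented documents table and placeholder bucket) reads rows out of PostgreSQL and writes the ‘~’-prefixed header on the way to S3:

    import csv
    import io

    import boto3
    import psycopg2

    # Hypothetical source database and table; substitute your own.
    conn = psycopg2.connect("dbname=docs user=etl")
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Write the '~'-prefixed header Neptune expects, which the
    # RDS-to-S3 export extension would not let us name directly.
    writer.writerow(["~id", "~label", "title:String", "pageCount:Int"])

    with conn.cursor() as cur:
        cur.execute("SELECT id, title, page_count FROM documents")
        for doc_id, title, page_count in cur:
            writer.writerow([doc_id, "document", title, page_count])

    boto3.client("s3").put_object(
        Bucket="my-graph-bucket",
        Key="vertices/documents.csv",
        Body=buf.getvalue().encode("utf-8"),
    )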

Loading the Graph

When it comes to getting our data into the graph database, AWS provides the Neptune Bulk Loader service.

Neptune Bulk Load architecture

As illustrated above, this works by providing an API which delegates the load of files in the graph format described above to the Neptune service layer, using S3 as the data source.

You make an API call which includes the path to your files and a variety of configuration options, such as those controlling error handling, and behind the scenes the service manages the load of the data for you.
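
The call itself is a plain HTTP POST to the cluster’s loader endpoint. A minimal sketch using the requests library (the endpoint, bucket and IAM role ARN are placeholders):

    import requests

    NEPTUNE = "https://my-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182"

    response = requests.post(f"{NEPTUNE}/loader", json={
        "source": "s3://my-graph-bucket/vertices/",
        "format": "csv",                 # the Gremlin CSV format described above
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1",
        "failOnError": "FALSE",          # error-handling behaviour is configurable
        "parallelism": "MEDIUM",         # how much cluster resource the load may use
    })
    load_id = response.json()["payload"]["loadId"]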

For bulk load scenarios, this is preferable to rolling your own solution using inserts via the usual Gremlin API through which you run your queries. If you’re more familiar with RDBMS, it’s the order-of-magnitude difference between using a bulk insert and using cursors with per-record insert statements.

The service is smart enough to automatically identify which files in the provided S3 location are the vertices and which are edges, and it ensures that all vertex data is loaded first. Greater parallelism can be achieved by splitting both sets of files into smaller chunks, thus improving load performance. Again, this is managed internally by the service; you just need to ensure your input files are partitioned in some way that gets you an even distribution.

A graph of approx. 200 million vertices and edges could be loaded this way in a matter of hours using a reasonably modest db.r4.xlarge instance size and some limited partitioning of the source files.

Among the configuration options mentioned is how much of the cluster’s resources you dedicate to the load process. This could be as much as totally locking out all other access to maximise throughput. Obviously, whether that’s a decision you can make depends on your own SLAs.

Monitoring Bulk Loads

The Bulk Loader is somewhat limited in the observability that you get out of the box. There’s no handy console dashboard, for example.

What you do get is an API which allows you to inspect the progress of any bulk load tasks the cluster is currently managing. Due to its nature it is of course an async background job, and you need to poll this API in order to find out what’s happening.

This Get Status API takes the load id (just another UUID) which is assigned to each load when it is initiated, and returns various metrics: total records to be ingested, total processed to date, failure counts and so on, along with the crucial overall status. LOAD_COMPLETED and LOAD_FAILED are the two main terminal outcomes, with a variety of more specific statuses for auth failures, partial failures and the like.

All super useful, but it does require you to roll your own Lambda or similar to make that API call and poll for the eventual result, so be aware that this client is something you will need to plan to implement, assuming fire-and-forget is not an option.
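
That polling client really is just a loop around one GET call. A minimal sketch (reusing the placeholder endpoint from the load request above; the fields shown are those returned under the payload’s overallStatus):

    import time

    import requests

    def wait_for_load(load_id, poll_seconds=30):
        """Poll the Neptune loader Get Status API until the load reaches a terminal state."""
        while True:
            status = requests.get(f"{NEPTUNE}/loader/{load_id}").json()
            overall = status["payload"]["overallStatus"]
            print(f"{overall['status']}: {overall['totalRecords']} records processed")
            if overall["status"] in ("LOAD_COMPLETED", "LOAD_FAILED"):
                return overall
            time.sleep(poll_seconds)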

Inspecting the Graph

Gremlin is certainly a new experience if you’re coming from the world of SQL (as I was).

A simplified view of my use case might look something like this:

Simple data model for entity resolution across documents

AWS provides several services for the task of actually extracting named entities (things like names, addresses and so on) from unstructured documents. See Textract (for text extraction) and Comprehend (for named entity recognition, among other things).

Once you have that information it becomes a case of establishing which vertices in the graph have matching values and then inserting MATCH edges to indicate that this link exists. You can either pre-calculate that up front and include it as a file in your bulk load as described above, or if you need to do it on a more case-by-case basis, then you can insert a new edge using the Gremlin query API.

The statement below shows the type of syntax used to insert an edge between two vertices of interest:

g.addE("MATCHES").from_(__.V(source_vertex_id)).to(__.V(target_vertex_id)).next()
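
In context, that statement needs a remote connection to the cluster and gremlinpython’s anonymous traversal class. Roughly like so (the endpoint is a placeholder, and the vertex ids follow the toy examples above):

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    conn = DriverRemoteConnection(
        "wss://my-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # Insert a MATCHES edge between two existing document vertices.
    g.addE("MATCHES").from_(__.V("doc-1")).to(__.V("doc-2")).next()
    conn.close()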

At this point we have all the information needed to answer queries against the graph: find all matching documents for a given property value, or find all the matches associated with a particular document. That is very useful in many general applications where we want to see the connectedness that exists between otherwise meaningless collections of files.
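
Both query shapes are short traversals in Gremlin (the label and property names follow the toy schema above):

    # All documents that share a MATCHES edge with a given document.
    g.V("doc-1").both("MATCHES").valueMap().toList()

    # All documents carrying a given property value, plus their matches.
    g.V().has("document", "title", "Q3 Report").both("MATCHES").toList()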

Other Ways To Graph

Neptune is a very capable graph database service provided by AWS. The same case for using a managed database service stands here just as it would for using RDS over rolling your own MySQL cluster.

However, it is worth considering that there are other approaches to applying graph techniques to our data.

In our use case, the need to support both batch/offline and online access to graph information was one factor in deciding whether you need a dedicated, long-running graph database instance.

If the workload were purely batch, I would strongly consider a technology like GraphX, which would allow you to interrogate data in S3 directly using something like an ephemeral EMR Spark cluster, reducing ongoing compute costs and avoiding the need to master graph data in another storage solution.
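
GraphX itself is Scala-only; staying in Python, the nearest equivalent is something like GraphFrames on Spark, reading the same bulk-load CSVs straight from S3. A rough sketch, assuming the graphframes package is installed on the EMR cluster and the bucket paths from earlier:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("doc-graph").getOrCreate()
    # connectedComponents requires a checkpoint directory.
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    # GraphFrames expects an 'id' column on vertices and 'src'/'dst' on edges,
    # so rename the Neptune system columns on the way in.
    vertices = (spark.read.option("header", True).csv("s3://my-graph-bucket/vertices/")
                .withColumnRenamed("~id", "id"))
    edges = (spark.read.option("header", True).csv("s3://my-graph-bucket/edges/")
             .withColumnRenamed("~from", "src")
             .withColumnRenamed("~to", "dst"))

    graph = GraphFrame(vertices, edges)
    graph.connectedComponents().show()  # e.g. cluster documents into match groups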

Thanks for making it this far and hopefully you found my two cents on Neptune to be useful!

Certainly a powerful tool for exploring connectedness in your data, but with a fair learning curve. As always, use as much of the out-of-the-box tooling and APIs as you can to support your requirements, and if you do need to start rolling a lot of bespoke code, consider whether there are other platform options that might be a better fit for your use case.
