Graph Databases and Analytics
While in London, I attended a talk by Nicki Watt and Michal Bachman, two OpenCredo software developers who shared their experience building a recommendation engine based on the Neo4j graph database. I asked why they had chosen that particular platform, and got a simple answer: they didn’t. The decision had been made before they came on to the project. But they did explain some useful things about what graph databases are and what they do well, not to mention what they don’t do so well.
Graph databases are said to be “schema-less”. They don’t have the relatively rigid structure that we expect in relational databases. Instead, they can store a wide variety of information, from numbers to video and more, organized in a relatively flexible structure described by a changeable graph. Neo4j is only one of many such databases; others that you may encounter include AllegroGraph and FlockDB. (MongoDB often comes up in the same conversations, though strictly speaking it is a schema-less document store rather than a graph database.) The advantages of the graph structure include rapid creation of and changes to a database, and excellent performance for many routine operations.
What graph databases aren’t made for is analytics. They don’t lend themselves to operations that might require aggregating large quantities of data, or random sampling, or classical statistical analysis. Analytics can easily bog graph databases down to a standstill.
There are practical situations where you can work around these limitations and end up with good results. For example, when making a recommendation, the trick might be to look at a relatively small number of easily accessible cases and choose the best among them. Think of how people find partners: they get to know the people in their vicinity and evaluate them as potential partners, rather than traveling far and wide in search of an optimum mate. Another strategy is to serve an old result while waiting for a new one to be calculated, so the user never experiences a long wait for a response. Graph databases perform well for transactional applications and for those where a quick analysis of a modest number of similar cases fills the bill.
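To make those two workarounds concrete, here is a minimal sketch in plain Python. It is not the speakers' implementation and it does not use Neo4j; the toy `FRIENDS` graph and the names `recommend` and `StaleCache` are my own inventions for illustration. The first function only looks at friends of friends (the "people in the vicinity"), and the small class keeps handing out the last computed answer until someone explicitly refreshes it.

```python
from collections import Counter

# Hypothetical toy graph: each person maps to the set of people they know.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}


def recommend(graph, person, limit=3):
    """Rank friends-of-friends by how many friends they share with `person`.

    Only the two-hop neighbourhood is visited, never the whole graph.
    """
    direct = graph.get(person, set())
    scores = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in direct:
                scores[candidate] += 1
    # Highest score first; ties broken alphabetically so results are stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


class StaleCache:
    """Serve the last known result; recompute only when refresh() is called."""

    def __init__(self, compute):
        self._compute = compute  # the slow function, e.g. a recommendation query
        self._results = {}

    def get(self, key):
        # Only the very first request for a key pays the full cost.
        if key not in self._results:
            self._results[key] = self._compute(key)
        return self._results[key]

    def refresh(self, key):
        # Meant to run off the request path, for example in a background job.
        self._results[key] = self._compute(key)


cache = StaleCache(lambda person: recommend(FRIENDS, person))
print(cache.get("ann"))
```

In a real system the refresh would be triggered by a scheduler or a background worker rather than called by hand, and the neighbourhood lookup would be a graph query rather than a dictionary walk; the point is only that neither step needs to aggregate over the whole data set.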
So what about classical statistical analysis, data mining and exploration? What about operations research? My take is that we will still do best to keep as much of that work away from transactional systems as possible, and that planning to create and maintain a relational database for analyst use should be part of the process when architecting new applications.