Facebook EmaiInACirclel
Data Strategy

Akka Persistence Titan Plugin, a smooth mindset adjustment from SQL to NOSQL databases

PentaGuy
PentaGuy
Blogger

Akka [1] is an Open Source toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. It provides real-time transaction processing based on the Actor Model [2].Akka is a very scalable piece of software, not only in terms of performance but also in what concerns the size of the applications it is useful to. The core of Akka, the akka-actor, is very small and can be easily dropped into an existing project where one needs asynchronicity and explicit lockless concurrency. Akk provides great performance either run on a single machine or in a distributed manner.akka developmentActors also let you manage service failures through supervision and monitoring, load management (back-off strategies, timeouts and processing-isolation), as well as both horizontal and vertical scalability (add more cores and/or add more machines).There is an impressive number of use cases for which Akka suits best, for example:

  • Transaction processing (Online Gaming, Finance/Banking, Trading, Statistics, Betting, Social Media, Telecom)
  • Service backend (any industry, any app)
  • Service REST, SOAP, Cometd, WebSockets etc. act as message hub / integration layer Scale up, scale out, fault-tolerance / HA
  • Concurrency/parallelism (any app)
  • Camel integration to hook up with batch data sources Actors divide and conquer the batch workloads
  • Communications Hub (Telecom, Web media, Mobile media)
  • Complex Event Stream Processing

Pentalog case study

At Pentalog, we are familiar with Akka and Titan since we have been using both of them for a few years already on our projects. The most common configuration we use is based on Cassandra and Elasticsearch, with Akka providing the backbone for backend systems that expose REST like APIs.

Why Titan?

Why a NOSQL databaseNOSQL databases are definitely not the next step in the evolution of the SQL databases. They are simply an alternative to them and it’s all about choosing the right tool for the job. The structure of a database always changes in time, as an application evolves and existing requirements are refined or new ones are added. These changes are more difficult to handle with relational databases. If columns are added to existing tables, they will have to either specify dummy default values for already existing entries or to allow null values. Another strategy would be to use user defined fields [3] for this, but it will imply additional joins and joins are slow [4].With a NOSQL database the evolutions in structure can be easily managed. Any new “column”, while containing useful information for new entries, will simply not exist for the old one. There is no need for DDL migration scripts from one version to the next.Usually different entities in an application have different sets of mandatory attributes and different validation rules while still preserving a common behavior. There are different strategies to deal with this in a relational database [5] but there are also disadvantages for all of them.

  • With Single tables, there will be fields with null values where the type of entity does not require them.
  • With Individual tables, retrieving data across categories would require joins and either existence checks in detail tables or replicating the information by adding discriminator columns.

On the other hand, this type of diversity is handled implicitly by NOSQL databases as their schema is completely flexible.

When to use a Graph database

Associative data modelA graph consists of Nodes (Vertices) connected through Relationships (Edges). Both Vertices and Edges can have attributes, whereas Edges can also have direction (In, Out or Both). A graph storage implementation treats Vertices and Edges equally, which makes it a perfect candidate whenever relations are 1st class citizens in a data model [6]. Besides that, graph databases are known to scale more naturally to large data sets [7]. Intuitively, a large number of models in real life – even a relational database model or an ER diagram – are graph-like and this is because Graphs are very well suited to model rich domains. The emergence of social applications in recent years largely contributes to the increased popularity of Graph databases. Think about Facebook, Twitter, LinkedIn. Their data model is largely a graph of connected users. It is worth mentioning that although graph databases are nothing new. In fact, they are even older than relational databases if we were to think about the Hierarchical database model [8].

Fine grained authorization – a classic use case for a graph database

“An authorization concept needs to be in place to ensure that users can only access data which they are entitled to” – Implementing fine grained authorization with relational data is always difficult because it should be treated as a cross-cutting concern but it is, at the same time, business-related. One must either add owner group / user information to data or add foreign keys toward data at user / group level. Assuming a “user” belongs to a “group” that can access only data of “customers” from a specific “organization” there is already a four-table join involved. With Graph databases the above scenario is usually implemented by simply linking the “user” to “role” and the “role” to “organization” through edges that can, for example, have as property the permission(s) level.

Why Titan in particular

Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time [9].Among the graph databases, Titan specifically offers the advantage of pluggable persistence layer meaning that depending on the project specific needs various databases can be used:

  • In memory – primarily developed for testing.
  • Apache Cassandra – highly scalable and designed to manage very large amounts of structured data, suitable if you are interested in Availability and Partition tolerance
  • Apache HBase – more oriented toward the Consistency and Partition tolerance side of the CAP triangle [10]
  • Oracle BerkleyDB – also on the CP side of the CAP triangle but more suitable for read intensive operations.
  • There exists also an AWS storage implementation – namely AWS DynamoDB that can be used instead of deploying and explicitly managing an imposed storage engine in case your project needs to be deployed on AWS.

All the above listed persistence providers can be easily integrated with Elasticsearch, Lucene or SOLR for full text data querying. So, why Titan? Because it is both intuitive since it is a graph database as well as highly flexible in terms of data storage. As I’ve said it before, it’s the right tool for the right job.

The challenges of implementing this plug-in

I’ve always been an Open Source user and supporter and I’ve embraced any opportunity I had to give something back to the community. The development of the akka-persistence-titan [11] is one of them. At the time I started working on the plugin I was already familiar with Titan and the Akka persistence backed by Cassandra but felt that something was missing from the whole picture. For performance reasons, AKKA persistence serializes both journal events and snapshots as byte arrays rendering the data unreadable for a 3rd party application connecting directly to the storage layer. So there I was, having to decide between byte array serialization and performance on the one hand and JSON plus human readable data on the other hand… Titan allowed me to have them both.The initial implementation was part of the project I was currently working on so the first step to developing the plugin was to extract the implementation in a separate project and adjust the code to match the AKKA standard interfaces and pass their recommended test cases. Once ready, I used Github to host my project and start to look for a way to make it publicly available as an sbt/Maven dependency.A few clicks later it turned out that OSSR [12] offers public hosting for Open Source projects. For details on using the plugin just google “akka persistence titan” and you will land to its Github page.The maintenance of this page also falls under the responsibility of the community. The page is hosted on Github and editing it is as easy as issuing a pull request.What would you recommend to those who want to implement their first AkkaTItan?Well… when coming from a “traditional” enterprise-like application, and relying exclusively on a single type of relational database for persistence, there is quite a challenge to start working with Actors and a non-relational storage layer, especially a graph one. It clearly requires a mindset adjustment.With Titan, however, the transition was smooth since graphs, as a model, are closer to reality compared to a normalized database. Hence, at least for me, it’s easier to reason about them. Gremlin [13], the default query language for Titan, requires some time to get acquainted with, especially for someone without any Groovy experience but I won’t exactly call it more complex than SQL.When it comes to the Actor model implemented in AKKA, my first word of advice would be to forget all you’ve learned about concurrency in a classic, multi-threaded application and start fresh. The Actor model takes its power from its simplicity. Actors may only modify private state and communicate between them by passing messages, there is no shared state between multiple actors. Hence, they are inherently concurrent.Another painful lesson I’ve learned is that, in most cases, we do not need ACID [14] transactions in our applications. Others have realized this before me and came with alternatives, like BASE [15] for example. Giving up the ACID mindset and admitting that eventual consistency is enough in the vast majority of use cases is the ground base of implementing an Actor-based application, regardless of the persistence layer. Since data is transferred through asynchronous messages you need a way to deal with all the risks involved: messages might arrive late or not at all, they might arrive multiple times, actors might not respond in a timely manner or simply crash.Long story short, when designing a distributed system don’t forget “there is no free lunch” 🙂


Leave a Reply

Your email address will not be published. Required fields are marked *