Big data analytics with Neo4j and Java, Part 1

Graph databases like Neo4j are ideal for modeling complex relationships between collections of users--and they move through big data at lightspeed

Relational databases have dominated data management for decades, but they've recently lost ground to NoSQL alternatives. While NoSQL data stores aren't right for every use case, they are generally better for big data, which is shorthand for systems that process massive volumes of data. Four types of data store are used for big data:

  • Key/value stores such as Memcached and Redis
  • Document-oriented databases such as MongoDB, CouchDB, and DynamoDB
  • Column-oriented data stores such as Cassandra and HBase
  • Graph databases such as Neo4j and OrientDB

This article introduces Neo4j, which is a graph database used for interacting with highly related data. While relational databases are good at managing relationships between data, graph databases are better at managing n-th degree relationships. As an example, take a social network, where you want to analyze patterns involving friends, friends of friends, and so on. A graph database would make it easy to answer a question like, "Given five degrees of separation, what are five movies popular with my social network that I have not yet seen?" Such questions are common for recommendation software, and graph databases are perfect for solving them. Additionally, graph databases are good at representing hierarchical data, such as access controls, product catalogs, movie databases, or even network topologies and organization charts. When you have objects with multiple relationships, you'll quickly find that graph databases offer an elegant, object-oriented paradigm for managing those objects.

The case for graph databases

As the name suggests, graph databases are good at representing graphs of data. This is especially useful for social software, where every connection you make with someone defines a relationship between you. In your last job search, you probably picked a few companies that interested you and then searched your social networks for connections to them. While you might not know anyone working for one of those companies, someone in your social network likely does. Solving a problem like this is easy at one or two degrees of separation (your friend or a friend of a friend), but what happens when you start extending the search across your network?

In their book, Neo4j in Action, Aleksa Vukotic and Nicki Watt explore the differences between relational databases and graph databases for solving social network problems. I'm going to draw on their work for the next few examples to show why graph databases are becoming an increasingly popular alternative to relational databases.

Modeling complex relationships: Neo4j vs MySQL

From a computer science perspective, when we think about modeling relationships between users in a social network, we might draw a graph like the one in Figure 1.

osjp neo4j fig01 Steven Haines

Figure 1. Graphing relationships in a social network

A user has IS_FRIEND_OF relationships with other users, and those users have IS_FRIEND_OF relationships with other users, and so forth. Figure 2 shows how we'd represent this in a relational database.

Figure 2. Modeling a social graph in a relational database

The USER table has a one-to-many relationship with the USER_FRIEND table, which models the "friend" relationship between two users. Now that we've modeled the relationships, how would we query our data? Vukotic and Watt measured the query performance for counting the number of distinct friends going out to a depth of five levels (friends of friends of friends of friends of friends). In a relational database the queries would look as follows:


# Assumption: each row means "user_1 has user_2 as a friend"

# Depth 1
select count(distinct uf.user_2) from user_friend uf where uf.user_1 = ?

# Depth 2
select count(distinct uf2.user_2) from user_friend uf1
  inner join user_friend uf2 on uf1.user_2 = uf2.user_1
  where uf1.user_1 = ?

# Depth 3
select count(distinct uf3.user_2) from user_friend uf1
  inner join user_friend uf2 on uf1.user_2 = uf2.user_1
  inner join user_friend uf3 on uf2.user_2 = uf3.user_1
  where uf1.user_1 = ?

# And so on...

What is interesting about these queries is that each time we go out one more level, we are required to join the USER_FRIEND table with itself. Table 1 shows what Vukotic and Watt found when they inserted 1,000 users with approximately 50 relationships each (50,000 relationships) and ran the queries.

Table 1. MySQL query response time for various depths of relationships

Depth   Execution time (seconds)   Count result
2       0.028                      ~900
3       0.213                      ~999
4       10.273                     ~999
5       92.613                     ~999

MySQL does a great job of joining data up to three levels away, but performance degrades rapidly after that. The reason is that each time the USER_FRIEND table is joined with itself, MySQL must compute the Cartesian product of the table, even though the majority of the data will be thrown away. For example, when performing that join five times, the Cartesian product results in 50,000^5 rows, or roughly 3.1 x 10^23 rows. That's a waste when we are only interested in 1,000 of them!
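The size of that worst-case product is easy to check with exact integer arithmetic. The following is a minimal sketch, not a measurement of what MySQL actually materializes; the class and method names are made up for illustration.

```java
import java.math.BigInteger;

// Illustrative only: size of an unfiltered Cartesian product of a table with itself.
class JoinBlowup {

    // Rows produced when `copies` copies of a table with `tableRows` rows
    // are combined with no join condition applied.
    static BigInteger cartesianRows(long tableRows, int copies) {
        return BigInteger.valueOf(tableRows).pow(copies);
    }

    public static void main(String[] args) {
        // 50,000 rows matches the USER_FRIEND table described in the article.
        for (int copies = 1; copies <= 5; copies++) {
            System.out.println(copies + " copies: " + cartesianRows(50_000, copies) + " rows");
        }
    }
}
```

For five copies of a 50,000-row table the result is 3,125 followed by 20 zeros, or about 3.1 x 10^23 rows.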

Next, Vukotic and Watt tried executing the same type of queries against Neo4j. These entirely different results are shown in Table 2.

Table 2. Neo4j response time for various depths of relationships

Depth   Execution time (seconds)   Count result
2       0.04                       ~900
3       0.06                       ~999
4       0.07                       ~999
5       0.07                       ~999

The takeaway from these execution comparisons is not that Neo4j is better than MySQL. Rather, when traversing these types of relationships, Neo4j's performance depends on the number of records retrieved, whereas MySQL's performance depends on the number of records in the USER_FRIEND table. Thus, as the total number of relationships increases, the response times for MySQL queries will likewise increase, whereas the response times for Neo4j queries will stay roughly the same as long as each query touches the same number of relationships.
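To see why the cost of a traversal tracks the relationships it visits rather than the size of the whole data set, consider a breadth-first walk over an in-memory adjacency list. This is only a sketch of the idea in plain Java and says nothing about how Neo4j is implemented internally; the class name, method names, and sample users are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: counting distinct friends out to a given depth.
class FriendTraversal {

    // Counts the distinct users reachable from `start` in at most `maxDepth` hops.
    // Only the friend lists of users actually reached are ever read.
    static int countWithinDepth(Map<String, List<String>> friends, String start, int maxDepth) {
        Set<String> visited = new HashSet<>();
        visited.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String user : frontier) {
                for (String friend : friends.getOrDefault(user, Collections.emptyList())) {
                    if (visited.add(friend)) {   // true only the first time we see this user
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        return visited.size() - 1;               // do not count the starting user
    }

    // A tiny made-up network used for the demonstration below.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("alice", Arrays.asList("bob", "carol"));
        friends.put("bob", Arrays.asList("dave"));
        friends.put("carol", Arrays.asList("dave", "erin"));
        friends.put("erin", Arrays.asList("frank"));
        return friends;
    }

    public static void main(String[] args) {
        Map<String, List<String>> friends = sampleGraph();
        for (int depth = 1; depth <= 3; depth++) {
            System.out.println("depth " + depth + ": "
                    + countWithinDepth(friends, "alice", depth) + " distinct friends");
        }
    }
}
```

Adding a million unrelated users to the map would not change the work done for alice, because the loop never looks at friend lists it does not reach.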

Scaling Neo4j for big data

Taking this experiment one step further, Vukotic and Watt next created a million users with 50 million relationships between them. Table 3 shows results for that data set.

Table 3. Neo4j response time for 50 million relationships

Depth   Execution time (seconds)   Count result
2       0.01                       ~2,500
3       0.168                      ~110,000
4       1.359                      ~600,000
5       2.132                      ~800,000

Needless to say, I am indebted to Aleksa Vukotic and Nicki Watt and highly recommend checking out their work. I extracted all the tests in this section from the first chapter of their book, Neo4j in Action.

Getting started with Neo4j

You've seen that Neo4j is capable of processing massive amounts of highly related data very quickly, and there's no doubt it's a better fit than MySQL (or any relational database) for certain kinds of problems. If you want to understand more about how Neo4j works, the easiest way is to interact with it through the web console.

Start by downloading Neo4j. For this article, you'll want the Community Edition, which as of this writing is at version 3.2.3.

  • On a Mac, download a DMG file and install it as you would any other application.
  • On Windows, either download an EXE and walk through an installation wizard or download a ZIP file and decompress it on your hard drive.
  • On Linux, download a TAR file and decompress it on your hard drive.
  • Alternatively, use a Docker image on any operating system.

Once you have installed Neo4j, start it up and open a browser window to the following URL:

http://127.0.0.1:7474/browser/

Log in with the default username of neo4j and the default password of neo4j. You should see a screen similar to Figure 3.

Figure 3. Web interface for Neo4j

Nodes and relationships in Neo4j

Neo4j is designed around the concept of nodes and relationships:

  • A node represents a thing, such as a user, a movie, or a book.
  • A node contains a set of key/value pairs, such as a name, a title, or a publisher.
  • A node's label defines what type of thing it is--again, a User, a Movie, or a Book.
  • Relationships define associations between nodes and are of specific types.

As an example, we might define Character nodes such as Iron Man and Captain America; define a Movie node named "Avengers"; and then define an APPEARS_IN relationship between Iron Man and Avengers and Captain America and Avengers. All of this is shown in Figure 4.

Figure 4. Nodes and relationships

Figure 4 shows three nodes (two Character nodes and one Movie node) and two relationships (both of type APPEARS_IN).
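For readers who think in objects, the same structure can be sketched in a few lines of plain Java. This is not the Neo4j API, just an illustration of the model; the class names and the choice of "name" and "title" as property keys are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a bare-bones property graph, not Neo4j's own classes.
class PropertyGraphSketch {

    // A node has a label (what kind of thing it is) and key/value properties.
    static class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();

        Node(String label, String key, Object value) {
            this.label = label;
            this.properties.put(key, value);
        }
    }

    // A relationship has a type and points from a start node to an end node.
    static class Relationship {
        final Node start;
        final String type;
        final Node end;

        Relationship(Node start, String type, Node end) {
            this.start = start;
            this.type = type;
            this.end = end;
        }
    }

    // Builds the three nodes and two relationships described for Figure 4.
    static List<Relationship> avengersExample() {
        Node ironMan = new Node("Character", "name", "Iron Man");
        Node captainAmerica = new Node("Character", "name", "Captain America");
        Node avengers = new Node("Movie", "title", "Avengers");
        return Arrays.asList(
                new Relationship(ironMan, "APPEARS_IN", avengers),
                new Relationship(captainAmerica, "APPEARS_IN", avengers));
    }

    public static void main(String[] args) {
        for (Relationship r : avengersExample()) {
            System.out.println("(:" + r.start.label + " " + r.start.properties + ")"
                    + "-[:" + r.type + "]->"
                    + "(:" + r.end.label + " " + r.end.properties + ")");
        }
    }
}
```

Both relationships end at the same Movie object, which is the point of the model: the movie exists once and each character links to it.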

Modeling and querying nodes and relationships

Similar to how a relational database uses Structured Query Language (SQL) to interact with data, Neo4j uses Cypher Query Language to interact with nodes and relationships.

Let's use Cypher to create a simple representation of a family. At the top of the web interface, look for the dollar sign. This indicates a field that allows you to execute Cypher queries directly against Neo4j. Enter the following Cypher query into that field (I'm using my family as an example, but feel free to change the details to model your own family if you like):

CREATE (person:Person {name: "Steven", age: 45}) RETURN person

The result is shown in Figure 5.

Figure 5. Creating a Person with Cypher Query Language

In Figure 5 you can see a new node with the label Person and the name Steven. If you hover your mouse over the node in your web console, you will see its properties at the bottom. In this case, the properties are ID: 19, name: Steven, and age: 45. Now let's break down the Cypher query:

  • CREATE: The CREATE keyword is used to create nodes and relationships. In this case, we pass it a single argument, which is a Person enclosed in parentheses, so it is meant to create a single node.
  • (person: Person {...}): The lower case "person" is a variable name through which we can access the person being created, while the capital "Person" is the label. Note that a colon separates the variable name from the label.
  • {name: "Steven", age: 45}: These are the key/value properties that we're defining for the node we're creating. Neo4j does not require you to define a schema before creating nodes, and each node can have its own set of properties. (Most of the time you define nodes with the same label to have the same properties, but it is not required.)
  • RETURN person: After the node is created, we ask Neo4j to return it to us. This is why we saw the node appear in the user interface.

The CREATE command (which is case insensitive) is used to create nodes and can be read as follows: create a new node with the Person label that contains name and age properties; assign it to the person variable and return it back to the caller.

Querying with Cypher Query Language

Next we want to try some querying with Cypher. First, we'll need to create a few more people, so that we can define relationships between them.


    CREATE (person:Person {name: "Michael", age: 16}) RETURN person
    CREATE (person:Person {name: "Rebecca", age: 7}) RETURN person
    CREATE (person:Person {name: "Linda"}) RETURN person

Once you've created your four people, you can either click on the Person button under the Node Labels (visible if you click on the database icon in the upper left corner of the web page) or execute the following Cypher query:

MATCH (person: Person) RETURN person

Cypher uses the MATCH keyword to find things in Neo4j. In this example, we are asking Cypher to match all nodes that have a label of Person, assign those nodes to the person variable, and return the value that is associated with that variable. As a result you should see the four nodes that you've created. If you hover over each node in your web console, you will see each person's properties. (You might note that I excluded my wife's age from her node, illustrating that properties do not need to be consistent across nodes, even of the same label. I am also not foolish enough to publish my wife's age.)

We can extend this MATCH example a little further by adding conditions to the nodes we want returned. For example, if we wanted just the "Steven" node, we could retrieve it by matching on the name property:

MATCH (person: Person {name: "Steven"}) RETURN person

Or, if we wanted to return all of the children we could request all people having an age under 18:

MATCH (person: Person) WHERE person.age < 18 RETURN person

In this example we added the WHERE clause to the query to narrow our results. WHERE works very similarly to its SQL equivalent: MATCH (person: Person) finds all nodes with the Person label, and then the WHERE clause filters values out of the result set.
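Conceptually, the MATCH and WHERE steps behave like two filters applied one after the other. The sketch below mimics that with Java streams over an in-memory list; it is an analogy only, does not talk to Neo4j, and says nothing about how Neo4j plans the query. The class and method names are hypothetical. Note that a node with no age property (Linda) fails the comparison and is left out, which matches how Cypher treats a comparison against a missing value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: MATCH (person:Person) WHERE person.age < 18 RETURN person,
// imitated with plain Java collections.
class MatchWhereSketch {

    static class Node {
        final String label;
        final Map<String, Object> properties;

        Node(String label, Map<String, Object> properties) {
            this.label = label;
            this.properties = properties;
        }
    }

    // Builds a Person node; a null age means the property is simply absent.
    static Node person(String name, Integer age) {
        Map<String, Object> props = new HashMap<>();
        props.put("name", name);
        if (age != null) {
            props.put("age", age);
        }
        return new Node("Person", props);
    }

    // The four people created earlier in the article.
    static List<Node> family() {
        return Arrays.asList(
                person("Steven", 45),
                person("Michael", 16),
                person("Rebecca", 7),
                person("Linda", null));
    }

    static List<String> namesYoungerThan(List<Node> nodes, int limit) {
        return nodes.stream()
                .filter(n -> n.label.equals("Person"))              // the MATCH step
                .filter(n -> n.properties.get("age") instanceof Integer
                        && (Integer) n.properties.get("age") < limit) // the WHERE step
                .map(n -> (String) n.properties.get("name"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesYoungerThan(family(), 18));
    }
}
```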

Modeling direction in relationships

We have four nodes, so let's create some relationships. First of all, let's create the IS_MARRIED_TO relationship between Steven and Linda:

MATCH (steven:Person {name: "Steven"}), (linda:Person {name: "Linda"}) CREATE (steven)-[:IS_MARRIED_TO]->(linda) RETURN steven, linda

In this example we match two Person nodes whose name properties are Steven and Linda, and we create a relationship of type IS_MARRIED_TO from Steven to Linda. The format for creating the relationship is as follows:

(node1)-[relationshipVariable:RELATIONSHIP_TYPE]->(node2)