Labels are usually defined using an enumeration, but Neo4j only requires that labels implement the Label interface.
Next, we have to define the relationship types. This is usually done with an enum as well, as shown in the following code snippet:
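The original listing is not reproduced here, so the following is only a minimal sketch of the idea. The two interfaces are local stand-ins for the Neo4j Label and RelationshipType interfaces, declared so that the sketch compiles without the library. The enum names HrLabels and EmployeeRelationship and the CostCenter label are assumptions. The relationship names are the ones used later in this section.

```java
// Stand-ins for org.neo4j.graphdb.Label and org.neo4j.graphdb.RelationshipType,
// declared here only so that the sketch compiles without the Neo4j library.
interface Label { String name(); }
interface RelationshipType { String name(); }

// Labels of the example; the enum's built-in name() satisfies the interface.
// The enum name and the CostCenter label are assumptions.
enum HrLabels implements Label { Employee, CostCenter }

// Relationship types mentioned in this section (the enum name is an assumption).
enum EmployeeRelationship implements RelationshipType { REPORTS_TO, BELONGS_TO, MANAGER_OF }
```

Because every enum already has a name() function, no extra code is needed to satisfy the interface, which is probably why enumerations are the usual choice.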
Creating nodes and relationships using the Java API
The next step is to fill in the database. First of all, to work with Neo4j using the Java API, we always need a transaction created from the GraphDatabaseService class. When building with Java 7, you can use the following syntax:
The first line in the preceding code creates a transaction named tx. The call to success marks the transaction as successful; every change will be committed once the transaction is closed. If an exception is thrown from inside the try statement, the transaction automatically ends with a rollback. When you use Java 6, the code is a little longer because you have to close the transaction explicitly within a finally clause, as shown in the following code:
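The commit and rollback behavior described above can be sketched without the Neo4j library. SketchTransaction is a hypothetical stand-in that only mimics the success and close semantics of a real transaction; in a real application, tx would come from the GraphDatabaseService instance instead.

```java
// Hypothetical stand-in for a Neo4j transaction:
// it commits on close() only if success() was called first.
final class SketchTransaction implements AutoCloseable {
    private boolean successful;
    private String outcome = "open";

    void success() { successful = true; }

    @Override
    public void close() { outcome = successful ? "committed" : "rolled back"; }

    String outcome() { return outcome; }
}

final class TransactionSketch {
    // Java 7 style: try-with-resources closes the transaction automatically.
    static String java7Style(boolean fail) {
        SketchTransaction observed = new SketchTransaction();
        try (SketchTransaction tx = observed) {
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
            tx.success(); // reached only when no exception was thrown
        } catch (IllegalStateException e) {
            // close() has already run without success(): rolled back
        }
        return observed.outcome();
    }

    // Java 6 style: the transaction is closed explicitly in a finally clause.
    static String java6Style() {
        SketchTransaction tx = new SketchTransaction();
        try {
            tx.success();
        } finally {
            tx.close();
        }
        return tx.outcome();
    }
}
```

The two styles behave the same way; the try-with-resources form simply removes the need for the finally clause.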
Now, in our application, cost centers are identified only by their code, while employees can have the following properties:
Our relationships (REPORTS_TO, BELONGS_TO, and MANAGER_OF) can have a property (From) that specifies the dates of validity. The following code creates some examples of nodes and the relationships between them, and then sets the property values of nodes and some relationships:
In the preceding code, we used the following functions of the Java API:
createNode: This function of the GraphDatabaseService class creates a node and returns it as the result. The node will be created with a long, unique ID.
Note
Unlike primary keys in relational databases, node IDs in Neo4j are not guaranteed to remain fixed forever. In fact, the ID of a deleted node can be reused for a new node, so don't rely on IDs, especially across long-running operations.
createRelationshipTo: This creates a relationship between two nodes and returns it as a Relationship instance. This one too will have a long, unique ID.
setProperty: This sets the value of a property of a node or a relationship.
We put the time in milliseconds in the property because Neo4j supports only the following types or an array of one of the following types:
boolean
byte
short
int
long
float
double
String
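As an illustration of why a date ends up as a long, the following sketch converts a calendar date to milliseconds and back. The Map is only a stand-in for the property container of a relationship, and the class and function names are hypothetical; From is the property name used in this section.

```java
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.HashMap;
import java.util.Map;

final class DateProperty {
    // Converts a calendar date to milliseconds since the epoch, a plain long.
    // The month is zero-based, as in java.util.Calendar.
    static long toMillis(int year, int month, int day) {
        return new GregorianCalendar(year, month, day).getTimeInMillis();
    }

    // A Map stands in for the properties of a relationship (assumption).
    static Map<String, Object> validFrom(int year, int month, int day) {
        Map<String, Object> properties = new HashMap<String, Object>();
        properties.put("From", toMillis(year, month, day));
        return properties;
    }

    // Reading the property back: rebuild the Date from the stored long.
    static Date fromMillis(Object stored) {
        return new Date((Long) stored);
    }
}
```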
To store complex types, we can encode them using the types in the preceding list; for example, if we have to store a property such as the entire address of a person, we can convert the address to JSON and store it as a string. More often than not, however, the best approach is to create nodes.
This way of storing data in a JSON format is common in document-oriented DBs, such as MongoDB, but since Neo4j isn't a document database, it won't build indexes on the properties of the document. So, for example, it would be difficult or very slow to query people by filtering on any field of the address, such as the ZIP code or the country. In other words, you should use this approach only for raw data that won't be filtered or processed with Cypher; in other cases, creating nodes is a better approach.
A typical report of our application is a list of all the employees. In our database, an employee is a node labeled Employee, so we have to find all the nodes that have the Employee label. In Cypher, this can be expressed with the following query:
The MATCH clause introduces the pattern we are looking for. The e:Employee expression matches all e nodes that have the label Employee; this expression is within round brackets because e is a node. So, we have the first rule of matching expressions: node expressions must be within round brackets.
With the RETURN clause, we can specify what we want returned. In this clause, we can use any variable used in the MATCH clause. In the preceding query, we specified that we want the whole node (with all its properties). If we are interested only in the name and the surname of the employees, we only need to change the RETURN clause:
If a node lacks either of the properties, a null value is returned in its place. This is a general rule for properties from version 2 of Cypher: missing properties evaluate to null.
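The two queries discussed above are not reproduced in this text. A plausible reconstruction, held in Java string constants the way they would be passed to Cypher from Java, is sketched here. The class name and the property names name and surname are assumptions.

```java
final class EmployeeQueries {
    // Matches every node labeled Employee and returns the whole node.
    static final String ALL_EMPLOYEES =
        "MATCH (e:Employee) RETURN e";

    // Same pattern; only the RETURN clause changes (property names assumed).
    static final String NAMES_ONLY =
        "MATCH (e:Employee) RETURN e.name, e.surname";
}
```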
The next question is how to invoke Cypher from Java.
Invoking Cypher from Java
To execute Cypher queries on a Neo4j database, you need an instance of ExecutionEngine; this class is responsible for parsing and running Cypher queries, returning results in an ExecutionResult instance:
Note that we use the org.neo4j.cypher.javacompat package and not the org.neo4j.cypher package even though they are almost the same. The reason is that Cypher is written in Scala, and Cypher authors provide us with the former package for better Java compatibility.
Now with the results, we can do one of the following options:
Dumping to a string value
Converting to a single column iterator
Iterating over the full row
Dumping to a string is useful for testing purposes:
If we print the dumped string to the standard output stream, we will get the following result:
Here, we have a single column (e) that contains the nodes. Each node is dumped with all its properties. The numbers between the square brackets are the node IDs, which are the long and unique values assigned by Neo4j on the creation of the node.
When the result is a single column, or we need only one column of our result, we can get an iterator over one column with the following code:
Then, we can iterate that column in the usual way, as shown in the following code:
However, Neo4j provides a syntactic-sugar utility that shortens the iteration code:
If we need to iterate over a multiple-column result, we will write this code in the following way:
The iterator function returns an iterator of maps, where keys are the names of the columns. Note that when we have to work with nodes, even if they are returned by a Cypher query, we have to work in a transaction. In fact, Neo4j requires that every time we work with the database, either reading or writing, we must be in a transaction. The only exception is when we launch a Cypher query. If we launch the query within an existing transaction, Cypher will work as any other operation, and no change will be persisted on the database until we commit the transaction. If we run the query outside any transaction, Cypher will open a transaction for us and will commit changes at the end of the query.
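The shape of a multiple-column result, an iterator of maps keyed by column name, can be illustrated without a database. In this sketch, the rows are built by hand as a stand-in for what a query returning e.name and e.surname would produce; the class, the helper functions, and the column names are assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowIteration {
    // Builds one hand-made row; keys play the role of column names.
    static Map<String, Object> row(Object name, Object surname) {
        Map<String, Object> r = new LinkedHashMap<String, Object>();
        r.put("e.name", name);
        r.put("e.surname", surname);
        return r;
    }

    // Walks the rows and reads each column by its name.
    // A missing property simply shows up as null.
    static List<String> fullNames(Iterator<Map<String, Object>> rows) {
        List<String> names = new ArrayList<String>();
        while (rows.hasNext()) {
            Map<String, Object> current = rows.next();
            names.add(current.get("e.name") + " " + current.get("e.surname"));
        }
        return names;
    }
}
```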
Finding nodes by relationships
If you have ever used the Neo4j Java API, you might wonder why we should write the following code:
You can get the same result with the Java API with a single line of code:
However, pattern matching is much more powerful. By making slight changes to the query, we can get very important and different results; for example, we can find nodes that have relationships with other nodes. The query is as follows:
The preceding query returns all employees that have a relation with any cost center:
Again, as you can see, both n and cc are within round brackets. Here, the RETURN clause specifies both n and cc, which are the two columns returned. The result would be the same if we specified an asterisk instead of n and cc in the RETURN clause:
In fact, similar to SQL, the asterisk stands for all the variables referenced in the pattern. Unlike SQL, however, it does not cover every entity the pattern touches, only those bound to a variable. In the previous query, relationships were not returned because we didn't put a variable in square brackets.
By making another slight change to the query, we can get all the employees that have a relation with a specific cost center, for example CC1. We have to filter the code property as shown in the following code:
If we compare this query with the previous one, we can note three differences, which are listed as follows:
The query returns only the employee node n because we don't care about the cost center here.
Here, we omitted the cc variable. This is possible because we don't need to give a name to the cost center that matches the expression.
In the second query, we added curly brackets in the cost center node to specify the property we are looking for. So, this is another rule of pattern-matching expressions: properties are expressed within curly brackets.
The --> symbol specifies the direction of the relation; in this case, outgoing from n. In MATCH expressions, we can also use the <-- symbol for the inverse direction. The following expression is exactly equivalent to the previous expression:
The preceding expression will give the same result:
If we don't have a preferred direction, we will use the -- symbol:
In our example, the latter query will return the same result as the previous one because in our model, relationships go from employees to cost centers.
If we wish to know the existing relationships between the employees and cost centers, we will have to introduce another variable:
The variable r matches any relationship that exists between the employees and cost center CC1 and is returned in a new column:
So, here we have the last rule: relationship expressions must be specified in square brackets.
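The three rules (round brackets for nodes, curly brackets for properties, square brackets for relationships) can be seen side by side in the following sketch. The queries are reconstructions held in Java string constants, not the original listings; the class name and the CostCenter label are assumptions, while the code property and the value CC1 come from the text.

```java
final class CostCenterQueries {
    // Employees related to any cost center; both nodes are returned.
    static final String ANY_COST_CENTER =
        "MATCH (n:Employee) --> (cc:CostCenter) RETURN n, cc";

    // Only cost center CC1: an unnamed node with a property in curly brackets.
    static final String OUTGOING =
        "MATCH (n:Employee) --> (:CostCenter {code: 'CC1'}) RETURN n";

    // The same pattern written in the inverse direction.
    static final String INCOMING =
        "MATCH (:CostCenter {code: 'CC1'}) <-- (n:Employee) RETURN n";

    // No preferred direction.
    static final String EITHER_DIRECTION =
        "MATCH (n:Employee) -- (:CostCenter {code: 'CC1'}) RETURN n";

    // A relationship variable in square brackets adds a column for it.
    static final String WITH_RELATIONSHIP =
        "MATCH (n:Employee) -[r]- (:CostCenter {code: 'CC1'}) RETURN n, r";
}
```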
To filter the employees who belong to a specific cost center, we have to specify the relationship type:
This query matches any node n that has a relation of the BELONGS_TO type with any node cc that has CC1 as the value of its code property:
We can specify multiple relationships using the | operator. The following query will search for all employees who belong to or are managers of the cost center CC1:
This time we returned only the name and surname, while the relationship is returned in the second column:
By making a slight change to the query in the preceding code, we can return the manager as well as the employees of the cost center as the result. This can be implemented as shown in the following query:
In this query, we can see the expressivity of Cypher: a very intuitive syntax to translate the "node n belonging to the cost center having a manager m" pattern. The result is as follows:
Of course, we can chain an increasing number of relationship expressions to describe very complex patterns:
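The queries with relationship types discussed in this part are not reproduced in the text; the following sketch shows how they probably looked, held in Java string constants. The class name, the property names, and the exact RETURN clauses are assumptions. The last constant chains two relationship expressions in one pattern.

```java
final class TypedRelationshipQueries {
    // Only BELONGS_TO relationships towards cost center CC1.
    static final String BELONGS_TO_CC1 =
        "MATCH (n:Employee) -[:BELONGS_TO]-> (:CostCenter {code: 'CC1'}) RETURN n";

    // Either of two relationship types, separated by the | operator.
    static final String BELONGS_OR_MANAGES =
        "MATCH (n:Employee) -[r:BELONGS_TO|MANAGER_OF]-> (:CostCenter {code: 'CC1'}) "
        + "RETURN n.name, n.surname, r";

    // Chained pattern: node n belonging to the cost center having a manager m.
    static final String EMPLOYEES_WITH_MANAGER =
        "MATCH (n:Employee) -[:BELONGS_TO]-> (:CostCenter {code: 'CC1'}) "
        + "<-[:MANAGER_OF]- (m:Employee) RETURN n.surname, m.surname";
}
```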
Another query that is very useful in real-world applications is finding nodes reachable from one node with a certain number of steps and a certain depth. The ability to execute this kind of query, and search the neighborhood, is one of the strong points of graph databases:
This query returns the nodes that you can reach, starting from the Davies node, by visiting exactly two relationships of the graph. The result contains duplicated nodes because we have several paths to reach each of them:
Tip
To get different values, we can use the DISTINCT keyword:
This time, we haven't specified any relationship type in the square brackets, so it matches any type. The expression *2 means exactly two steps. With a little change, we can also ask for the relationships we visited:
Of course, by changing the number in the expression, we can get the query to navigate any number of relationships. However, we may also want all the nodes that are reachable within a range of steps, for example, from two to three:
This is very useful in real-world applications such as social networks because it can be used to build lists, for example, a list of people you may know.
If we also want the starting node in the result, we can modify the range to start from 0:
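The variable-length expressions described in this part can be summarized in a short sketch. The queries are reconstructions held in Java string constants; the class name, the variable names, and the use of a surname property to find the Davies node are assumptions.

```java
final class NeighborhoodQueries {
    // Hypothetical way of selecting the starting node.
    private static final String START_NODE = "(:Employee {surname: 'Davies'})";

    // Exactly two steps, any relationship type, any direction.
    static final String TWO_STEPS =
        "MATCH " + START_NODE + " -[*2]- (neighbor) RETURN neighbor";

    // Same query without duplicated nodes.
    static final String TWO_STEPS_DISTINCT =
        "MATCH " + START_NODE + " -[*2]- (neighbor) RETURN DISTINCT neighbor";

    // A variable before the asterisk also returns the visited relationships.
    static final String TWO_STEPS_WITH_RELATIONSHIPS =
        "MATCH " + START_NODE + " -[r*2]- (neighbor) RETURN r, neighbor";

    // A range of steps: from two to three.
    static final String TWO_TO_THREE_STEPS =
        "MATCH " + START_NODE + " -[*2..3]- (neighbor) RETURN DISTINCT neighbor";

    // Starting the range from 0 includes the starting node itself.
    static final String ZERO_TO_THREE_STEPS =
        "MATCH " + START_NODE + " -[*0..3]- (neighbor) RETURN DISTINCT neighbor";
}
```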
Dealing with missing parts
In our applications, we often need to get some information related to something that could be missing. For example, if we want a list of all employees together with the employees who report to them, we must also deal with the employees who have nobody reporting to them. As a first attempt, we can write:
From this, the following result is obtained:
However, this is not what we are looking for. In fact, we want all the employees, with all the employees that report to them as an option. This type of relation is similar to the OUTER JOIN clause of SQL and can be done in Cypher using OPTIONAL MATCH. This keyword allows us to use any pattern expression that can be used in the MATCH clause, but it describes only a pattern that could match. If the pattern does not match, the OPTIONAL MATCH clause sets the variables it introduces to null:
In this query, we slightly changed the previous one; we just inserted OPTIONAL MATCH (e). The effect is that the first part (e:Employee) must match, but the pattern following OPTIONAL MATCH may or may not match. So, this query returns any employee e, and if e has a relationship of the REPORTS_TO type with any other employee, that employee is returned in m; otherwise, m will be a null value. The result is as follows:
Note
Unlike object-oriented languages, where referencing any property of a null object results in a null-reference exception, in Cypher, referencing a property of a null node simply gives a null value again.
Now, let's say that we also want to know whether the employee is the manager of any cost center, and if so, which one. For this, we can write the following code:
The preceding code returns the following result:
What happened? Does it look like Smith does not report to Underwood anymore? This weird result is due to the fact that the whole pattern in OPTIONAL MATCH must match. We can't have partially matched patterns. Since we can add as many OPTIONAL MATCH clauses as we want to, we have to write the following code to get the result we are looking for:
In fact, the result is as follows:
This query works because we have two OPTIONAL MATCH clauses that can independently generate a successful match.
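The progression described in this part, from a plain MATCH to two independent OPTIONAL MATCH clauses, can be sketched as follows. The queries are reconstructions held in Java string constants; the class name, the returned properties, and the direction of the REPORTS_TO relationship are assumptions.

```java
final class OptionalMatchQueries {
    // First attempt: employees with nobody reporting to them are left out.
    static final String ONLY_WITH_REPORTS =
        "MATCH (e:Employee) <-[:REPORTS_TO]- (m:Employee) "
        + "RETURN e.surname, m.surname";

    // With OPTIONAL MATCH, every employee is returned; m is null when nobody reports.
    static final String ALL_EMPLOYEES =
        "MATCH (e:Employee) "
        + "OPTIONAL MATCH (e) <-[:REPORTS_TO]- (m:Employee) "
        + "RETURN e.surname, m.surname";

    // Two independent optional patterns: each one may match or not on its own.
    static final String WITH_MANAGED_COST_CENTER =
        "MATCH (e:Employee) "
        + "OPTIONAL MATCH (e) <-[:REPORTS_TO]- (m:Employee) "
        + "OPTIONAL MATCH (e) -[:MANAGER_OF]-> (cc:CostCenter) "
        + "RETURN e.surname, m.surname, cc.code";
}
```

Putting both relationships in a single OPTIONAL MATCH pattern would instead require the whole pattern to match, which is the cause of the weird result described above.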
As we have seen earlier, graph databases are useful to find paths between two nodes:
This query uses a construct that we have not used so far: the path assignment (path =). This kind of assignment can be done only with paths. Note that the preceding query returns all the possible paths between the two nodes. Here, the result is two paths in our database:
However, what if we need the shortest path between them? The shortest path is the one that visits the fewest nodes. Clearly, we could iterate over all the paths and take the shortest, but Cypher provides a function that does the work for us:
Let's see what is new in this query:
MATCH: In this clause, we have two node expressions (in round brackets) separated by a comma. These expressions, a and b, match any node independently, just like a Cartesian product.
RETURN: In this clause, we call the allShortestPaths function, which takes an expression as a parameter. The expression is a variable-length relation (this is the asterisk between the square brackets). Here, we don't care about relationship types and direction, but we can filter on properties, relationship types, and so on, if necessary.
RETURN: In this clause, we also have an alias. An alias must be defined using the keyword AS. It just specifies the name of the column returned.
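The two path queries discussed above are not reproduced in the text. The following sketch shows how they probably looked, held in Java string constants. The class name, the surname property, and the surname Taylor are hypothetical and serve only to select two nodes.

```java
final class PathQueries {
    // Path assignment: every path between the two nodes is returned.
    static final String ALL_PATHS =
        "MATCH path = (a {surname: 'Davies'}) -[*]- (b {surname: 'Taylor'}) "
        + "RETURN path";

    // Two independent node expressions, then the shortest paths between them,
    // returned in a column named path thanks to the AS alias.
    static final String SHORTEST_PATHS =
        "MATCH (a {surname: 'Davies'}), (b {surname: 'Taylor'}) "
        + "RETURN allShortestPaths((a)-[*]-(b)) AS path";
}
```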
Node IDs as starting points
When we execute a query like the previous one, Cypher must find the nodes and relationships that match the pattern. To do so, it must start searching from a set of nodes or relationships. We can let Cypher find the starting points of a query on its own, but we can also specify them when we want to search for a pattern that starts from a specific node or relationship; doing so can improve the performance of the query considerably.
We can assign starting points to variables in the query using the START keyword. The previous query, for example, could be rewritten in the following way:
If we execute this query and compare its elapsed time with that of the previous one, we can see that the query with explicit starting points is dramatically faster. The drawback is that we need to know the IDs of the nodes.
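As a rough sketch, a shortest-path query with explicit starting points could look like the following constant. The node IDs 2 and 3 are placeholders, not values taken from the example database, and the class name is hypothetical.

```java
final class StartQueries {
    // START binds a and b directly by node ID (placeholder IDs),
    // so Cypher does not have to search for the two nodes.
    static final String SHORTEST_PATHS_BY_ID =
        "START a = node(2), b = node(3) "
        + "RETURN allShortestPaths((a)-[*]-(b)) AS path";
}
```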