Importing data from the CSV format to Neo4j
Graph data comes in different formats, and sometimes as a combination of two or more. It is therefore important to know the various ways to import data in different formats into Neo4j. In this recipe, you will learn how to import data in the CSV file format into the Neo4j graph database server. A sample CSV file is shown as follows:
Getting ready
To get started with this recipe, install Neo4j by using the steps from the earlier recipes of this chapter.
How to do it...
There are several methods that you can use to import data in the CSV or Excel format into Neo4j; they are described in the sections that follow.
Using a batch importer
There is an excellent tool written by Michael Hunger, which can be cloned from https://github.com/jexp/batch-import.
The CSV file has to be converted into the format specified in the readme file. The tool is very flexible in terms of the number of properties and the type of each property. The nodes and relationships can be within the same file or spread across multiple files. An example of the file format is present in the sample directory. To run the tool, use the following commands:
$ wget https://dl.dropboxusercontent.com/u/14493611/batch_importer_22.zip
$ unzip batch_importer_22.zip
# Download sample nodes.csv and rels.csv from the GitHub repo under sample
$ import.sh test.db nodes.csv rels.csv
# test.db is a directory, so copy it recursively
$ cp -r test.db ${NEO4J_ROOT}/data/graph.db
Each parameter in the command is fully explained in the readme file.
Note
The batch import tool also supports a parallel batch inserter, which can speed up the import of a large number of nodes and relationships.
The benchmark figures claimed for the batch importer tool are 2 billion nodes and 20 billion relationships in 11 hours (roughly 500K elements/second). This is claimed for an EC2 high I/O instance.
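As a quick sanity check on the quoted rate, the arithmetic can be worked through in a few lines. This is only a back-of-the-envelope sketch using the figures from the note; the variable names are illustrative.

```python
# Back-of-the-envelope check of the benchmark quoted in the note.
nodes = 2_000_000_000
relationships = 20_000_000_000
hours = 11

elements = nodes + relationships
elements_per_second = elements / (hours * 3600)

print(f"{elements_per_second:,.0f} elements/second")
```

The result is a little above the rounded figure of 500K elements/second given in the note.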
Using custom scripts
Custom scripts can be written in any language of your choice to import data from CSV files. They let you check for erroneous input, leave out redundant columns, and adapt to other quirks of the data. They are best suited to a smaller number of nodes and relationships.
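To illustrate the kind of checking a custom script makes possible, the following is a minimal sketch that uses only the Python standard library. The function name clean_rows, the column names, and the sample data are all hypothetical; a real script would adapt them to the CSV file at hand.

```python
import csv
import io


def clean_rows(lines, keep_columns, required):
    """Yield dicts with only keep_columns, skipping rows that lack a required value."""
    reader = csv.DictReader(lines)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # A short row leaves missing fields as None, so normalize to "".
        if any(not (row.get(col) or "").strip() for col in required):
            print(f"skipping line {line_no}: missing required value")
            continue
        yield {col: (row.get(col) or "").strip() for col in keep_columns}


# Hypothetical sample data: the column names are made up for illustration.
sample = io.StringIO(
    "id,notes,name\n"
    "1,first,Alice\n"
    "2,second,\n"
    "3,third,Bob\n"
)
rows = list(clean_rows(sample, keep_columns=["id", "name"], required=["name"]))
print(rows)
```

The second data row is dropped because its name is empty, and the notes column is left out of every surviving row. The cleaned rows can then be handed to whichever import mechanism is in use.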
The exact format of the script will depend on the CSV file. You can write the script as follows:
# Bash script for importing nodes; reads the rows from nodes.csv
NEO4J_ROOT="/var/lib/neo4j"
while read -r LINE
do
  # The third column holds the node name
  name=$(echo "$LINE" | awk -F "," '{print $3}')
  "${NEO4J_ROOT}/bin/neo4j-shell" -c "mknode --np \"{'name':'$name'}\" -v"
done < nodes.csv
Similar scripts can be written for relationships too, as shown here:
# Bash script for creating relationships; reads the rows from rels.csv
# Format of the CSV should be startnode,endnode,type,direction
NEO4J_ROOT="/var/lib/neo4j"
IFS=","
while read -r LINE
do
  echo "$LINE"
  # Split the line on commas into startnode, endnode, type, direction
  array=($LINE)
  "${NEO4J_ROOT}/bin/neo4j-shell" -c "cd -a ${array[0]} && mkrel -d ${array[3]} -t ${array[2]} ${array[1]}"
done < rels.csv
This task can also be achieved in Python using the py2neo module, as shown in the following script:
# Sample Python code to create nodes from a CSV file
import csv
from py2neo import neo4j

graph_db = neo4j.Graph("http://localhost:7474/db/data/")
# The with block closes the file even if a row fails
with open("nodes.csv", newline="") as ifile:
    for row in csv.reader(ifile):
        # The third column holds the node name
        graph_db.create({"name": row[2]})
Similar Python code can be written for creating relationships, too. The py2neo module can also be used to create a batch request from a whole list of parameters, as shown in the following code:
# Create (or fetch) one indexed node per record in a single batch request
from py2neo import neo4j

records = [(101, "A"), (102, "B"), (103, "C")]
graph_db = neo4j.Graph("http://localhost:7474/db/data/")
batch = neo4j.WriteBatch(graph_db)
for emp_no, name in records:
    batch.get_or_create_indexed_node(
        "Employees", "emp_no", emp_no, {"emp_no": emp_no, "name": name}
    )
nodes = batch.submit()
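For relationships, one possible approach is to turn each CSV row into a Cypher statement plus parameters, and then submit those through whichever client is in use. The following sketch covers only the parsing step and uses the standard library alone. It assumes the startnode,endnode,type,direction layout from the Bash script, assumes that the direction column holds OUTGOING or INCOMING, and uses the {name} parameter style of Neo4j 2.x. The function name relationship_statements is hypothetical.

```python
import csv
import io
import re

# Relationship types cannot be passed as Cypher parameters, so they are
# checked against a conservative pattern before being placed in the query.
TYPE_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def relationship_statements(lines):
    """Turn startnode,endnode,type,direction rows into (cypher, params) pairs."""
    for row in csv.reader(lines):
        if not row:
            continue  # ignore blank lines
        start, end, rel_type, direction = row
        if not TYPE_PATTERN.match(rel_type):
            raise ValueError(f"unsafe relationship type: {rel_type!r}")
        # Assumption: INCOMING means the relationship points at startnode.
        if direction.upper() == "INCOMING":
            start, end = end, start
        cypher = (
            "MATCH (a), (b) WHERE id(a) = {start} AND id(b) = {end} "
            "CREATE (a)-[:" + rel_type + "]->(b)"
        )
        yield cypher, {"start": int(start), "end": int(end)}


# Hypothetical sample rows for illustration.
sample = io.StringIO("0,1,KNOWS,OUTGOING\n2,1,WORKS_WITH,INCOMING\n")
statements = list(relationship_statements(sample))
for cypher, params in statements:
    print(cypher, params)
```

Keeping the node IDs as parameters rather than pasting them into the query text avoids quoting problems and lets the server reuse the query plan.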
How it works...
Batch import achieves its performance by skipping all transactional behavior, which means giving up ACID guarantees. If the batch import fails, the database will be broken, possibly irrecoverably, and all the information in it may be lost.
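Because a failed batch import can leave a store unusable, one cautious pattern is to build the store in a scratch directory and copy it into place only when the import succeeds, much as the batch importer commands build test.db before copying it to graph.db. The following is a sketch of that idea under stated assumptions: import_then_publish and run_import are hypothetical names, and run_import stands in for whatever actually invokes the importer.

```python
import shutil
import tempfile
from pathlib import Path


def import_then_publish(run_import, target_dir):
    """Run the import in a scratch directory; copy to target_dir only on success."""
    scratch = Path(tempfile.mkdtemp(prefix="neo4j-import-"))
    store = scratch / "test.db"
    try:
        run_import(store)  # hypothetical callable that writes the store
        target = Path(target_dir)
        if target.exists():
            raise FileExistsError(f"{target} already exists; move it aside first")
        shutil.copytree(store, target)
        return target
    finally:
        # The scratch copy is removed whether or not the import succeeded.
        shutil.rmtree(scratch, ignore_errors=True)


def fake_import(store):
    # Stand-in for the real importer, for demonstration only.
    store.mkdir()
    (store / "marker").write_text("ok")


demo_root = Path(tempfile.mkdtemp())
published = import_then_publish(fake_import, demo_root / "graph.db")
print(sorted(p.name for p in published.iterdir()))
```

If run_import raises an exception, nothing is copied, so an existing database directory is never touched by a half-finished import.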
See also
Custom scripts can be written for REST as well as for the embedded interfaces of Neo4j. For the full cookbook on py2neo recipes, refer to http://py2neo.org/2.0/cookbook.html.