Hot questions for using Neo4j in Python

Question:

I am working with an application that uses a Neo4J graph containing about 10 million nodes. One of the main tasks that I run daily is the batch import of new/updated nodes into the graph, on the order of about 1-2 million. After experimenting with Python scripts in combination with the Cypher query language, I decided to give the embedded graph with Java API a try in order to get better performance results.

What I found is about a 5x improvement using the native Java API. I am using Neo4j 2.1.4, which I believe is the latest. I have read in other posts that the embedded graph is a bit faster, but that this should/could be changing in the near future. Has anyone observed similar results? I would like to validate my findings.

I have included snippets below just to give a general sense of methods used - code has been greatly simplified.

sample from cypher/python:

# Take the timestamp once so date_created and date_updated are identical
now = str(datetime.datetime.now())
cnode = self.graph_db.create(node(
    hash=obj.hash,
    name=obj.title,
    date_created=now,
    date_updated=now
))

sample from embedded graph using java:

final Node n = Graph.graphDb.createNode();
for (final Label label : labels){
    n.addLabel(label);
}
for (Map.Entry<String, Object> entry : properties.entrySet()) {
    n.setProperty(entry.getKey(), entry.getValue());
}

Thank you for your insight!


Answer:

What you're actually doing here is comparing the speeds of two different APIs and merely using two different languages to do that. Therefore, you're not comparing like for like. The Java core API and the REST API used by Python (and other languages) have different idioms, such as explicit vs implicit transactions. Additionally, network latency associated with the REST API will make a great difference, especially if you are using one HTTP call per node created.

So to get a more meaningful performance comparison, make sure you are comparing like for like: use Java via the REST API perhaps or use Cypher for both tests.

Hint 1: you will get better performance in general over REST by batching up a number of requests into a single API call.

Hint 2: the REST API will never be as fast as the core API as the latter is native and the former has many more layers to go through.
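
As a rough illustration of Hint 1, here is a minimal Python sketch that groups rows into batches and builds one request body per batch instead of one per node. It assumes the transactional Cypher HTTP endpoint's JSON shape (a `statements` list with `statement` and `parameters`) and a server version that supports UNWIND; the `Item` label, the batch size of 1000, and the helper names `chunked` and `build_payload` are hypothetical. Nothing is sent over the network here.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_payload(batch):
    """Build one request body that creates a node per map in `batch`."""
    # Assumed statement: the label `Item` is a placeholder for illustration.
    statement = "UNWIND {rows} AS row CREATE (n:Item) SET n = row"
    return {"statements": [{"statement": statement,
                            "parameters": {"rows": batch}}]}


# 2500 hypothetical rows become 3 HTTP calls instead of 2500.
rows = [{"hash": "h%d" % i, "name": "node %d" % i} for i in range(2500)]
payloads = [build_payload(b) for b in chunked(rows, 1000)]
bodies = [json.dumps(p) for p in payloads]
print(len(bodies))  # prints 3
```

Each entry in `bodies` would then be POSTed in a single call, so the per-request network latency is paid once per batch rather than once per node.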

Question:

In a query, I supply the following parameters:

id : 'some unique ID',
used : [an array of md5 checksums]

with the following query:

MATCH (a {id:{id}}) , (b)
WHERE b.md5 IN {used}
CREATE UNIQUE (a)-[]->(b)

and everything is wonderful. If there are 10 MD5 checksums in the used array, 10 relationships will be made to them from node "a". Cool.

But now say I have the need to add a property to that relationship - and that property will depend on the node b.

So now I have an extra parameter, an object, that looks like this:

info : {
'5fb1be1279031c1f1c65a928eb823e51': 'yolo',
'0aab9f8e81684ec778f8c0c5717f37c2': 'swag',
...
}

The MD5 keys in this object match the MD5 strings in the used array.

My first instinct would be to do this:

MATCH (a {id:{id}}) , (b)
WHERE b.md5 IN {used}
CREATE UNIQUE (a)-[{ meme:{info}[b.md5] }]->(b)

But that doesn't work. I get the error:

{ [neo4j.ClientError: [Neo.ClientError.Statement.InvalidType] Expected e1701806eda7d3ab52b143cc03d94e75 to be a java.lang.Number, but it was a java.lang.String]
  message: '[Neo.ClientError.Statement.InvalidType] Expected e1701806eda7d3ab52b143cc03d94e75 to be a java.lang.Number, but it was a java.lang.String',
  neo4j: {
    code: 'Neo.ClientError.Statement.InvalidType',
    message: 'Expected e1701806eda7d3ab52b143cc03d94e75 to be a java.lang.Number, but it was a java.lang.String'
  },
  name: 'neo4j.ClientError' }

If anyone could help, I would be immensely grateful, because I'm totally and utterly stuck on this :/


Answer:

After taking a good look at Stefan's blog, I found I could use the following Cypher to achieve what I need. It's not pretty, but until there is an easier way to conditionally CREATE things than FOREACH/CASE tricks, it will have to do:

First split your object of key/value pairs into two arrays:

> fileMD5  : ['5fb1be1279031c1f1c65a928eb823e51','0aab9f8e81684ec778f8c0c5717f37c2'..]
> fileInfo : ['yolo','swag'...]
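
On the client side, that split could look like the following Python sketch. The parameter names `fileMD5` and `fileInfo` come from this answer; the `info` dict reuses the sample values from the question, and reusing the keys as `used` relies on the question's statement that the keys match the `used` array.

```python
# Sample object from the question: md5 checksum -> value for the relationship.
info = {
    "5fb1be1279031c1f1c65a928eb823e51": "yolo",
    "0aab9f8e81684ec778f8c0c5717f37c2": "swag",
}

# Build two parallel lists so that file_info[i] belongs to file_md5[i].
file_md5 = list(info)
file_info = [info[md5] for md5 in file_md5]

# Parameters to send along with the query.
params = {
    "id": "some unique ID",
    "used": file_md5,
    "fileMD5": file_md5,
    "fileInfo": file_info,
}
```

Deriving both lists from the same key order is what keeps the indexes aligned.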

Then write the expression like so:

MATCH (e:event {id:{id}}),(r:resource)
WHERE r.md5 IN {used}
FOREACH(
  idx in RANGE(0,SIZE({fileMD5})-1) |
  FOREACH( 
    filePath IN CASE WHEN r.md5 = {fileMD5}[idx] THEN [{fileInfo}[idx]] ELSE [] END |
    CREATE UNIQUE (r)-[:USED_BY {filePath:filePath }]->(e)
  )
)

I wrote this question 8 hours ago, which means these few little lines took 8 hours of hair-pulling to get. I hope someone else finds it useful :P

EDIT: A bit of an explanation as to how/why it works...

The first FOREACH iterates a RANGE from 0 to (the length of either the key or value array -1), with the idx variable being the result. This is a common trick for iterating two arrays of the same length simultaneously.

The second FOREACH is a bit more complicated. It iterates an array, which is created by the CASE, with the result being called filePath. The CASE however only ever returns either an array with nothing in it [], or an array with the value we want to set in our CREATE. So depending on the CASE, the FOREACH will either do something once, or do nothing at all.

The CASE is very simple. When the index pulls out a key that matches what we want (r.md5 = {fileMD5}[idx]) then it returns an array with one value - the value in the value array using the same index as what matched in the key array.
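
For readers who find the nested FOREACH hard to follow, here is the same control flow as a plain Python analogy. It only mimics the logic for a single matched resource node and does not talk to Neo4j; the function name `relationships_to_create` and the tuple it records are made up for illustration.

```python
def relationships_to_create(resource_md5, file_md5, file_info):
    """Mimic the FOREACH/CASE logic for one matched resource node."""
    created = []
    # Outer FOREACH: idx runs over RANGE(0, SIZE(fileMD5) - 1).
    for idx in range(len(file_md5)):
        # CASE: a one-element list on a match, an empty list otherwise.
        matches = [file_info[idx]] if resource_md5 == file_md5[idx] else []
        # Inner FOREACH: the body runs once or not at all.
        for file_path in matches:
            created.append((resource_md5, "USED_BY", file_path))
    return created


file_md5 = ["5fb1be1279031c1f1c65a928eb823e51",
            "0aab9f8e81684ec778f8c0c5717f37c2"]
file_info = ["yolo", "swag"]

print(relationships_to_create("0aab9f8e81684ec778f8c0c5717f37c2",
                              file_md5, file_info))
# prints [('0aab9f8e81684ec778f8c0c5717f37c2', 'USED_BY', 'swag')]
```

Iterating over a list that is either empty or has one element is simply a way to express "if" in a place where only a loop is allowed.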

Question:

The joern documentation says:

It is possible to access the graph database directly from your scripts by loading the database into memory on script startup.

How can you do that?

After running java -jar $JOERN/bin/joern.jar $CodeDirectory over my code, a Neo4J database directory (.joernIndex) is created with all these .id- and .db-files. Is it possible to access my code (with python-joern) without running a neo4j server? (Is the server necessary?)


Answer:

The web interface way to use the Joern database is documented here:

http://joern.readthedocs.io/en/latest/import.html

And the python-joern interface is documented here:

http://joern.readthedocs.io/en/latest/access.html#python-joern-api

And the program:

from joern.all import JoernSteps

j = JoernSteps()

# Point python-joern at the REST endpoint of the running Neo4j server
j.setGraphDbURL('http://localhost:7474/db/data/')

# j.addStepsDir('Use this to inject utility traversals')

j.connectToDatabase()

# Queries can be written in either Gremlin or Cypher
res = j.runGremlinQuery('getFunctionsByName("main")')
res = j.runCypherQuery('...')

for r in res: print r

This is basically the URL-based way of talking to the Neo4j server, and it is what is called Joern's "REST API".

Now, if you want to access the database "directly", you can do so with a Java program, as shown here:

Loading all Neo4J db to RAM

Or with Python, as shown here:

https://neo4j.com/developer/python/

https://marcobonzanini.com/2015/04/06/getting-started-with-neo4j-and-python/

But the bottom line is that you still have to start the Neo4j database server, and your program talks to that server through a Neo4j driver, which is what makes network-based communication possible.

If instead you want to load the database files directly, parse them yourself, and extract the data, that is going to be hard.
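
To make the network-based route concrete, here is a minimal sketch using only the Python standard library. It assumes a Neo4j 2.x/3.x server that exposes the transactional Cypher endpoint under `/db/data/transaction/commit` (newer versions use a different path); the helper name `build_cypher_request` is hypothetical. The request is only built, not sent, since sending it requires a running server.

```python
import json
import urllib.request


def build_cypher_request(base_url, statement, parameters=None):
    """Build (but do not send) a POST request carrying one Cypher statement."""
    payload = {"statements": [{"statement": statement,
                               "parameters": parameters or {}}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/transaction/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


req = build_cypher_request("http://localhost:7474/db/data/",
                           "MATCH (n) RETURN count(n)")
print(req.full_url)
# prints http://localhost:7474/db/data/transaction/commit

# With a server running, this would send the query and return the JSON reply:
# response = json.load(urllib.request.urlopen(req))
```

Either way, the database files in .joernIndex are read by the server process, not by your script.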