Hot questions for Using Neo4j in heap

Question:

I have a problem with Neo4j. I don't know if the problem is my query or something else.


Intro

I have to build an application that stores bus/train routes. This is my schema:

Nodes:

  • Organization: a company that has routes, buses, etc.
  • Route: a bus route, like Paris - Berlin.
  • Vehicle (Bus in this case): a physical bus with a unique license plate.
  • Stop: a point on a map with latitude and longitude.

Important Relationships:

  • NEXT: This is a really important relationship.

The NEXT relationship contains these properties:

  • startHour
  • startMinutes
  • endHour
  • endMinutes
  • dayOfWeek (from 0 to 6: Sun, Mon, etc.)
  • vehicleId


Problem

My query is:

MATCH (s1:Stop {id: {departureStopId}}), (s2:Stop {id: {arrivalStopId}})
OPTIONAL MATCH (s1)-[nexts:NEXT*]->(s2)
WHERE ALL(i in nexts WHERE toInt(i.dayOfWeek) = {dayOfWeek} AND toInt(i.startHour) >= {hour})
RETURN nexts
LIMIT 10

For example: I want to find all NEXT relationships where dayOfWeek is Sunday (0) and the startHour property is > 11.

After that I usually parse and validate the final object on my Node.js backend.

This worked at the start, with 1k relationships. Now I have 10k relationships and my queries either hit a timeout or take around 30 seconds, which is far too long. I have no idea how to solve this. I run Neo4j with Docker, and I tried to read the settings docs, but I have no idea how Java works.

Can you help me guys?


UPDATE

Thank you all, guys! For now I solved it with "allShortestPaths", but I think I will rename all relationships (as Michael Hunger said).


Answer:

Have you tried:

MATCH p = allShortestPaths((s1:Stop {id: {departureStopId}})-[:NEXT*]->(s2:Stop {id: {arrivalStopId}}))
WHERE ALL(i IN rels(p) WHERE toInt(i.dayOfWeek) = {dayOfWeek} AND toInt(i.startHour) >= {hour})
RETURN rels(p) AS nexts
LIMIT 10

This should use the fast shortest path algorithm because:

Planning shortest paths in Cypher can lead to different query plans depending on the predicates that need to be evaluated. Internally, Neo4j will use a fast bidirectional breadth-first search algorithm if the predicates can be evaluated whilst searching for the path.

See https://neo4j.com/docs/developer-manual/current/cypher/execution-plans/shortestpath-planning/#_shortest_path_with_fast_algorithm for more details.

Question:

I'm trying to understand the Neo4j object cache by doing some investigation into it. My first impression of the object cache comes from the slides at this link: http://www.slideshare.net/thobe/an-overview-of-neo4j-internals

Specifically, the Node/Relationship objects in the cache should look like slide 9 or 15/42. To verify this, I wrote a simple server script using existing graph database contents. My approach is to look at the starting virtual address of the node/relationship object using sun.misc.Unsafe. The program for obtaining the virtual address is from the following link: How can I get the memory location of a object in java?

public static long addressOf(Object o) throws Exception {
    // Put the object into an array so its reference can be read back out
    // of the array slot as a raw integer/long value.
    Object[] array = new Object[] { o };

    long baseOffset = unsafe.arrayBaseOffset(Object[].class);
    int addressSize = unsafe.addressSize();
    long objectAddress;
    switch (addressSize) {
    case 4:
        objectAddress = unsafe.getInt(array, baseOffset);
        break;
    case 8:
        objectAddress = unsafe.getLong(array, baseOffset);
        break;
    default:
        throw new Error("unsupported address size: " + addressSize);
    }
    return (objectAddress);
}

And in the Neo4j server script (my main() class), I get nodes by id and print their addresses in the following way:

void checkAddr(){
    nodeAddr(0);
    nodeAddr(1);
    nodeAddr(2);
}

void nodeAddr(int n){
    Node oneNode = graphDb.getNodeById(n);
    Node[] array1 = {oneNode};

    try {
        long address = UnsafeUtil.addressOf(array1);
        System.out.println("Addess: " + address);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

To begin with, I tried the soft cache provider, which is the default. The addresses printed for node objects 0, 1, and 2 are:

Address: 4168500044
Address: 4168502383
Address: 4168502753

Therefore, using (second address - first address) and (third address - second address), I can work out exactly how much space a node takes. In this case, the first node object takes 2339 B and the second takes 370 B.

Then, to see the impact of disabling the object cache, I applied the NoCacheProvider setting:

setConfig(GraphDatabaseSettings.cache_type,NoCacheProvider.NAME)

The addresses printed are:

Address: 4168488391
Address: 4168490708
Address: 4168491056

The offsets, calculated as in the first case, are: the first node object takes 2317 B and the second takes 348 B.

Here comes my problem:

  1. Since I'm using the same graph and doing read-only queries, why does the size of the same node object change?

  2. When I disabled the object cache, why do the address offsets look the same as when the object cache exists? For example, in the node store file a single node takes 9 bytes, which is not what I see in my experiment. If the way I'm getting the node object is problematic, how can I obtain the virtual address correctly? And is there any way to know specifically where the mmapped node file resides in memory?

  3. How can I know exactly what is stored in a node object? When I looked at Node.class at this link: https://github.com/neo4j/neo4j/blob/1.9.8/community/kernel/src/main/java/org/neo4j/graphdb/Node.java it doesn't seem that a node object looks the way it does in the presentation slides; it is rather just a group of methods exposed by a node object. Furthermore, is a node object brought into memory as a whole, at once, both with and without the object cache?


Answer:

The Node object is not what Neo4j stores in the "object cache", so you are not going to gain much insight into the caching of Neo4j by looking at those instances. The implementations of Node that Neo4j gives you are instances of a class called NodeProxy, and they are as small as they can possibly be (two fields: the internal id and a reference to the database). These just serve as your handle on the node for performing operations around that node in the database. The objects stored in the "object cache" are instances of a class called NodeImpl (and despite the name they do not implement the Node interface). The NodeImpl objects have the shape that's outlined on the 15th slide (with page number 9 within the slide) of that presentation. Well, they roughly have that shape; Neo4j has evolved since I made those slides.

Neo4j's evolution has also changed the number of bytes that node records occupy on disk. Neo4j 2.0 and later have slightly larger node records than those slides present. If you are interested in the layout of those records, look at the NodeRecord class, then start from the NodeStore class and work "downwards" into its dependencies to find the memory mapping.

Besides looking at the wrong object for seeing the difference between cache approaches in Neo4j, your approach to measuring is flawed. Comparing the addresses of objects tells you nothing about the size of those objects. The JVM makes no guarantee that two objects allocated one after the other (in time) will reside adjacently in memory, and even if the JVM did use such an allocation policy, Neo4j might have allocated multiple objects in between the allocations of the two objects you are comparing. Then there is the garbage collector, which might have moved the objects around in between you getting the address of one object and the address of the next. Thus, looking at the addresses of objects in Java is pretty much never useful for anything. For a better approach to measuring the size of an object in Java, take a look at the Java Object Layout (JOL) utility, or use the Instrumentation.getObjectSize(...) method from a Java agent.
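
For completeness, here is a minimal sketch of the Java-agent approach; the class name, jar name, and manifest entry are hypothetical, and only Instrumentation.getObjectSize(...) is the real API:

import java.lang.instrument.Instrumentation;

// Hypothetical minimal agent. Package it as sizeagent.jar with a manifest
// entry "Premain-Class: SizeAgent", then run: java -javaagent:sizeagent.jar ...
public class SizeAgent {

    private static volatile Instrumentation instrumentation;

    // Invoked by the JVM before main() when the agent is attached.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size of one object in bytes; referenced objects are not included.
    public static long sizeOf(Object o) {
        if (instrumentation == null) {
            throw new IllegalStateException("agent not attached via -javaagent");
        }
        return instrumentation.getObjectSize(o);
    }
}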

To answer your questions as stated:

  1. The sizes of the node objects are not changing; their addresses are simply not guaranteed to be the same between runs. As described above, you cannot rely on object addresses to compute object sizes.

  2. Since you are looking at NodeProxy objects, they will look the same regardless of which caching strategy Neo4j uses. In order to look at the NodeImpl objects you have to dig quite deep into the internals of Neo4j. Since it looks like you are using Neo4j 1.9, you would cast the GraphDatabaseService instance that you have to GraphDatabaseAPI (an interface that is internal to the implementation), then invoke the getNodeManager() method on that object. From the NodeManager you can call getNodeIfCached( node.getId() ) to get a NodeImpl object. Please note that this API will not be compatible between versions of Neo4j, and using it is one of those "warranty void if seal broken" kind of situations (see the sketch after this list).

  3. Look at the source code for NodeImpl instead. As for when and how data is brought into the cache, Neo4j tries to be lazy about that, loading only the data you use. If you get the relationships of a node, those will be loaded into the cache, and if you get properties, those will be loaded into the cache. If you only get relationships, the properties will never be loaded, and vice versa.
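
As a rough sketch, using only the internal 1.9 calls named above (expect it to break on any other version):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.kernel.GraphDatabaseAPI;
import org.neo4j.kernel.impl.core.NodeImpl;

// Internal Neo4j 1.9 API: "warranty void if seal broken".
static NodeImpl cachedNodeOrNull(GraphDatabaseService graphDb, Node node) {
    // GraphDatabaseAPI is the implementation-internal interface behind
    // the public GraphDatabaseService.
    GraphDatabaseAPI api = (GraphDatabaseAPI) graphDb;
    // Returns the cached NodeImpl, or null if the node is not in the cache.
    return api.getNodeManager().getNodeIfCached(node.getId());
}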

Question:

I have successfully migrated the DBLP dataset into a Neo4j database, and I use neo4j-shell for running Cypher queries. The database has millions of nodes and relationships between publications and authors. Now, when I try to run a query on the Neo4j database, it takes 10 to 12 hours of processing and then ends with this error:

Error occurred in server thread; nested exception is : java.lang.OutOfMemoryError: Java heap space

I am using Neo4j Community Edition 2.2.3 and JDK 1.7, on a machine with 8 GB of memory and a Core i7 processor.

Query :

neo4j-sh (?)$ MATCH (p:`publication`)-[:`publishedby`]->(a:`author`)
RETURN p.year, p.type, a.id, count(*) order by a.id desc LIMIT 25;

Experts, please advise me on any way out of this exception.


Answer:

As your dataset is public, it would be very helpful if you could share your database.

In general, you are computing many millions or billions of paths, which you aggregate after the fact; that just takes a while. Combined with probably too little memory and a slow disk, it takes a long time to load the data from disk.

This is a global graph query; you can see that if you run it prefixed with PROFILE.
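
For example, prefixing the original query:

PROFILE MATCH (p:`publication`)-[:`publishedby`]->(a:`author`)
RETURN p.year, p.type, a.id, count(*) order by a.id desc LIMIT 25;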

Make sure your id property is numeric!

I would change the query like this:

// this is the expensive operation, to order millions of authors by id
// still, do it and take the top 25
MATCH (a:author) WITH a order by a.id LIMIT 25
// find publications for the top 25 authors
MATCH (a)<-[:publishedby]-(p)
// return aggregation
RETURN a.id, p.year, p.type, count(*)
LIMIT 25;

To start neo4j-shell with sensible memory settings:

  • stop the server
  • edit conf/neo4j-wrapper.conf: set wrapper.java.initmemory and wrapper.java.maxmemory to 4000 (see the snippet below)
  • edit conf/neo4j.properties: set dbms.pagecache.memory=3G
  • start the server, run bin/neo4j-shell
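
With those edits in place, the relevant lines in the two files would look roughly like this (4000 MB of heap and a 3G page cache are just a starting point for an 8 GB machine; tune as needed):

# conf/neo4j-wrapper.conf
wrapper.java.initmemory=4000
wrapper.java.maxmemory=4000

# conf/neo4j.properties
dbms.pagecache.memory=3G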

If you run neo4j-shell in standalone mode, stop the server and use this:

export JAVA_OPTS="-Xmx4000M -Xms4000M -Xmn1000M" 
bin/neo4j-shell -path data/graph.db -config conf/neo4j.properties

Question:

I'm looking for a solution to a Java heap space memory error. It's starting to wear me out, so I'd like some help finding where I can set the Java heap space value. I'm working on the rabbithole project and I don't know where to find the Java heap space setting. I'm also developing in Eclipse, and I already changed the max memory value, but nothing seems to work.

Is there anyone who can help me as soon as possible?

Thank you all


Answer:

I assume you use mvn jetty:run to start rabbithole. In this case you can use the MAVEN_OPTS environment variable to pass in additional JVM parameters. E.g., to set the heap to 4 GB (both maximum and initial size), use:

MAVEN_OPTS="-Xmx4G -Xms4G" mvn jetty:run

Question:

Does anyone know how to increase the Java heap size on Neo4j?

I'm loading a CSV file into a Neo4j database and I can't load it because I need to increase the Java heap size. I checked that I have more than 100 GB of free disk space, and I also have 16 GB of RAM. My jar file is located at C:\Program Files\Neo4j CE 3.1.4\bin\neo4j-desktop-3.1.4.jar


Answer:

You can edit neo4j.conf, which is located in C:\Program Files\Neo4j CE 3.1.4\conf\, by uncommenting and setting these lines:

# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size.
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m

# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the max Java heap size.
#dbms.memory.pagecache.size=10g

Also know that if you are using the LOAD CSV functionality, you can batch your transactions with the USING PERIODIC COMMIT prefix. Example from the docs:

USING PERIODIC COMMIT 1000
LOAD CSV FROM 'https://neo4j.com/docs/developer-manual/3.1/csv/artists.csv' AS line
CREATE (:Artist { name: line[1], year: toInt(line[2])})

Question:

I am trying to test some queries on Neo4j databases that differ in the amount of data. If I test the queries on a small amount of data, everything goes right and execution time is small, but when I execute the queries on a database with 2794 nodes and 94863 relationships, it takes a long time and then the Neo4j API fails with: Java heap space (Neo.DatabaseError.General.UnknownFailure). First query:

MATCH (u1:User)-[r1:Rated]->(m:Movie)<-[r2:Rated]-(u2:User)
WITH 1.0*SUM(r1.Rate)/count(r1) as pX,
     1.0*SUM(r2.Rate)/count(r2) as pY, u1, u2
MATCH (u1:User)-[r1:Rated]->(m:Movie)<-[r2:Rated]-(u2:User)
WITH SUM((r1.Rate-pX)*(r2.Rate-pY)) as pomProm,
     SQRT(SUM((r1.Rate-pX)^2)) as sumX,
     SQRT(SUM((r2.Rate-pY)^2)) as sumY, pX, pY, u1, u2
CREATE UNIQUE (u1)-[s:SIMILARITY1]-(u2)
SET s.value = pomProm / (sumX * sumY)

And the second query:

MATCH (u1:User)-[r1:Rated]->(m:Movie)<-[r2:Rated]-(u2:User)
WITH SUM(r1.Rate * r2.Rate) AS pomProm,
     SQRT(REDUCE(r1Pom = 0, i IN COLLECT(r1.Rate) | r1Pom + toInt(i^2))) AS r1V,
     SQRT(REDUCE(r2Pom = 0, j IN COLLECT(r2.Rate) | r2Pom + toInt(j^2))) AS r2V,
     u1, u2
CREATE UNIQUE (u1)-[s:SIMILARITY2]-(u2)
SET s.value = pomProm / (r1V * r2V)

The data in the database is generated by the following Java code:

public enum Labels implements Label {
    Movie, User
}

public enum RelationshipLabels implements RelationshipType {
    Rated
}

public static void main(String[] args) throws IOException, BiffException {
    Workbook workbook = Workbook.getWorkbook(new File("C:/Users/User/Desktop/DP/dvdlist.xls"));
    Workbook names = Workbook.getWorkbook(new File("C:/Users/User/Desktop/DP/names.xls"));
    String path = new String("C:/Users/User/Documents/Neo4j/test7.graphDatabase");
    GraphDatabaseFactory dbFactory = new GraphDatabaseFactory();
    GraphDatabaseService db = dbFactory.newEmbeddedDatabase(path);
    int countMovies = 0;
    int numberOfSheets = workbook.getNumberOfSheets();
    IndexDefinition indexDefinition;
    try (Transaction tx = db.beginTx()) {
        Schema schema = db.schema();
        indexDefinition = schema.indexFor(DynamicLabel.label(Labels.Movie.toString()))
                .on("Name")
                .create();
        tx.success();
    }
    try (Transaction tx = db.beginTx()) {
        Schema schema = db.schema();
        indexDefinition = schema.indexFor(DynamicLabel.label(Labels.Movie.toString()))
                .on("Genre")
                .create();
        tx.success();
    }
    try (Transaction tx = db.beginTx()) {
        Schema schema = db.schema();
        indexDefinition = schema.indexFor(DynamicLabel.label(Labels.User.toString()))
                .on("Name")
                .create();
        tx.success();
    }
    try (Transaction tx = db.beginTx()) {

        for (int i = 0; i < numberOfSheets; i++) {
            Sheet sheet = workbook.getSheet(i);
            int numberOfRows = 6000;//sheet.getRows();
            for (int j = 1; j < numberOfRows; j++) {
                Cell cell1 = sheet.getCell(0, j);
                Cell cell2 = sheet.getCell(9, j);
                Node movie = db.createNode(Labels.Movie);
                movie.setProperty("Name", cell1.getContents());
                movie.setProperty("Genre", cell2.getContents());

                countMovies++;

            }

        }
        tx.success();
    } catch (Exception e) {
        System.out.println("Something goes wrong!");
    }

    Random random = new Random();
    int countUsers = 0;
    Sheet sheetNames = names.getSheet(0);
    Cell cell;
    Node user;

    int numberOfUsers = 1500;//sheetNames.getRows();
    for (int i = 0; i < numberOfUsers; i++) {
        cell = sheetNames.getCell(0, i);
        try (Transaction tx = db.beginTx()) {
            user = db.createNode(Labels.User);
            user.setProperty("Name", cell.getContents());
            List<Integer> listForUser = new ArrayList<>();

            for (int x = 0; x < 1000; x++) {
                int j = random.nextInt(countMovies);
                if (!listForUser.isEmpty()) {
                    if (!listForUser.contains(j)) {
                        listForUser.add(j);
                    }
                } else {
                    listForUser.add(j);
                }
            }
            for (int j = 0; j < listForUser.size(); j++) {
                Node movies = db.getNodeById(listForUser.get(j));
                int rate = 0;

                rate = random.nextInt(10) + 1;

                Relationship relationship = user.createRelationshipTo(movies, RelationshipLabels.Rated);
                relationship.setProperty("Rate", rate);

            }
            System.out.println("Number of user: " + countUsers);
            tx.success();
        } catch (Exception e) {
            System.out.println("Something goes wrong!");
        }
        countUsers++;
    }

    workbook.close();
}

}

Does anyone know how to solve this issue? Or is there some workaround to get results from these queries on a database with a large amount of data? Or some query or settings improvement? I would really appreciate any help.


Answer:

You may need to configure the amount of memory available to Neo4j. You can configure the Neo4j server heap size by editing conf/neo4j-wrapper.conf:

wrapper.java.maxmemory=NUMBER_OF_MB_HERE

See this page for more info.

However, looking at your queries (which perform graph-global, all-pairs operations), you might want to consider running them in batches. For example:

// Find users with overlapping movie ratings
MATCH (u1:User)-[:RATED]->(:Movie)<-[:RATED]-(u2:User)
// only for users whose similarity has not yet been calculated
WHERE NOT exists((u1)-[:SIMILARITY]-(u2))
// consider only up to 50 pairs of users
WITH u1, u2 LIMIT 50
// compute similarity metric and set SIMILARITY relationship with coef
...

Then execute this query repeatedly until you have computed the similarity metric for all users with overlapping movie ratings.
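
If you are running embedded (as in your data-generation code), one way to drive this is a small loop that re-runs the batch until nothing is left to process. This is only a sketch: batchedQuery is assumed to be the batched query above, completed with your similarity computation and ending in RETURN count(*) AS processed.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

public class SimilarityBatcher {

    // Re-run the batched query until a batch processes zero remaining pairs.
    // batchedQuery is assumed to end with: RETURN count(*) AS processed
    static void computeAllSimilarities(GraphDatabaseService db, String batchedQuery) {
        while (true) {
            try (Transaction tx = db.beginTx()) {
                Result result = db.execute(batchedQuery);
                long processed = (Long) result.next().get("processed");
                tx.success();
                if (processed == 0) {
                    return; // every overlapping user pair now has its relationship
                }
            }
        }
    }
}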

Question:

I understand that explicit configuration of mapped memory, cache and heap is necessary when running Neo4j with large graphs.

Please give me some pointers on how I can change these settings. I realise you need to test with different settings, but what is a good starting point?

Neo4j Community version: 2.2 RC01, Java embedded database

Machine: 8GB RAM

Graph size: 20M nodes(5 properties), 220M edges(2 properties)


Answer:

See the manual for the config. For RC01 you only have to set the page cache size, e.g. to 2G:

dbms.pagecache.memory=2g

You can provide the settings via:

new GraphDatabaseFactory()
    .newEmbeddedDatabaseBuilder(PATH)
    .setConfig(config)
    .newGraphDatabase()

Heap is configured when you run your Java program, via JVM parameters.
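
For example, with illustrative sizes and a hypothetical main class:

java -Xms2g -Xmx2g -cp app.jar com.example.ImportApp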

Question:


Answer:

You'll want to take a look at using PERIODIC COMMIT when loading, or at using the import tool.
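
A USING PERIODIC COMMIT example appears earlier on this page. For an initial bulk load into a fresh store, the import tool is invoked along these lines (the CSV file names are placeholders, and the exact tool name and flags vary between Neo4j versions):

bin/neo4j-import --into data/graph.db --nodes nodes.csv --relationships rels.csv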