Hot questions about using Neo4j for import

Question:

Is there a good way to use the Neo4j Java API to migrate some data from one database to another? My use case is to load a few thousand nodes into a temporary database, do a bunch of transformations, then export the results to the main database and delete the temporary one.

I don't want to clobber the data in the destination db; this is an additive process. I see lots of people on the internet (e.g. here) saying "just copy the data directory to the new location", but of course that would clobber the destination.

UPDATE - I experimented with neo4j-shell -path tmpDir -c "DUMP MATCH n RETURN n;" | neo4j-shell -path dbDir -file -, but it's really horribly slow. Generating the output seems fast enough but slurping it back in is glacial, even on a fresh empty database.


Answer:

There are a number of options:

  1. You can just open two Neo4j databases in your Java code and use the Java API to transfer nodes and relationships from one to the other (a minimal sketch follows this list).

  2. At a lower level, for initial seeding, you can do the same with the batch-inserter APIs, as I did here: https://github.com/jexp/store-utils/tree/21

  3. You can export Cypher results to CSV (e.g. from the browser) and import them again using, for instance, LOAD CSV.

  4. You can use neo4j-shell-tools for some of those import/export tasks, e.g. exporting to GraphML or CSV and importing it back again.
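
For option 1, a minimal sketch of what the copy could look like with the 2.x-era embedded Java API (the store paths are placeholders, and the single big transaction is only reasonable for a few thousand nodes; larger graphs should be batched across several transactions):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.tooling.GlobalGraphOperations;

public class CopyGraph {
    public static void main(String[] args) {
        // "tmpDir" is the temporary store, "dbDir" the destination (placeholders).
        GraphDatabaseService source = new GraphDatabaseFactory().newEmbeddedDatabase("tmpDir");
        GraphDatabaseService target = new GraphDatabaseFactory().newEmbeddedDatabase("dbDir");

        try (Transaction sourceTx = source.beginTx();
             Transaction targetTx = target.beginTx()) {
            // Remember which new node corresponds to which old node id.
            Map<Long, Node> copies = new HashMap<>();

            // Copy nodes with their labels and properties.
            for (Node node : GlobalGraphOperations.at(source).getAllNodes()) {
                Node copy = target.createNode();
                for (Label label : node.getLabels()) {
                    copy.addLabel(DynamicLabel.label(label.name()));
                }
                for (String key : node.getPropertyKeys()) {
                    copy.setProperty(key, node.getProperty(key));
                }
                copies.put(node.getId(), copy);
            }

            // Copy relationships using the old-id -> new-node mapping.
            for (Relationship rel : GlobalGraphOperations.at(source).getAllRelationships()) {
                Node start = copies.get(rel.getStartNode().getId());
                Node end = copies.get(rel.getEndNode().getId());
                Relationship relCopy = start.createRelationshipTo(
                        end, DynamicRelationshipType.withName(rel.getType().name()));
                for (String key : rel.getPropertyKeys()) {
                    relCopy.setProperty(key, rel.getProperty(key));
                }
            }

            targetTx.success();
            sourceTx.success();
        }

        source.shutdown();
        target.shutdown();
    }
}

Since the process is additive, nothing existing in the destination is touched; only new nodes and relationships are created, and afterwards the temporary store directory can simply be deleted.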

Question:

I have to import a huge dataset, and I use the neo4j-import tool. However, my dataset has the structure below:

1,"lorem1", "ipsum1","foo1"
2,"lorem2", "ipsum2","foo2"
3,"lorem3", "ipsum3","\"

It throws an error when it reads "\". I know the backslash is an escape character, so the tool keeps waiting for a closing quotation mark. Is it possible to disable the special meaning of the backslash in the neo4j-import tool?


Answer:

I think this is an issue of the tool you used to export the CSV having a different idea of what the CSV standard is than Neo4j does. This is understandable, because the CSV format isn't well standardized ;)

According to neo4j-import --help:

--quote <quotation-character>
  Character to treat as quotation character for values in CSV data. The default
  option is ". Quotes inside quotes escaped like """Go away"", he said." and "\"Go
  away\", he said." are supported. If you have set "'" to be used as the quotation
  character, you could write the previous example like this instead: '"Go away",
  he said.'

So this means that Neo4j allows you to escape quotes both with double quoting ("") and with backslash escaping (\"). Normally what I've seen is one or the other, and of course both sides need to agree on what the format is.

I would guess that it might work if you escaped your backslashes like this:

1,"lorem1", "ipsum1","foo1"
2,"lorem2", "ipsum2","foo2"
3,"lorem3", "ipsum3","\\"

You could also use that --quote option to change your quote character (say to ' or even |). Of course, you would need to re-generate your CSV with that in mind.

But no, it doesn't look like there's a way to have Neo4j change the way that it interprets escaped quotes in CSV.
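
If re-generating the CSV isn't practical, a one-off preprocessing pass that doubles every backslash (as in the escaped example above) before running neo4j-import might be the simplest workaround. A rough Java sketch, where the file names are placeholders and the assumption is that backslashes only ever occur as literal data, never as intentional escapes:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EscapeBackslashes {
    public static void main(String[] args) throws IOException {
        // Double every backslash so neo4j-import reads "\" as a literal character
        // instead of the start of an escaped quote.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("input-escaped.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replace("\\", "\\\\"));
                out.newLine();
            }
        }
    }
}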

Question:

I'm trying to import data from CSVs using the import tool in Neo4j 2.2.0. I keep running into this error from the messages.log file:

2015-02-10 16:14:44.792+0000 INFO  [org.neo4j]: Import starting
2015-02-10 16:14:45.032+0000 INFO  [org.neo4j]: Creating new db @ C:\path\to\file\Neo4j\test.graphdb\neostore
2015-02-10 16:14:47.727+0000 ERROR [org.neo4j]: Error during import Missing header of type START_ID
java.lang.RuntimeException: Missing header of type START_ID
    at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:61)
    at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.anyStillExecuting(ExecutionSupervisor.java:70)
    at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.finishAwareSleep(ExecutionSupervisor.java:93)
    at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.supervise(ExecutionSupervisor.java:55)
    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:263)
    at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:153)
    at org.neo4j.tooling.ImportTool.main(ImportTool.java:213)
Caused by: org.neo4j.unsafe.impl.batchimport.input.MissingHeaderException: Missing header of type START_ID
    at org.neo4j.unsafe.impl.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.validateHeader(DataFactories.java:366)
    at org.neo4j.unsafe.impl.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.create(DataFactories.java:315)
    at org.neo4j.unsafe.impl.batchimport.input.csv.InputGroupsDeserializer.createNestedIterator(InputGroupsDeserializer.java:65)
    at org.neo4j.unsafe.impl.batchimport.input.csv.InputGroupsDeserializer.createNestedIterator(InputGroupsDeserializer.java:35)
    at org.neo4j.helpers.collection.NestingIterator.fetchNextOrNull(NestingIterator.java:67)
    at org.neo4j.helpers.collection.PrefetchingIterator.peek(PrefetchingIterator.java:60)
    at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:46)
    at org.neo4j.unsafe.impl.batchimport.staging.IteratorBatcherStep.nextOrNull(IteratorBatcherStep.java:41)
    at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:72)
    at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)

The first five lines of the three files that I'm trying to import look like this:

loctest.csv

LOCATION:ID;LOC_TYPE:int;NUM_MILE:int;STREET_PRE;STREETNAME;STREETTYPE;STREETSUF;APT_NO;X_STREET;:LABEL
895WTWELFTHST;1;895;W;TWELFTH;ST;;107;;LOCATION
145SFRANKLINST;1;145;S;FRANKLIN;ST;;;;LOCATION
11735GLACIERHWY;1;11735;;GLACIER;HWY;;;;LOCATION
MENDENHALL LOOPUNIVERSITY DRRDUNIVERSITY DR;2;;;MENDENHALL LOOPUNIVERSITY DR;RD;;;UNIVERSITY DR;LOCATION

zip5.csv

ZIP5:ID;ZIP4;:LABEL
99801;;ZIP5
99824;;ZIP5
99821;;ZIP5
99803;;ZIP5

locziptest.csv

:START_ID;CITY;:END_ID;:TYPE
895WTWELFTHST;JUNEAU;99801;CITY
145SFRANKLINST;JUNEAU;99801;CITY
11735GLACIERHWY;JUNEAU;99801;CITY
MENDENHALL LOOPUNIVERSITY DRRDUNIVERSITY DR;JUNEAU;99801;CITY

The offending file seems to be the relationships file (locziptest.csv), but the header looks like it is configured correctly. Is the issue with the lookup string in the ID field? Does it need to be entirely numeric?


Answer:

I gave this a try and imported your sample data using the arguments:

--into graph.db --nodes loctest.csv --nodes zip5.csv --relationships locziptest.csv --delimiter ; --array-delimiter "|"

Please don't quote the semicolons in the CSV files. I think the discussion was about quoting the semicolon on the command line, which doesn't need to be quoted either.

Question:

I'm trying to import a CSV file into Neo4j (Community Edition V 2.3.2). The CSV is structured like this:

id,title,value,updated
123456,"title 1",10,20160407
123457,"title 2",11,20160405

The CSV path is set within the Neo4j properties file.

When I use the following import statement

LOAD CSV WITH HEADERS FROM   
'file:///test.csv' AS line
CREATE (:Title_Node { title: line[1], identifier: toInt(line[0]), value: line[3]})

I receive the following error message:

WARNING: Expected 1 to be a java.lang.String, but it was a java.lang.Long

When I just query the test.csv file with

LOAD CSV WITH HEADERS FROM 'file:///test.csv'
AS line
RETURN line.title, line.id, line.value;

Cypher can access the data without any problem.

+------------------------------------+
| line.title | line.id  | line.value |
+------------------------------------+
| "title 1"  | "123456" | "10"       |
| "title 2"  | "123457" | "11"       |
+------------------------------------+

The effect occurs in the browser as well as in the shell.

I found the following question, Having `Neo.ClientError.Statement.InvalidType` in Neo4j, and tried the hints from the Neo4j link posted in its answer, but with little success. The CSV file itself seems to be fine structurally (UTF-8, no hidden entries, etc.).

Any help in solving this is greatly appreciated.

Best

Krid


Answer:

You're loading WITH HEADERS, so each line is keyed by the header fields rather than by position; use those field names in the import:

LOAD CSV WITH HEADERS FROM   
'file:///test.csv' AS line
CREATE (:Title_Node { title: line.title, identifier: line.id, value: line.value})
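
Note that LOAD CSV hands every field to Cypher as a string, so if you want identifier (and value) to be numeric, as the toInt(line[0]) call in your original statement suggests, convert them explicitly, e.g. identifier: toInt(line.id) (toInteger in newer Cypher versions).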

Question:

I am very new to Neo4j and I want to get started with an embedded Neo4j in a Java application. I am trying to create a HelloWorld application like the following: https://neo4j.com/docs/java-reference/current/#tutorials-java-embedded

You can find the source code here: https://github.com/neo4j/neo4j/blob/3.1/manual/embedded-examples/src/main/java/org/neo4j/examples/EmbeddedNeo4j.java

I created a new Maven project and added org.neo4j:neo4j 3.0.3 as a dependency. Unfortunately, I cannot import "org.neo4j.graphdb.factory.GraphDatabaseFactory"; all other imports seem to be fine. I then figured out that the import works with version "3.1.0-SNAPSHOT" of the neo4j dependency. Here is the relevant part of my pom file:

      <dependencies>
        <dependency>
          <groupId>org.neo4j</groupId>
          <artifactId>neo4j</artifactId>
          <version>3.1.0-SNAPSHOT</version>
        </dependency>
      </dependencies>

Because I want to use a stable version, I would like to make this work with 3.0.3 as well, but I cannot find which dependency provides this factory in that version, or how it is supposed to be done with 3.0.3. Can somebody provide information about this?


Answer:

The dependency you should include in your pom.xml is

<dependency>
   <groupId>org.neo4j</groupId>
   <artifactId>neo4j</artifactId>
   <version>3.0.3</version>
</dependency>

As far as I can see, you already included the right dependency, so I guess something went wrong during dependency resolution. Therefore, purge your local repository and resolve the dependency again with the following command:

mvn dependency:purge-local-repository -Dinclude=org.neo4j:neo4j

If it's still not working, check whether you are resolving the artifact from the Maven Central repository or from somewhere else.
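
One way to check what actually gets resolved is mvn dependency:tree -Dincludes=org.neo4j, which should list the neo4j artifact together with the version that ends up on the classpath; where it is downloaded from is determined by the <repositories> section of your pom.xml and your settings.xml (Maven Central by default).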

Question:

I have a very big XML file with ~50M lines.

I am trying to create a Neo4j Graph Database of the XML file.

I am using Java in NetBeans IDE to:

1) Import the XML data into the Java application.
2) Create a Neo4j graph database with the data.

For step 1, I am using a SAX parser, which gives me the data one XML tag at a time.

The XML tags are:

1) A conference paper (the outer tag).
2) The conference it belongs to (inner tag).
3) The authors of the conference paper (inner tag).

I need the following nodes and relationship while creating the Neo4j graph database.

1) Create a new node for each paper (duplicates cannot occur, as each paper is described only once).
2) Create a new node for each conference (no duplicates should exist in the graph).
3) Create a new node for each author (no duplicates should exist in the graph).

In the relationships, each paper should be connected to its conference and each author should be connected to the papers written by the author.

Example: (diagram of the intended graph model omitted)

I know this is a very specific question but I am not expecting a perfect answer to my question, I am just looking for approaches towards solving it.

I am completely new to Neo4j.

How should I go about this problem? I was advised to use batch insertion, but is it possible to use it while inserting one value (node) at a time and also checking the conditions and constraints in the graph DB during insertion?

This is what I am thinking of: if a node for the tag already exists (for example, if a conference node already exists, don't create a new one; just find the node by its ID and link the paper to it), reuse it; otherwise create it (create a new node for the conference and then link the paper to the new node). The same goes for papers and authors (if the author does not exist, create a new node and link it to the paper; if the author already exists, find that node and link it to the paper). How much time would this process take? Is it feasible to go with this approach?

What other options do I have towards solving this problem?

Any help would be greatly appreciated.

Thanks a lot in advance.


Answer:

[UPDATED]

Let's say your data has unique IDs for papers, authors, and conferences. A minimal neo4j data model might then look something like this (which mimics the model illustrated in your question):

(:Conf {id: 111, name: 'XYZ Conference 2016'})-[:HAS_PAPER]->
  (:Paper {id: 222, name: 'The Theory of Everything'})-[:HAS_AUTHOR]->
  (:Author {id: 333, name: 'Albert Einstein IV'})

If your neo4j client fills in 3 parameters with info on each paper like this:

{
  "conf": {"id": "111", "name": "XYZ Conference 2016"},
  "authors": [
    {"id": "333", "name": "Albert Einstein IV"},
    {"id": "444", "name": "Isaac Newton XVIII"}],
  "paper": {"id": "222", "name": "The Theory of Everything"}
}

Then the query for creating the nodes and relationships for a paper would look something like this:

MERGE (c:Conf {id: {conf}.id, name: {conf}.name} ) 
CREATE (c)-[:HAS_PAPER]->(p:Paper {paper})
FOREACH (x IN {authors} |
  MERGE (a:Author {id: x.id, name: x.name})
  CREATE (p)-[:HAS_AUTHOR]->(a))

NOTE 1: The above MERGE clauses assume that conference and author names never change. If they can change, then the name properties should be set in separate SET clauses, or you could get multiple nodes for the same ID.

NOTE 2: When concurrent updates are possible, it is also possible to get duplicate nodes with the same ID, even when everyone uses MERGE. Therefore, to prevent duplicate nodes, you should create uniqueness constraints for :Conf(id), :Author(id), and :Paper(id). neo4j will abort a query that violates such a constraint.

NOTE 3: The MERGE clause does not support setting all the properties directly from a "map", as the CREATE clause does, so MERGE clauses have to specify each property separately.
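
Since you're driving this from embedded Java, here is a rough sketch of how the uniqueness constraints from NOTE 2 and the parameterized query above could be wired together (3.x-era embedded API assumed; the SAX handling that builds the parameter maps, and all error handling, are omitted):

import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class PaperImporter {
    private final GraphDatabaseService db;

    public PaperImporter(File storeDir) {
        db = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);
        // Uniqueness constraints guard against duplicate Conf/Author/Paper nodes (see NOTE 2).
        // Created once when setting up a fresh store.
        try (Transaction tx = db.beginTx()) {
            db.execute("CREATE CONSTRAINT ON (c:Conf) ASSERT c.id IS UNIQUE");
            db.execute("CREATE CONSTRAINT ON (a:Author) ASSERT a.id IS UNIQUE");
            db.execute("CREATE CONSTRAINT ON (p:Paper) ASSERT p.id IS UNIQUE");
            tx.success();
        }
    }

    // Call this from the SAX handler whenever a complete paper element has been parsed.
    public void addPaper(Map<String, Object> conf,
                         List<Map<String, Object>> authors,
                         Map<String, Object> paper) {
        Map<String, Object> params = new HashMap<>();
        params.put("conf", conf);
        params.put("authors", authors);
        params.put("paper", paper);
        String query =
            "MERGE (c:Conf {id: {conf}.id, name: {conf}.name}) " +
            "CREATE (c)-[:HAS_PAPER]->(p:Paper {paper}) " +
            "FOREACH (x IN {authors} | " +
            "  MERGE (a:Author {id: x.id, name: x.name}) " +
            "  CREATE (p)-[:HAS_AUTHOR]->(a))";
        try (Transaction tx = db.beginTx()) {
            db.execute(query, params);
            tx.success();
        }
    }
}

One transaction per paper keeps the sketch simple, but for millions of papers it will be slow; grouping, say, a few thousand papers per transaction should speed things up considerably.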

Question:

I'm trying to import a ~52GB file of nodes into Neo4j using the import tool. I've run it twice, and I've tried getting rid of duplicates in the file using the Linux "uniq" command.

I'm running Neo4j on a dedicated server:

  • Ubuntu Server 16.04 "Xenial Xerus" LTS
  • RAM: 64GB
  • Hard drive: SoftRAID 3x2 TB Server
  • Processor: Intel Xeon E5-1620 quad-core (4 cores), 3.60 GHz

My config file settings:

  • dbms.memory.heap.initial_size=10G
  • dbms.memory.heap.max_size=20G
  • dbms.memory.pagecache.size=20G

The import command I'm using:

neo4j-admin import --database instaGraphPostPurge.db --nodes:User "/home/headers/graph_header.csv,/home/instanet/postpurge/postUniqNodes.net" --relationships:FOLLOWS "/home/headers/graph_relate_following.csv,/home/instanet/postpurge/following/graph1.net,/home/instanet/postpurge/following/graph2.net,/home/instanet/postpurge/following/graph3.net,/home/instanet/postpurge/following/graph4.net,/home/instanet/postpurge/following/graph5.net" --relationships:FOLLOWS "/home/headers/graph_relate_followedBY.csv,/home/instanet/postpurge/followedBy/graph1.net,/home/instanet/postpurge/followedBy/graph2.net,/home/instanet/postpurge/followedBy/graph3.net,/home/instanet/postpurge/followedBy/graph4.net,/home/instanet/postpurge/followedBy/graph5.net" --delimiter TAB --ignore-duplicate-nodes

The node import and sort seem to finish, and then I get this error:

" java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.input.InputException: Too many collisions: 2320721971"

I'm not too sure what the problem is; I've searched my node file for that ID but can't seem to find it.

I've included the output after the import.

Any help is appreciated, Thanks in advance.

IMPORT DONE in 15h 10m 10s 79ms. Data statistics is not available. Peak memory usage: 18.65 GB

    ******** DETAILS 2018-06-15 05:17:41.913+0000 ********

    Nodes
    [*Nodes---------------------------------------------------------------------------------------]2.36B
    Memory usage: 18.65 GB
    I/O throughput: 79.33 MB/s
    VM stop-the-world time: 631ms
    Duration: 47m 48s 468ms
    Done batches: 236827

    Prepare node index
    [*DETECT--------------------------------------------------------------------------------------]6.96B
    Memory usage: 29.67 GB
    Duration: 14h 22m 16s 475ms
    Done batches: 696830

    Environment information:
      Free physical memory: 3.32 GB
      Max VM memory: 13.99 GB
      Free VM memory: 285.40 MB
      VM stop-the-world time: 631ms
      Duration: 15h 10m 4s 943ms

java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.input.InputException: Too many collisions: 2320721971
    at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:150)
    at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:142)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:58)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: Too many collisions: 2320721971
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectAndMarkCollisions(EncodingIdMapper.java:451)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:234)
    at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
    ... 1 more
Error in input data
Caused by: Too many collisions: 2320721971

WARNING Import failed. The store files in /home/databases/instaGraphPostPurge.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
org.neo4j.unsafe.impl.batchimport.input.InputException: Too many collisions: 2320721971
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectAndMarkCollisions(EncodingIdMapper.java:451)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:234)
    at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
    at java.lang.Thread.run(Thread.java:748)
unexpected error: Too many collisions: 2320721971


Answer:

Too many collisions: 2320721971 tells you that the number of collisions is 2320721971, which is too many for the importer to handle. That isn't actually true any more in 3.4, where collision management has been improved and that limit is gone, but the importer still makes this check for some reason.

Now, this could be fixed, but > 2.3 billion collisions is quite a lot of collisions and points to some really dirty or not-quite-cleaned-up input data. Removing this check would make it work, but the amount of memory to store those collisions could be overwhelming too, so the import might still fail when trying to allocate that memory anyway.

The best you can do is to reduce the number of duplicate node IDs in your input somehow.
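
Note that uniq only removes adjacent, byte-identical lines, so non-adjacent repeats in an unsorted file, and rows that repeat an ID with different property values, both survive it. Below is a rough sketch of deduplicating on the node ID itself, matching the TAB delimiter from your import command; the output file name is made up, and it assumes the ID is the first column and that the distinct IDs fit in the JVM heap (with billions of rows you would sort on the ID column first and drop repeats while streaming instead):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DedupeByNodeId {
    public static void main(String[] args) throws IOException {
        // Keep only the first row seen for each node ID (first tab-separated column).
        Set<String> seen = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get("/home/instanet/postpurge/postUniqNodes.net"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get("/home/instanet/postpurge/postDedupedNodes.net"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.split("\t", 2)[0];
                if (seen.add(id)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}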