Hot questions for Using Amazon S3 in multithreading


Question:

I'm doing some POC work with Redshift, loading data from JSON files in S3 using the COPY command from a Java program. This POC is testing an initial data migration that we'd do to seed Redshift, not daily use. My data is split into about 7500 subfolders in S3, and I'd like to be able to insert the subfolders in parallel. Each subfolder contains about 250 JSON files, each with about 3000 rows to insert.

The single-threaded version of my class loads the files from one of my S3 subfolders in about 20 seconds (via a COPY command). However, when I introduce a second thread (each thread gets a Redshift DB connection from a BoneCP connection pool), each COPY command except the first takes about 40 seconds. When I run a query in Redshift to show all running queries, Redshift says that it's running two queries at the same time (as expected). However, it's as if the second query is really waiting for the first to complete before it starts work. I expected each COPY command to still take only 20 seconds. The Redshift console shows that I only get up to 60% CPU usage, whether running single- or double-threaded.

Could this be because I only have 1 node in my Redshift cluster? Or is Redshift unable to open multiple connections to S3 to get the data? I'd appreciate any tips for how to get some performance gains by running multi-threaded copy commands.


Answer:

Amazon Redshift loads data from Amazon S3 in parallel, utilising all nodes. From your test results, it would appear that running multiple COPY commands does not improve performance, since all nodes are already involved in the copy process.

For each table, always load as many files as possible in a single COPY command, rather than appending later. If you are loading multiple tables, it is likely best to do them sequentially (but your testing might find that loading multiple smaller tables can be done in parallel).
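As a rough illustration of "as many files as possible in a single COPY", here is a minimal Java sketch that builds one COPY statement over a shared S3 key prefix (COPY loads every object whose key starts with that prefix, so one statement can cover all subfolders) and, as an alternative, a manifest file listing explicit objects. The class name `CopyCommandSketch`, the table name, bucket, and IAM role ARN are all hypothetical placeholders, and the `IAM_ROLE` and `FORMAT AS JSON 'auto'` options are assumptions about how your cluster and data are set up, so adjust them to match your own COPY command.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of any AWS SDK or the original poster's code.
class CopyCommandSketch {

    // One COPY for every object under the prefix, instead of one COPY per subfolder.
    // Inputs are concatenated as-is, so they are assumed to be trusted values.
    static String buildPrefixCopy(String table, String s3Prefix, String iamRole) {
        return "COPY " + table
                + " FROM '" + s3Prefix + "'"
                + " IAM_ROLE '" + iamRole + "'"
                + " FORMAT AS JSON 'auto'";
    }

    // Manifest JSON for cases where the files do not share a usable prefix.
    // Upload this document to S3 and point COPY at it with the MANIFEST option.
    static String buildManifest(List<String> s3Urls) {
        StringBuilder sb = new StringBuilder("{\"entries\":[");
        for (int i = 0; i < s3Urls.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"url\":\"").append(s3Urls.get(i)).append("\",\"mandatory\":true}");
        }
        return sb.append("]}").toString();
    }

    // COPY that reads the list of files from a manifest object in S3.
    static String buildManifestCopy(String table, String manifestUrl, String iamRole) {
        return buildPrefixCopy(table, manifestUrl, iamRole) + " MANIFEST";
    }

    // Runs the statement on an existing JDBC connection (e.g. one taken from the pool).
    static void runCopy(Connection conn, String copySql) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(copySql);
        }
    }

    public static void main(String[] args) {
        String role = "arn:aws:iam::123456789012:role/ExampleRole"; // placeholder ARN
        System.out.println(buildPrefixCopy("events", "s3://example-bucket/seed/", role));
        System.out.println(buildManifest(Arrays.asList(
                "s3://example-bucket/seed/folder-0001/part-000.json",
                "s3://example-bucket/seed/folder-0002/part-000.json")));
    }
}
```

With the layout described in the question, pointing the prefix at the parent of the 7500 subfolders would hand the whole load to a single COPY and leave the parallelism to Redshift, rather than splitting it across client threads.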

Some references: