Hot questions for Using Amazon S3 in gzip

Question:

I have looked at both AWS S3 Java SDK - Download file help and Working with Zip and GZip files in Java.

While they show how to download files from S3 and how to work with GZipped files, respectively, they do not help with a GZipped file that is located in S3. How would I do this?

Currently I have:

try {
    AmazonS3 s3Client = new AmazonS3Client(
            new ProfileCredentialsProvider());
    String URL = downloadURL.getPrimitiveJavaObject(arg0[0].get());
    S3Object fileObj = s3Client.getObject(getBucket(URL), getFile(URL));
    BufferedReader fileIn = new BufferedReader(new InputStreamReader(
            fileObj.getObjectContent()));
    String fileContent = "";
    String line = fileIn.readLine();
    while (line != null){
        fileContent += line + "\n";
        line = fileIn.readLine();
    }
    fileObj.close();
    return fileContent;
} catch (IOException e) {
    e.printStackTrace();
    return "ERROR IOEXCEPTION";
}

Clearly, I am not handling the compressed nature of the file, and my output is:

����sU�3204�50�5010�20�24��L,(���O�V�M-.NLOU�R�U�����<s��<#�^�.wߐX�%w���������}C=�%�J3��.�����둚�S�ᜑ���ZQ�T�e��#sr�cdN#瘐:&�
S�BǔJ����P�<��

However, I cannot implement the example in the second question given above because the file is not located locally; it must first be downloaded from S3.

What should I do?


Answer:

I solved the issue by using a Scanner instead of reading the InputStream directly.

The Scanner takes a GZIPInputStream and reads the unzipped file line by line:

// oSummary is an S3ObjectSummary obtained from listing the bucket
fileObj = s3Client.getObject(new GetObjectRequest(oSummary.getBucketName(), oSummary.getKey()));
fileIn = new Scanner(new GZIPInputStream(fileObj.getObjectContent()));
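
For context, here is the same approach as a small self-contained method. The class and method names are illustrative, not from the original post; the S3 calls are the same SDK v1 calls used in the question.

import java.io.IOException;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

public class GzipS3Reader {
    // Downloads a gzip-compressed S3 object and returns its decompressed text.
    static String readGzippedObject(AmazonS3 s3Client, String bucket, String key) throws IOException {
        S3Object fileObj = s3Client.getObject(bucket, key);
        StringBuilder content = new StringBuilder();
        // GZIPInputStream decompresses on the fly; the Scanner then sees plain text.
        try (Scanner fileIn = new Scanner(new GZIPInputStream(fileObj.getObjectContent()))) {
            while (fileIn.hasNextLine()) {
                content.append(fileIn.nextLine()).append('\n');
            }
        }
        return content.toString();
    }
}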

Question:

Question

See updated question in edit section below

I'm trying to decompress large (~300 MB) gzipped files from Amazon S3 on the fly using GZIPInputStream, but it only outputs a portion of the file; however, if I download the file to the filesystem before decompressing, GZIPInputStream decompresses the entire file.

How can I get GZIPInputStream to decompress the entire HTTPInputStream and not just the first part of it?

What I've Tried

See the update in the edit section below.

I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, since everything is being treated as a binary stream; no parsing of the WET records in the file is happening at all).

The closest question I could find is GZIPInputStream is prematurely closed when reading from s3. The answer to that question was that some GZIP files are actually multiple appended GZIP files, and GZIPInputStream doesn't handle that well. However, if that is the case here, why does GZIPInputStream work fine on a local copy of the file?

Demonstration Code and Output

Below is a piece of sample code that demonstrates the problem I am seeing. I've tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.

Output
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample Code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);

        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();
        rbc.close();

        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        bytesRead = -1;
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results - these numbers should match but they don't
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }

}
Edit

Closed Stream and associated Channel in demonstration code as per comment by @VGR.

UPDATE:

The problem does seem to be something specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:

// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

test(originals3, "originalhost.txt");
test(rezippeds3, "rezippedhost.txt");

URL rezippeds3 points to the WET archive file that I downloaded, decompressed and recompressed, then re-uploaded to S3. You will see the following output:

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
-----
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt

As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both files, the original and the recompressed, onto a traditional Apache web server and was able to replicate the results, so S3 doesn't seem to have anything to do with the problem.

So. I have a new question.

New Question

Why would a FileInputStream behave differently than an HTTPInputStream when reading the same content? If it is the exact same file, why does:

new GZIPInputStream(urlConnection.getInputStream());

behave any differently than

new GZIPInputStream(new FileInputStream("./test.wet.gz"));

?? Isn't an input stream just an input stream??


Answer:

Root Cause Discussion

It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the stream, whereas HTTPInputStream.available() returns the number of bytes that can be read before a blocking I/O request has to be made to refill the buffer (see the Java docs for more information).

The problem is that GZIPInputStream uses the output of .available() to determine whether there might be an additional GZIP file in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from the OpenJDK source file GZIPInputStream.java, method readTrailer():

   if (this.in.available() > 0 || n > 26) {

If the HTTPInputStream read buffer empties right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds with 0 because it would need to go out to the network to refill the buffer, and so GZIPInputStream treats the stream as complete and closes prematurely.

The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer will empty right at the end of one of those concatenated GZIP files and GZIPInputStream will close prematurely. This explains the problem demonstrated in the question.
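
The failure can be reproduced locally, without S3 or multi-hundred-megabyte files. The sketch below is not from the original answer and its class and method names are illustrative: it concatenates two small gzip members and reads them back through a wrapper stream whose available() always returns 0 and whose reads never cross the member boundary, which is the situation an emptied HTTP buffer creates. The plain ByteArrayInputStream yields both members, while the wrapped stream stops after the first.

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatenatedGzipDemo {
    // Simulates an HTTP stream whose buffer drains exactly at a gzip member
    // boundary: available() reports 0 and reads never span the boundary.
    static class BoundaryStream extends InputStream {
        private final byte[] data;
        private final int boundary; // index where the second member starts
        private int pos = 0;

        BoundaryStream(byte[] data, int boundary) {
            this.data = data;
            this.boundary = boundary;
        }

        @Override public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override public int read(byte[] b, int off, int len) {
            if (pos >= data.length) return -1;
            int limit = pos < boundary ? boundary : data.length;
            int n = Math.min(len, limit - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }

        @Override public int available() {
            return 0; // "buffer is empty, a blocking read would be needed"
        }
    }

    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes("UTF-8"));
        }
        return out.toByteArray();
    }

    static long decompressedLength(InputStream in) throws IOException {
        long total = 0;
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = gz.read(buf)) != -1) total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzipMember("first member ............\n");
        byte[] second = gzipMember("second member ...........\n");
        byte[] concatenated = new byte[first.length + second.length];
        System.arraycopy(first, 0, concatenated, 0, first.length);
        System.arraycopy(second, 0, concatenated, first.length, second.length);

        // Reads both members: available() > 0 after the first trailer.
        System.out.println("plain stream:    "
                + decompressedLength(new ByteArrayInputStream(concatenated)) + " bytes");
        // Stops after the first member: available() == 0 at the boundary.
        System.out.println("boundary stream: "
                + decompressedLength(new BoundaryStream(concatenated, first.length)) + " bytes");
    }
}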

Solution and Work Around

This GIST contains a patch to jdk8u152-b00 revision 12039 and two jtreg tests that remove the (in my humble opinion) incorrect reliance on .available().

If you cannot patch the JDK, a workaround is to make sure that available() always returns a value greater than 0, which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately, HTTPInputStream is a private class, so you cannot subclass it directly; instead, extend InputStream and wrap the HTTPInputStream. The code below demonstrates this workaround.

Demonstration Code and Output

Here is the output showing that when the HTTPInputStream is wrapped as discussed, GZIPInputStream produces identical results when reading the concatenated GZIP from a file and directly over HTTP.

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 451171329 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 453183600 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet

Here is the demonstration code from the question modified with an InputStream wrapper.

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    // Here is a wrapper class that wraps an InputStream
    // but always returns > 0 when .available() is called.
    // This will cause GZIPInputStream to always make another 
    // call to the InputStream to check for an additional 
    // concatenated GZIP file in the stream.
    public static class AvailableInputStream extends InputStream {
        private InputStream is;

        AvailableInputStream(InputStream inputstream) {
            is = inputstream;
        }

        public int read() throws IOException {
            return(is.read());
        }

        public int read(byte[] b) throws IOException {
            return(is.read(b));
        }

        public int read(byte[] b, int off, int len) throws IOException {
            return(is.read(b, off, len));
        }

        public void close() throws IOException {
            is.close();
        }

        public int available() throws IOException {
            // Always say that we have 1 more byte in the
            // buffer, even when we don't
            int a = is.available();
            if (a == 0) {
                return(1);
            } else {
                return(a);
            }
        }
    }



    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTP inputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        // Wrap the HTTPInputStream in our AvailableInputStream
        AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
        GZIPInputStream gzipishttp = new GZIPInputStream(ais);
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
        int buffersize = 1024;
        byte[] buffer = new byte[buffersize];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();
        rbc.close();

        // Now decompress the local file and count the number of bytes
        int bytesFromGZIPFile = 0;
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }

}

Question:

new BufferedReader(new InputStreamReader(
       new GZIPInputStream(s3Service.getObject(bucket, objectKey).getDataInputStream())))

creates a Reader that returns null from readLine() after ~100 lines if the file is larger than several MB. This is not reproducible with gzip files smaller than 1 MB. Does anybody know how to handle this?


Answer:

From the documentation of BufferedReader#readLine():

Returns:

A String containing the contents of the line, not including any line-termination characters, or null if the end of the stream has been reached

I would say it's pretty clear what this means: the end of the file/stream has been encountered and no more data is available.

A notable quirk of the GZIP format: multiple gzipped files can simply be appended to one another to create a larger file containing multiple gzipped members. It seems that GZIPInputStream only reads the first of those.

That also explains why it is working for "small files". Those contain only one zipped object, so the whole file is read.

Note: if GZIPInputStream could determine non-destructively that one gzip member has ended, you could simply open another GZIPInputStream on the same InputStream and read the remaining members.
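
As an aside (not part of the original answer), one widely used way to read all of the concatenated members without re-implementing that loop yourself is Apache Commons Compress, whose GzipCompressorInputStream can be asked to keep decompressing across member boundaries. A minimal sketch, reusing the s3Service, bucket and objectKey names from the question:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

// Reads every line of a gzip object that may contain multiple concatenated members.
// (The caller handles IOException.)
InputStream raw = s3Service.getObject(bucket, objectKey).getDataInputStream();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GzipCompressorInputStream(raw, true)))) { // true = decompress concatenated members
    String line;
    while ((line = reader.readLine()) != null) {
        // process the line
    }
}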

Question:

I am trying to use the code below to download and read data from a file, but it goes OOM, specifically while reading the file. The compressed S3 file is 22 MB; downloaded through the browser it is 650 MB. But when I monitor it through VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone help me find the reason for the high memory usage? Thanks.

public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
            System.out.println("Unzipping.............");
            GZIPInputStream gzis = null;
            try {
                gzis = new GZIPInputStream(in);
                InputStreamReader reader = new InputStreamReader(gzis);
                BufferedReader br = new BufferedReader(reader);
                double mb = 0;
                String readed;
                int i=0;
                while ((readed = br.readLine()) != null) {
                     mb = mb+readed.getBytes().length / (1024*1024);
                     i++;
                     if(i%100==0) {System.out.println(mb);}
                }


            } catch (IOException e) {
                e.printStackTrace();
                LOG.error("Invoked AWSUtils getS3Content : json ", e);
            } finally {
                closeStreams(gzis, in);
            }

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)


Answer:

This is a theory, but I can't think of any other reasons why your example would OOM.

Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.

Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very long line, then the StringBuffer is going to get very large.

  • Each text character in the uncompressed string becomes a char (2 bytes) in the char[] that backs the StringBuffer.

  • Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.

  • So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N characters of storage will be in use.

  • So a 650 MB line could easily turn into a heap demand of more than 6 x 650M bytes, since each char takes 2 bytes and both the old and the new arrays are live during the copy.

The other thing to note is that the 2 x N array has to be allocated as a single contiguous region of the heap.

Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.


So what is the solution?

It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

    public static String unzip(InputStream in) 
            throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        try (
            GZIPInputStream gzis = new GZIPInputStream(in);
            InputStreamReader reader = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(reader);
        ) {
            int ch;
            long i = 0;
            while ((ch = br.read()) >= 0) {
                 i++;
                 if (i % (100 * 1024 * 1024) == 0) {
                     System.out.println(i / (1024 * 1024));
                 }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        }
        // Nothing useful to return in this sketch; consider making the return type void.
        return null;
    }

Question:

I'd like to take my input stream and upload gzipped parts to s3 in a similar fashion to the multipart uploader. However, I want to store the individual file parts in S3 and not turn the parts into a single file.

To do so, I have created the following methods. But when I try to decompress each part, gzip throws an error and says: gzip: file_part_2.log.gz: not in gzip format.

I'm not sure whether I am compressing each part correctly.

If I re-initialise the GZIPOutputStream (gzip = new GZIPOutputStream(baos);) and call gzip.finish() after resetting the byte array output stream (baos.reset();), then I am able to decompress each part. I'm not sure why I need to do this; is there a similar reset for the GZIPOutputStream?

public void upload(String bucket, String key, InputStream is, int partSize) throws Exception
{
    String row;
    BufferedReader br = new BufferedReader(new InputStreamReader(is, ENCODING));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(baos);

    int partCounter = 0;
    int lineCounter = 0;
    while ((row = br.readLine()) != null) {
        if (baos.size() >= partSize) {
            partCounter = this.uploadChunk(bucket, key, baos, partCounter);

            baos.reset();
        }else if(!row.equals("")){
            row += '\n';
            gzip.write(row.getBytes(ENCODING));
            lineCounter++;
        }
    }

    gzip.finish();
    br.close();
    baos.close();

    if(lineCounter == 0){
        throw new Exception("Aborting upload, file contents is empty!");
    }

    //Final chunk
    if (baos.size() > 0) {
        this.uploadChunk(bucket, key, baos, partCounter);
    }
}

private int uploadChunk(String bucket, String key, ByteArrayOutputStream baos, int partCounter)
{
    ObjectMetadata metaData = new ObjectMetadata();
    metaData.setContentLength(baos.size());

    String[] path = key.split("/");
    String[] filename = path[path.length-1].split("\\.");

    filename[0] = filename[0]+"_part_"+partCounter;

    path[path.length-1] = String.join(".", filename);

    amazonS3.putObject(
            bucket,
            String.join("/", path),
            new ByteArrayInputStream(baos.toByteArray()),
            metaData
    );

    log.info("Upload chunk {}, size: {}", partCounter, baos.size());

    return partCounter+1;
}

Answer:

The problem is that you're using a single GZIPOutputStream for all chunks. So you're actually writing pieces of one large gzipped file, which would have to be recombined to be useful.

Making the minimal change to your existing code:

if (baos.size() >= partSize) {
    gzip.close();
    partCounter = this.uploadChunk(bucket, key, baos, partCounter);
    baos = new ByteArrayOutputStream();
    gzip = new GZIPOutputStream(baos);
}

You need to do the same at the end of the loop. Also, you shouldn't throw an exception if the line counter is 0: it's entirely possible that the file is exactly divisible into a set number of chunks.

To improve the code, I would wrap the GZIPOutputStream in an OutputStreamWriter and a BufferedWriter, so that you don't need to do the string-bytes conversion explicitly.

And lastly, don't use ByteArrayOutputStream.reset(). It doesn't save you anything over just creating a new stream, and opens the door for errors if you ever forget to reset.
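
Putting those suggestions together, here is a sketch of how the upload method might look. It reuses uploadChunk from the question, substitutes StandardCharsets.UTF_8 for the question's ENCODING constant, and is meant as an illustration rather than a drop-in replacement:

// requires java.io.*, java.nio.charset.StandardCharsets and java.util.zip.GZIPOutputStream
public void upload(String bucket, String key, InputStream is, int partSize) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));

    int partCounter = 0;
    boolean chunkHasData = false;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // The writer chain hides the explicit string-to-bytes conversion.
    Writer writer = new BufferedWriter(
            new OutputStreamWriter(new GZIPOutputStream(baos), StandardCharsets.UTF_8));

    String row;
    while ((row = br.readLine()) != null) {
        if (row.isEmpty()) {
            continue;
        }
        writer.write(row);
        writer.write('\n');
        chunkHasData = true;

        // Note: baos.size() trails the uncompressed input because of writer and
        // deflater buffering, so chunk sizes are approximate (as in the original).
        if (baos.size() >= partSize) {
            writer.close(); // finishes the gzip trailer for this chunk
            partCounter = uploadChunk(bucket, key, baos, partCounter);
            baos = new ByteArrayOutputStream(); // fresh stream instead of reset()
            writer = new BufferedWriter(
                    new OutputStreamWriter(new GZIPOutputStream(baos), StandardCharsets.UTF_8));
            chunkHasData = false;
        }
    }

    writer.close();
    br.close();

    // Upload whatever is left, skipping an empty trailing gzip member.
    if (chunkHasData) {
        uploadChunk(bucket, key, baos, partCounter);
    }
}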

Question:

I have to download many gzipped files stored on S3 like this:

crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz

To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/

I have to download and decompress the files, then assemble the content into a single RDD.

Something similar to this:

JavaRDD<String> text = 
    sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");

I want to do what this code does, but with Spark:

    for (String key : keys) {
        object = s3.getObject(new GetObjectRequest(bucketName, key));

        gzipStream = new GZIPInputStream(object.getObjectContent());
        decoder = new InputStreamReader(gzipStream);
        buffered = new BufferedReader(decoder);

        sitemaps = new ArrayList<>();

        String line = buffered.readLine();

        while (line != null) {
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }

Answer:

To read something from S3, you can do this:

sc.textFiles("s3n://path/to/dir")

If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory like this:

/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz

or even this:

/root
  f3.gz
  /a
    f1.gz
    f2.gz

then you should use a wildcard, like sc.textFile("s3n://path/to/dir/*"), and Spark will recursively find the files in dir and its subdirectories.

Beware of this, though. The wildcard will work, but you may get latency issues on S3 in production and may want to use the AmazonS3Client to retrieve the paths.
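
For concreteness, here is a minimal sketch using the Java Spark API and the Common Crawl layout from the question; the bucket name and glob pattern are assumptions based on the keys shown, and the app name is illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RobotsSitemaps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("robots-sitemaps");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The .gz files are decompressed transparently; the glob matches the
        // robotstxt WARC files in every segment of the crawl (path assumed
        // from the keys in the question).
        JavaRDD<String> lines = sc.textFile(
                "s3n://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/*/robotstxt/*.warc.gz");

        // Keep only the "Sitemap:" lines, mirroring the loop in the question.
        JavaRDD<String> sitemaps = lines.filter(line -> line.startsWith("Sitemap:"));

        System.out.println("sitemap lines: " + sitemaps.count());
        sc.stop();
    }
}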

Question:

I have a set of .gz compressed files in an S3 bucket. I want to download the CSV file inside each .gz file. I tried to extract the .gz file and put it into an S3Object. Now I need to extract the S3 object and download the CSV file inside it using Java. Please advise. This is the code I used; with it I am able to download the gz file, but I need to download the CSV file inside the gz.

S3Object object = s3Client.getObject("bucket","Location/file.gz");
final String encoding = null;
return ResponseEntity.ok(IOUtils.toString(object.getObjectContent(), encoding));

I need help unzipping the gz file in the S3Object and returning the decompressed contents in the response.


Answer:

The code below will convert your gzip file into plain data. I'm not 100% sure about your actual requirement, whether you want to display the content in the browser itself or send it as a "Save as" download, so I made a minimal change to your code, assuming your only problem is converting the gzip format to CSV data. I hope you can modify/enhance it to suit you best.

import java.util.zip.GZIPInputStream;

// ... your controller method continues from here
final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

S3Object object = s3.getObject("your-bucket", "file-path");

// Wrapping the object content in a GZIPInputStream decompresses it on the fly.
return ResponseEntity.ok(IOUtils.toString(new GZIPInputStream(object.getObjectContent())));
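
As a follow-up (not part of the original answer), if the goal is the "Save as" behavior mentioned above, the decompressed text can be returned with an explicit charset, content type and filename. A sketch assuming Spring's ResponseEntity and Commons IO, with the same placeholder bucket and key; the method name is illustrative:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.commons.io.IOUtils;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public ResponseEntity<String> downloadCsv() throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    S3Object object = s3.getObject("your-bucket", "file-path"); // placeholders from the answer

    // Decompress the gzip stream and decode it with an explicit charset.
    String csv = IOUtils.toString(new GZIPInputStream(object.getObjectContent()),
                                  StandardCharsets.UTF_8);

    // Content-Disposition prompts the browser to save the file instead of rendering it.
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"file.csv\"")
            .contentType(MediaType.parseMediaType("text/csv"))
            .body(csv);
}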