Hot questions for Using Amazon S3 in inputstream

Question:

I have a very large file (several GB) in AWS S3, and I only need a small number of lines in the file which satisfy a certain condition. I don't want to load the entire file in-memory and then search for and print those few lines - the memory load for this would be too high. The right way would be to only load those lines in-memory which are needed.

As per the AWS documentation, to read from a file:

S3Object fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());

private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}

Here we are using a BufferedReader. It is not clear to me what is happening underneath.

Are we making a network call to S3 each time we are reading a new line, and only keeping the current line in the buffer? Or is the entire file loaded in-memory and then read line-by-line by BufferedReader? Or is it somewhere in between?


Answer:

One answer to your question is already given in the documentation you linked:

Your network connection remains open until you read all of the data or close the input stream.

A BufferedReader doesn't know where the data it reads comes from, because you pass another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data to calls of read() or read(char[] buf).

The Reader you pass to the BufferedReader is, by the way, using another buffer of its own to do the conversion from a byte-based stream to a char-based reader. It works the same way as the BufferedReader: the internal buffer is filled by reading from the passed InputStream, which is the InputStream returned by your S3 client.

What exactly happens within this client when you read data from the stream is implementation dependent. One way would be to keep a single network connection open and read from it as needed; alternatively, the connection could be closed after a chunk of data has been read and a new one opened when you request the next chunk.

The documentation quoted above seems to say that we have the former situation here, so: no, calls to readLine do not each lead to a separate network call.

And to answer your other question: no, the BufferedReader, the InputStreamReader, and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place; otherwise the S3 client could simply return a byte[][] instead (to get around the limit of roughly 2^31 bytes per byte array).

Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, calling readLine will in fact read all the data into memory (and most likely cause an OutOfMemoryError). I assumed a "regular" text document while answering your question.
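
For illustration, a minimal sketch of the streaming approach described above, assuming the AWS SDK for Java v1; the filter condition and the class/method names are only placeholders for whatever condition the question has in mind:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class S3LineFilter {

    // Streams the object line by line; only the reader's buffer and the current
    // line are held in memory, never the whole object.
    static void printMatchingLines(AmazonS3 s3Client, String bucketName, String key) throws IOException {
        S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("ERROR")) { // placeholder condition
                    System.out.println(line);
                }
            }
        } // closing the reader also closes the underlying S3 stream
    }
}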

Question:

I have a service method which returns a File object from an AWS S3 bucket given a fileName. I would like to convert the S3Object to a File object and return it.

I do not want to use any temp location to download the file; I just want to convert the S3Object to a File object and return it.

I would also like to use try-with-resources and IOUtils for this.

My question is: what is the right way to use com.amazonaws.util.IOUtils.copy(InputStream in, OutputStream out) to get the File object? The File object will be used by the calling method for further processing, hence I do not want to save it locally. I tried something like this, but it throws FileNotFoundException.

I am sure I am missing something here; maybe IOUtils.copy has to be used differently, or I am not using try-with-resources correctly.

    public File getFileFromBucket(String fileName) {
        GetObjectRequest getObjectRequest = new GetObjectRequest(aWSBucketName, fileName);
        S3Object s3Object = aWSS3client.getObject(getObjectRequest);
        File s3File = new File(fileName);
        try (FileOutputStream fos = new FileOutputStream(s3File)) { // FileNotFoundException is thrown here
            IOUtils.copy(s3Object.getObjectContent(), fos);
        } catch (IOException e) {
            log.debug("IOException Occurred while fetching file {}", fileName);
            e.printStackTrace();
        }
        return s3File;
    }

What is the right way to return the File? Any help is appreciated, thanks in advance.

Edit

If I do not want to save a File Object locally, is it advisable to return the InputStream to the calling method?

As mentioned, the calling method uses the Apache POI library to parse the Excel file.

org.apache.poi.xssf.usermodel.XSSFWorkbook.XSSFWorkbook(InputStream is) takes an InputStream as a parameter anyway.

I do not know why I wanted to convert it to a File to begin with.

Is it advisable to send the S3Object.getObjectContent() as an InputStream to the calling method, or is there a better way to do it?


Answer:

After consulting my seniors, I found that it is fine to send the InputStream to the calling method.

If the InputStream needs to be saved to the local file system, IOUtils.copy can be used whenever necessary.

Converting the InputStream to a File and then back to an InputStream would be a waste of computation.

Hence my method is simply returning the ObjectContent like this:

public InputStream getFileInputStreamFromBucket(String fileName) {
    GetObjectRequest getObjectRequest = new GetObjectRequest(aWSBucketName, fileName);
    S3Object s3Object = aWSS3client.getObject(getObjectRequest);
    InputStream fileInputStream = s3Object.getObjectContent();
    log.debug("File Input Stream fetched from s3 bucket for File {} ", fileName);
    return fileInputStream;
}

This seems to be the right choice.

If anyone has a better way to send back the InputStream, please feel free to post an answer.
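
For completeness, a hedged sketch of how the calling side might consume and close this stream with Apache POI; the class name is illustrative, and only the XSSFWorkbook(InputStream) constructor comes from the question:

import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.IOException;
import java.io.InputStream;

public class WorkbookLoader {

    // Illustrative caller: try-with-resources closes the S3 stream once
    // POI has finished reading it into the workbook.
    static XSSFWorkbook loadWorkbook(InputStream s3ObjectContent) throws IOException {
        try (InputStream in = s3ObjectContent) {
            return new XSSFWorkbook(in);
        }
    }
}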

Question:

I'm currently working on some code that uploads multi-part objects to S3, and I am running into this error:

Caused by: com.amazonaws.ResetException: Failed to reset the request input stream;  If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)

Originally the read limit was set to 5 MB. I changed the code so that the read limit on the input stream would be the object size rounded up to the nearest 5 MB (with a 5 GB cap, since that's the AWS limit). This seemed to fix the issue, but now the same error is showing up in new places.

Does anyone have any suggestions for what value to set the read limit to for the most reliability?

Any help would be appreciated,

Thanks

Ted


Answer:

For those looking for an answer, the solution is to use a RetryPolicy with a backoff strategy. A backoff strategy gradually increases the amount of time between connection attempts.

http://docs.aws.amazon.com/general/latest/gr/api-retries.html

Furthermore, if you use a backoff strategy, the data you upload needs to come from a source that supports mark/reset, so the SDK can rewind the stream when a retry happens.

https://github.com/awsdocs/aws-java-developer-guide/blob/master/doc_source/best-practices.rst
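
A minimal sketch of both knobs mentioned above, assuming the AWS SDK for Java v1; the bucket name, key, retry count, and read limit are illustrative values, not recommendations:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.InputStream;

public class ResilientUploader {

    // Builds a client whose retry policy backs off between attempts.
    static AmazonS3 buildClient() {
        ClientConfiguration config = new ClientConfiguration()
                .withRetryPolicy(PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(5));
        return AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }

    // Uploads a stream with a read limit slightly larger than the part size,
    // so the SDK can mark/reset the stream if a part has to be retried.
    static void upload(AmazonS3 s3, InputStream in, long contentLength) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(contentLength);
        PutObjectRequest request = new PutObjectRequest("my-bucket", "my-key", in, metadata);
        request.getRequestClientOptions().setReadLimit(5 * 1024 * 1024 + 1); // illustrative: part size + 1
        s3.putObject(request);
    }
}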

Question:

I'm developing a Spring Boot application that should allow users to download files indirectly from Amazon S3 via the application's REST interface. For this purpose I have a REST controller that returns an InputStreamResource to the user, like the following:

@GetMapping(path = "/download/{fileId}")
public ResponseEntity<InputStreamResource> downloadFileById(@PathVariable("fileId") Integer fileId) {
    Optional<LocalizedFile> fileForDownload = fileService.getConcreteFileForDownload(fileId);

    if (!fileForDownload.isPresent()) {
        return ResponseEntity.notFound().build();
    }

    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"" + fileForDownload.get().getFilename() + "\"")
            .body(new InputStreamResource(fileService.download(fileForDownload.get())));
}

The download method in the file service looks like this:

@Override
public InputStream download(LocalizedFile file) {
    S3Object obj = s3client.getObject(bucketName, file.getFilename());
    return obj.getObjectContent();
}

My concern here is that this input stream from the Amazon SDK cannot be closed explicitly in the controller. The following warning in the AWS documentation of the getObjectContent() method makes me suspicious about the success of my approach described above:

If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.

Therefore my question: is it safe to return a ResponseEntity<InputStreamResource> from the controller? Will the InputStream from S3Object.getObjectContent() be closed automatically after a successful download? So far my approach has worked, but I'm not sure about possible future consequences.


Answer:

After some research I found an answer that should be applicable to my question.

Tl;dr version: Spring MVC handles the closing of the given input stream, so the approach described above should be safe.

Question:

I have a PNG image file in an AWS S3 bucket. I'm trying to get this image using the Java SDK. So far, here is what I have done:

public String encodeBase64URL(BufferedImage imgBuf) throws IOException {
    String base64;

    if (imgBuf == null) {
        base64 = null;
    } else {
        Base64 encoder = new Base64();
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        ImageIO.write(imgBuf, "PNG", out);

        byte[] bytes = out.toByteArray();
        base64 = "data:image/png;base64," + encoder.encode(bytes)
                .toString();
    }

    return base64;
}

public String saveImageToS3(BufferedImage imgBuf, String id) throws IOException {
    AmazonS3 s3 = new AmazonS3Client();

    File file = File.createTempFile(id, ".png");
    file.deleteOnExit();

    ImageIO.write(imgBuf, "png", file);

    s3.putObject(new PutObjectRequest(bucketName, id, file));
    file.delete();

    return bucketName;
}

public String downloadImageFromS3(String id) throws IOException {
    AmazonS3 s3 = new AmazonS3Client();

    S3Object obj = s3.getObject(new GetObjectRequest(bucketName, id));
    BufferedImage imgBuf = ImageIO.read(obj.getObjectContent());

    String base64 = encodeBase64URL(imgBuf);
    return base64;
}

The saveImageToS3 method works exactly as expected. However, the download method returns data:image/png;base64,[B@428d24fa while it should have been valid Base64.

What am I doing wrong? Please help.


Answer:

Instead of calling toString() on the resulting byte[], which produces the very unhelpful [B@428d24fa format, you need to create a new String from the returned bytes:

base64 = "data:image/png;base64," + new String(encoder.encode(bytes), "UTF-8");

Question:

I'm getting files from an FTP server. The files are either text files or tar.gz archives.

For text files I just send them to S3. If I encounter a tar.gz, I want to untar it and save each contained file with the same method.

public void handleFile() {

    try (InputStream fileStream = ftpFileService.getInputStream(file)) {
        if (file.getName().lastIndexOf(".TGZ") > -1) {
            TarArchiveInputStream tar = new TarArchiveInputStream(new GzipCompressorInputStream(fileStream));
            TarArchiveEntry entry = null;
            while ((entry = tar.getNextTarEntry()) != null) {
                LOGGER.info("fileName to save {}", entry.getName());
                saveStreamToS3(entry.getName(), new InputStream(tar));
            }
            tar.close();
        } else {
            LOGGER.info("fileName to save {}", fileName.getName());
            saveStreamToS3(fileName.getName(), fileStream);
        }
    } catch (IOException e) {
        LOGGER.error(e.getMessage());
    }
}

I tried saving the entry directly using new FileInputStream(entry.getFile()) but this returns null.

Do I need to make a saveTarStreamToS3() or can I make an InputStream out of a TarArchiveInputStream?


Answer:

FileInputStream only reads real files. It doesn't read data from inside an archive.

There are two possible solutions:

  • Use InputStream, which both FileInputStream and TarArchiveInputStream extend.
  • Copy the file to disk, read it using FileInputStream, and delete it afterwards.

The purpose of the InputStream abstraction is that you don't need to know where the data comes from, and this is the natural way to solve this.

can I make an InputStream out of a TarArchiveInputStream

TarArchiveInputStream is an InputStream, so there is nothing more to do.
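
A hedged sketch of what that looks like in practice; saveStreamToS3 here is a hypothetical stand-in for the question's upload method, extended with a content length because an S3 upload generally needs to know how many bytes to read from the current entry:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

import java.io.IOException;
import java.io.InputStream;

public class TarToS3 {

    // Each entry is read from the same TarArchiveInputStream, which is itself
    // an InputStream positioned at the current entry's data.
    static void saveEntries(InputStream fileStream) throws IOException {
        try (TarArchiveInputStream tar =
                     new TarArchiveInputStream(new GzipCompressorInputStream(fileStream))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // only upload regular files
                }
                // Pass the tar stream itself; the upload must not close it between
                // entries, and should read at most entry.getSize() bytes.
                saveStreamToS3(entry.getName(), tar, entry.getSize());
            }
        }
    }

    // Hypothetical upload helper standing in for the question's saveStreamToS3.
    static void saveStreamToS3(String key, InputStream in, long contentLength) {
        // e.g. set the ObjectMetadata content length to contentLength and call putObject
    }
}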

Question:

I would like to download a big file from Amazon S3 into RAM. The file is bigger than the available RAM, so it seems I need to load it in parts; each part would be returned by an endpoint. I also cannot use the hard drive to store the downloaded file. I have an InputStream object and I am trying to load the object like below:

    inputStream.skip(totalBytes);
    long downloadedBytesCount = 0;
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    int length;
    do {
        length = inputStream.read(buffer);
        result.write(buffer, 0, length);
        downloadedBytesCount += length;
    }
    while (downloadedBytesCount <= partOfFileSize && (length != -1));
    totalBytes += downloadedBytesCount;

but that code has a problem: each new request starts downloading the file from the beginning, so the last request (for example, for 20 MB) ends up downloading the whole file (for example, 1 GB). So the skip(long) method doesn't work as I expected.

How can I download the file from the InputStream in parts? Any suggestions?


Answer:

The standard S3 library can transfer whatever parts of the file you want:

(taken from the AWS docs)

GetObjectRequest rangeObjectRequest = new GetObjectRequest(
        bucketName, key);
rangeObjectRequest.setRange(0, 10); // retrieve 1st 11 bytes.
S3Object objectPortion = s3Client.getObject(rangeObjectRequest);

InputStream objectData = objectPortion.getObjectContent();

In your program you could, for example, read 1000 bytes at a time by moving the range.
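
A sketch of that idea, assuming the AWS SDK for Java v1; the chunking parameters are illustrative, and the caller is expected to advance the offset by chunkSize between calls:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

import java.io.IOException;
import java.io.InputStream;

public class RangedDownloader {

    // Downloads one chunk per request by moving the byte range; only chunkSize
    // bytes are transferred and held in memory for each call.
    static byte[] downloadChunk(AmazonS3 s3Client, String bucketName, String key,
                                long offset, long chunkSize) throws IOException {
        GetObjectRequest request = new GetObjectRequest(bucketName, key)
                .withRange(offset, offset + chunkSize - 1); // the range is inclusive
        S3Object part = s3Client.getObject(request);
        try (InputStream in = part.getObjectContent()) {
            return IOUtils.toByteArray(in);
        }
    }
}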

Question:

I'm currently getting an input stream using Amazon's S3 storage like below:

public static InputStream getResourceAsStream(String fileKey) {
    AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
    return s3Client.getObject(new GetObjectRequest(bucketName, fileKey)).getObjectContent();
}

The files I'm dealing with are .mp4 videos. I'm attempting to stream the video using Range-Requests to allow for fast forwarding and rewinding for the end user. My streaming was working fine when I was using regular File objects, but now I'm receiving a stream from Amazon S3 instead of using a file saved on a local machine.

How can I figure out the length of this stream? Before, I did this:

Long length = Files.size(filepath); 

But now this does not work as I do not have a file saved directly on the machine running the Java code. Is there any way to figure out the length of the InputStream?


Answer:

You can't get the length of an InputStream unless you read it from start to end and count the number of bytes you've read yourself.

You can get the length of an S3Object though:

public static InputStream getResourceAsStream(String fileKey) {
    AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
    S3Object obj = s3Client.getObject(new GetObjectRequest(bucketName, fileKey));
    long length = obj.getObjectMetadata().getInstanceLength();
    //use length
    return obj.getObjectContent();
}
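
If only the length is needed, for example to build a Content-Range header, a metadata-only request (a HEAD request under the hood) avoids opening the object stream at all; a sketch, assuming the AWS SDK for Java v1:

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class ObjectLength {

    // Fetches only the object metadata, not the content.
    static long getResourceLength(String bucketName, String fileKey) {
        AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
        return s3Client.getObjectMetadata(bucketName, fileKey).getContentLength();
    }
}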

Question:

I am processing a huge CSV file (1 GB) using Java code.

My application is running on a 2-core machine with 8 GB of memory.

I am using the following command to start my application:

java -Xms4g -Xmx6g  -cp $CLASSPATH JobSchedulerService

The application starts a thread to download the CSV from S3 and process it. The application works fine for some time, but it throws an OutOfMemoryError halfway through processing the file.

I am looking for a way to continue processing the CSV file while keeping memory usage low.

In the CSV processing I perform the following steps:

//Step 1: Download from S3
String bucketName = env.getProperty(AWS_S3_BUCKET_NAME);
AmazonS3 s3Client = new AmazonS3Client(credentialsProvider);
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream inputStream = s3object.getObjectContent();   // This stream contains about 1 GB of data

//Step 2: Parse CSV to Java
ObjectReader oReader = CSV_MAPPER.readerFor(InboundProcessing.class).with(CSV_SCHEMA);
try (FileOutputStream fos = new FileOutputStream(outputCSV, Boolean.FALSE)) {
    SequenceWriter sequenceWriter = CsvUtils.getCsvObjectWriter(InboundProcessingDto.class).writeValues(fos);
    MappingIterator<InboundProcessing> mi = oReader.readValues(inputStream);

    while (mi.hasNextValue()) {
        InboundProcessing inboundProcessing = mi.nextValue();
        inboundProcessingRepository.save(inboundProcessing);   // Spring Data JPA save (almost 3M records, so 3M calls)
        sequenceWriter.write(inboundProcessingDto);  // writes to a CSV file on the local file system, which is uploaded to S3 in the next step
    }
} catch (Exception e) {
    throw new FBMException(e);
}

Answer:

1) Split the big file into smaller files.

2) Process each file one by one, sequentially or in parallel.

See this answer for splitting a file into smaller pieces: https://stackoverflow.com/a/2356156/8607192

Or use the Unix split command to split based on size.
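
A minimal Java sketch of the first suggestion, splitting a large CSV into chunk files of a fixed number of lines; the chunk size, file naming, and lack of header handling are illustrative simplifications:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvSplitter {

    // Splits a large CSV into files of at most linesPerChunk lines each,
    // streaming line by line so memory use stays roughly constant.
    static void split(Path source, Path outputDir, int linesPerChunk) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            int lineCount = 0;
            int chunkIndex = 0;
            BufferedWriter writer = null;
            while ((line = reader.readLine()) != null) {
                if (lineCount % linesPerChunk == 0) {
                    if (writer != null) {
                        writer.close();
                    }
                    Path chunk = outputDir.resolve("chunk-" + chunkIndex++ + ".csv");
                    writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                }
                writer.write(line);
                writer.newLine();
                lineCount++;
            }
            if (writer != null) {
                writer.close();
            }
        }
    }
}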