Hot questions for using Amazon S3 in REST

Question:

I have a server that generates AWS S3 pre-signed PUT URLs, and I'm trying to upload a byte[] to one of those URLs using RestTemplate with this code:

RestTemplate restTemplate = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setAccept(Arrays.asList(MediaType.ALL));
HttpEntity<byte[]> entity = new HttpEntity<>("Testing testing testing".getBytes(), headers);
System.out.println(restTemplate.exchange(putUrl, HttpMethod.PUT, entity, String.class));

When I run that code, I get this error:

Exception in thread "JavaFX Application Thread" org.springframework.web.client.HttpClientErrorException: 400 Bad Request
    at org.springframework.web.client.DefaultResponseErrorHandler.handleError(DefaultResponseErrorHandler.java:63)
    at org.springframework.web.client.RestTemplate.handleResponse(RestTemplate.java:700)
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:653)
    at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:613)
    at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:531)
    at tech.dashman.dashman.controllers.RendererAppController.lambda$null$2(RendererAppController.java:95)

Unfortunately, there's nothing in the AWS S3 logs, so I'm not sure what's going on. If I take that exact same URL and put it in IntelliJ IDEA's REST client, it just works (it creates an empty file in S3).

Any ideas what's wrong with my Java code?

Here's a full example that does the signing and tries to upload a small payload to S3:

import com.amazonaws.HttpMethod;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import org.joda.time.DateTime;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.web.client.RestTemplate;
import java.util.Date;

public class S3PutIssue {
    public static void main(String[] args) {
        String awsAccessKeyId = "";
        String awsSecretKey = "";
        String awsRegion = "";
        String path = "";
        String awsBucketName = "";
        BasicAWSCredentials awsCredentials = new BasicAWSCredentials(awsAccessKeyId, awsSecretKey);
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(awsRegion).
                withCredentials(new AWSStaticCredentialsProvider(awsCredentials)).build();
        Date expiration = new DateTime().plusDays(1).toDate();
        GeneratePresignedUrlRequest urlRequest = new GeneratePresignedUrlRequest(awsBucketName, path);
        urlRequest.setMethod(HttpMethod.PUT);
        urlRequest.setExpiration(expiration);
        String putUrl = s3Client.generatePresignedUrl(urlRequest).toString();

        RestTemplate restTemplate = new RestTemplate();
        HttpHeaders headers = new HttpHeaders();
        HttpEntity<byte[]> entity = new HttpEntity<>("Testing testing testing".getBytes(), headers);
        restTemplate.exchange(putUrl, org.springframework.http.HttpMethod.PUT, entity, Void.class);
    }
}

Answer:

The source of the issue is double encoding of URL characters. The pre-signed URL contains / characters (from the extended secret key/signature), which s3Client.generatePresignedUrl encodes as %2F. When the already-encoded string is passed to restTemplate.exchange, it is internally converted to a URI and encoded a second time (%2F becomes %252F) by the UriTemplateHandler, as can be seen in the RestTemplate source code:

@Override
@Nullable
public <T> T execute(String url, HttpMethod method, @Nullable RequestCallback requestCallback,
        @Nullable ResponseExtractor<T> responseExtractor, Object... uriVariables) throws RestClientException {

    URI expanded = getUriTemplateHandler().expand(url, uriVariables);
    return doExecute(expanded, method, requestCallback, responseExtractor);
}

So the easiest solution is to convert the URL to a URI using URL.toURI(). If you only have a String (not a URI) by the time RestTemplate is invoked, two options are possible.

Pass a URI instead of a String to the exchange method:

restTemplate.exchange(new URI(putUrl.toString()), HttpMethod.PUT, entity, Void.class);
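
If you still have the java.net.URL returned by generatePresignedUrl (as in the full example above), you can call URL.toURI() directly instead of going through a String; a minimal sketch, reusing the names from that example (toURI() throws URISyntaxException, so it needs handling or a throws clause):

URL presignedPutUrl = s3Client.generatePresignedUrl(urlRequest); // java.net.URL, already percent-encoded once
restTemplate.exchange(presignedPutUrl.toURI(), org.springframework.http.HttpMethod.PUT, entity, Void.class);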

Create a DefaultUriBuilderFactory with the NONE encoding mode and set it as the RestTemplate's UriTemplateHandler:

DefaultUriBuilderFactory defaultUriBuilderFactory = new DefaultUriBuilderFactory();
defaultUriBuilderFactory.setEncodingMode(DefaultUriBuilderFactory.EncodingMode.NONE);
restTemplate.setUriTemplateHandler(defaultUriBuilderFactory);
restTemplate.exchange(putUrl.toString(), org.springframework.http.HttpMethod.PUT, entity, Void.class);

Question:

Looking for some insight on how to configure Secor to output fatter files that are partitioned by datetime rather than by Kafka offset, something akin to hourly backups of Kafka topic streams. Currently, my common.properties file contains these Secor configs:

secor.generation=1
secor.consumer.threads=7
secor.messages.per.second=10000
secor.offsets.per.partition=10000000
secor.topic_partition.forget.seconds=600
secor.local.log.delete.age.hours=-1
secor.file.reader.writer.factory=com.pinterest.secor.io.impl.SequenceFileReaderWriterFactory
secor.max.message.size.bytes=100000

This file mentions that a partition could describe the date of a message:

LogFilePath.java:

(line 29) Log file path has the following form: prefix/topic/partition1/.../partitionN/generation_kafkaParition_firstMessageOffset

(line 34) "partition1, ..., partitionN is the list of partition names extracted from message content. E.g., the partition may describe the message date such as dt=2014-01-01 [...]"


Answer:

From Secor's README, on the JSON date parser: a parser that extracts timestamps from JSON messages and groups the output based on the date, similar to the Thrift parser above. To use this parser, start Secor with the secor.prod.partition.properties properties file and set secor.message.parser.class=com.pinterest.secor.parser.JsonMessageParser. You may override the field used to extract the timestamp by setting the message.timestamp.name property.
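
In common.properties terms that amounts to something like the following (the field name in message.timestamp.name is just a placeholder; use whatever field your JSON messages actually carry):

secor.message.parser.class=com.pinterest.secor.parser.JsonMessageParser
message.timestamp.name=timestamp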

Question:

I'm trying to return images from my S3 bucket. Everything works perfectly: my service uploads and gets images, but the problem is that when I try to return them through the REST controller, the image doesn't appear in the browser.

Here is where I fetch the image from S3:

public S3Object getImageFromS3Bucket(String fileName) {
    S3Object object = s3client.getObject(new GetObjectRequest(bucketName, fileName));
    return object;
}

public byte[] getByteArrayFromImageS3Bucket(String fileName) throws IOException {
    InputStream in = getImageFromS3Bucket(fileName).getObjectContent();
    byte[] byteArray = IOUtils.toByteArray(in);
    in.close();

    return byteArray;
}

and here is my controller:

@RequestMapping(value = "/getImage/{fileName}", method = RequestMethod.GET)
@ResponseStatus(value = HttpStatus.OK)
@ResponseBody
public ResponseEntity<byte[]> downloadImage(@PathVariable("fileName") String fileName) throws IOException {
    byte[] media = s3BucketTestService.getByteArrayFromFile(fileName);
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.IMAGE_PNG);
    headers.setContentLength(media.length);

    return new ResponseEntity<>(media, headers, HttpStatus.OK);
}

Here are the response headers:

Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Length: 10645
Content-Type: image/png
Date: Mon, 14 May 2018 09:41:47 GMT
Expires: 0
Pragma: no-cache
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block

I also have this bean in my configuration:

@Bean
public ByteArrayHttpMessageConverter byteArrayHttpMessageConverter() {
    ByteArrayHttpMessageConverter arrayHttpMessageConverter = new ByteArrayHttpMessageConverter();
    arrayHttpMessageConverter.setSupportedMediaTypes(getSupportedMediaTypes());
    return arrayHttpMessageConverter;
}

private List<MediaType> getSupportedMediaTypes() {
    List<MediaType> list = new ArrayList<MediaType>();
    list.add(MediaType.IMAGE_JPEG);
    list.add(MediaType.IMAGE_PNG);
    list.add(MediaType.APPLICATION_OCTET_STREAM);
    return list;
}

Does anyone know what I'm doing wrong? The browser creates a PNG image with the exact size of the file I'm requesting, so I guess the only problem is the visualization of the image?

PS: I don't want to make the image downloadable; I just want to serve the resource so the front-end can read it.


Answer:

Try using the ImageIO API to convert the raw bytes from the S3 bucket into image bytes, as below:

public byte[] getByteArrayFromImageS3Bucket(String fileName) throws IOException {
    InputStream in = getImageFromS3Bucket(fileName).getObjectContent();

    // Decode the S3 object content and re-encode it as PNG bytes.
    BufferedImage imageFromAWS = ImageIO.read(in);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ImageIO.write(imageFromAWS, "png", baos);
    byte[] imageBytes = baos.toByteArray();
    in.close();
    return imageBytes;
}

Question:

val sc = new SparkContext(conf)

val streamContext = new StreamingContext(sc, Seconds(1))

val log = Logger.getLogger("sqsLog")
val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5))


val jsonRows = sqs.mapPartitions(partitions => {
  val s3Client = new AmazonS3Client(new BasicCredentialsProvider(sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY")))

  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  val sqlSession = SparkSession
    .builder()
    .getOrCreate()

  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "/tmp/spark/presto"

  partitions.map(messages => {
    val sqsMsg = Json.parse(messages)
    System.out.println(sqsMsg)

    val bucketName = Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "")
    val key = Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
    System.out.println(bucketName)
    System.out.println(key)
    val obj = s3Client.getObject(new GetObjectRequest(bucketName, key))
    val stream = obj.getObjectContent()
    scala.io.Source.fromInputStream(stream).getLines().map(line => {
        try{
          val str = txfm.parseLine(line)
          val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(str)
          jsonDf.write.mode("append").format("orc").option("compression","zlib").save(outputPath)
        }
        catch {
          case e: Throwable => {log.info(line); "";}
        }
      }).filter(line => line != "{}")
    })
})

streamContext.start()
streamContext.awaitTermination()

My job is really simple: we take an S3 key from SQS. The content of the file is an nginx log, and we parse it with our parser, LogLine2Json, which is working fine. It converts the log lines to JSON, and then we write them out in ORC format.

But I'm getting this error

java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:163)
    at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:513)
    at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:573)
    at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
    at SparrowOrc$.main(sparrowOrc.scala:159)
    at SparrowOrc.main(sparrowOrc.scala)

I understand that Spark needs an action, otherwise it won't work. But I have this code that writes to an ORC file. Do I have to do anything else?

jsonDf.write.mode("append").format("orc").option("compression","zlib").save(outputPath)

Answer:

First of all, map is not an action; it is a transformation, so Spark has no reason to execute this code.

Next, you should avoid side effects in transformations, and you should never rely on them when correctness of the output is required.

Finally, using standard I/O functions (like writing to a local path such as /tmp) in a distributed system is typically meaningless.

Overall, you should review the existing options for DStream sinks, and if none of these is suitable in your scenario, write your own using an output operation such as foreachRDD (applying foreach or foreachPartition to the underlying RDDs).
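
As a rough illustration of that last point, here is a Java sketch (Java to match the other Spark examples on this page; jsonLines and outputPath are invented placeholders, and the schema handling from the question is omitted) of registering an output operation via foreachRDD so the streaming graph becomes valid:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.api.java.JavaDStream;

public class OrcSinkSketch {
    // Attach an output operation so StreamingContext.start() has something to execute.
    static void attachSink(JavaDStream<String> jsonLines, String outputPath) {
        jsonLines.foreachRDD((JavaRDD<String> rdd) -> {
            if (rdd.isEmpty()) {
                return; // nothing arrived in this batch
            }
            SparkSession spark = SparkSession.builder().getOrCreate();
            Dataset<Row> df = spark.read().json(rdd);  // parse the JSON lines
            df.write().mode("append")
              .format("orc")
              .option("compression", "zlib")
              .save(outputPath);                       // the write itself is the action
        });
    }
}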

Question:

I am trying to create a REST API that takes an S3 (AWS) path as a path variable, and I have some problems.

  1. @GetMapping(value="files/{filePath}", produces="application/json"): if I pass, for example, the path myFiles/uni/mymarks.txt, it is not treated as a variable but as a full path, so the request cannot be mapped. Any recommendations on how I can pass it as a variable? (The slashes cause this issue.)

  2. If the filePath variable contains something like name.txt, only name is kept and the .txt is removed, so the value is not correct when I want to use it later in the code. Any adjustments?


Answer:

How about?

@RequestMapping(path = "/files/**", method = RequestMethod.GET)
public ResponseEntity<String> s3ProxyGet(HttpServletRequest request) {
    String path = new UrlPathHelper().getPathWithinApplication(request);
    ...

Then:

curl http://localhost:8080/files/uni/mymarks.txt

will map to this controller method, and path will contain /files/uni/mymarks.txt. You can then trim /files from the front and perform the fetch from S3.
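
The trimming itself is trivial; a small sketch (bucketName and s3client are assumed to exist elsewhere in your controller):

String key = path.substring("/files/".length());  // e.g. "uni/mymarks.txt" -- the extension stays intact
S3Object object = s3client.getObject(new GetObjectRequest(bucketName, key));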

UrlPathHelper is from the spring-web library.

HTH

Question:

I have a REST service that streams content to an AWS S3 bucket. I was wondering what the alternatives are for encrypting that stream to the bucket. My primary requirement is that the encryption key is auto-managed (for example with KMS); after that, performance.

Is it possible to encrypt InputStreams with KMS without having to use a byte[] and buffer the whole content in memory?

I am planning to upload some large files (> 1 GB), which is why I'm streaming, to avoid OOM errors. What would be the preferred approach for this? Is there any significant difference compared to small files (< 10 MB)?

Thanks


Answer:

I assume you are using the version of PutObjectRequest that takes an InputStream.

  1. The data transmission to S3 is going over SSL, so the network transmission is encrypted.

  2. To use a KMS key to encrypt the data stored on S3, you simply specify the KMS key you want S3 to use: after creating the PutObjectRequest, call putObjectRequest.withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(keyID)); before the call to s3Client.putObject() (see the sketch after this list).
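
A minimal sketch of point 2 (AWS SDK for Java v1; bucket, key, keyId, contentLength and inputStream are placeholders, and the classes come from com.amazonaws.services.s3.model):

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(contentLength); // a known length lets the SDK stream instead of buffering everything

PutObjectRequest putObjectRequest = new PutObjectRequest(bucket, key, inputStream, metadata)
        .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(keyId));

s3Client.putObject(putObjectRequest);     // S3 encrypts the object at rest with the given KMS key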

Alternatively, if you want to do all the encryption on the client-side, you can use the AmazonS3EncryptionClient and EncryptedPutObjectRequest classes in the Java AWS SDK.

Question:

I need to make a couple of services that will talk to both Amazon S3 and Riak CS.

They will handle the same operations, e.g. retrieve images.

They return different objects, though; in S3's case, an S3Object. Is the proper way to design this to have a different class for each, without a common interface?

I've been thinking about how to apply a common interface to both, but the differing return types of the methods are causing me some issues. I might just be going about this the wrong way and should probably keep them separate, but I'm hoping to get some clarification here.

Thanks all!


Answer:

Typically, you do this by wrapping the responses from the various external services with your own classes that have a common interface. You also wrap the services themselves, so when you call your service wrappers they all return your wrapped data classes. You've then isolated all references to the external service into one package. This also makes it easy to add or remove services.
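
As an illustration only (every name here is invented for the example, and each type would live in its own file), the shape of such a wrapper might look like this:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// The common return type that your own code depends on.
final class StoredImage {
    private final byte[] content;
    private final String contentType;

    StoredImage(byte[] content, String contentType) {
        this.content = content;
        this.contentType = contentType;
    }

    byte[] getContent()     { return content; }
    String getContentType() { return contentType; }
}

// The common service interface; a Riak CS implementation would return the same type.
interface ImageStore {
    StoredImage fetch(String key);
}

// S3-specific details (S3Object, streams, ...) never leak past this class.
class S3ImageStore implements ImageStore {
    private final AmazonS3 s3;
    private final String bucket;

    S3ImageStore(AmazonS3 s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    @Override
    public StoredImage fetch(String key) {
        S3Object object = s3.getObject(bucket, key);
        try (InputStream in = object.getObjectContent()) {
            return new StoredImage(IOUtils.toByteArray(in),
                                   object.getObjectMetadata().getContentType());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Callers only ever see ImageStore and StoredImage, so swapping S3 for Riak CS (or adding a third backend) does not ripple through the rest of the code.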

Question:

I'm working on an API that should read all the images stored in an Amazon S3 bucket and deliver them to the front-end for display (and further operations). I can use the code below to access all the stored images, but what is the correct format in which to deliver them to the FE?

My code is provided here:

@GetMapping(value = "/findAllImages")
    public ResponseEntity<List<String>> findNamesOfAllImages() {

        List<String> names = new ArrayList<>();

        String password = ProcBuilder.run(
                "security",
                "-i", "find-generic-password",
                "-l", Parameters.getAppName(), "-w"
        );

        AWSCredentials credentials = new BasicAWSCredentials(
                Parameters.getAccessKey(),
                password
        );

        AmazonS3 s3client = AmazonS3ClientBuilder
                .standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion(Regions.US_EAST_1)
                .build();

        ObjectListing objectListing = s3client.listObjects(Parameters.getBucketName());

        for (S3ObjectSummary s3ObjectSummary : objectListing.getObjectSummaries()) {
            names.add(s3ObjectSummary.getKey());
        }

        return ResponseEntity.status(HttpStatus.CREATED).body(names);
    }

At the moment I only read the names of the images and store them in an ArrayList.


Answer:

I have used an OutputStream to send the data to the FE. I am not completely sure this is the way to go, but here is the code:

@GetMapping(value = "/findAllImages")
    public ResponseEntity<List<OutputStream>> findNamesOfAllImages() {

        List<String> names = new ArrayList<>();

        String password = ProcBuilder.run(
                "security",
                "-i", "find-generic-password",
                "-l", Parameters.getAppName(), "-w"
        );

        AWSCredentials credentials = new BasicAWSCredentials(
                Parameters.getAccessKey(),
                password
        );

        AmazonS3 s3client = AmazonS3ClientBuilder
                .standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion(Regions.US_EAST_1)
                .build();

        ObjectListing objectListing = s3client.listObjects(Parameters.getBucketName());

        for (S3ObjectSummary s3ObjectSummary : objectListing.getObjectSummaries()) {
            names.add(s3ObjectSummary.getKey());
        }

        List<OutputStream> outputStreams = new ArrayList<>();
        S3Object object = null;

        File file;
        int count = 0;

        for (String name : names) {

            // Download each object and copy its content to a local file.
            object = s3client.getObject(new GetObjectRequest(Parameters.getBucketName(), name));

            InputStream reader = new BufferedInputStream(object.getObjectContent());
            file = new File(name + (++count));

            try {

                OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));
                int read;

                while ((read = reader.read()) != -1) {
                    writer.write(read);
                }

                writer.flush();
                writer.close();
                reader.close();

                outputStreams.add(writer); // add the stream once per file, not once per byte

            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        return ResponseEntity.status(HttpStatus.CREATED).body(outputStreams);
    }

Question:

I am new to AWS.

What I have done :

  1. created a Spring Boot REST web service with a few REST APIs exposed
  2. checked out the code onto an AWS EC2 instance
  3. executed the Spring Boot application

I have enabled the required port for communication, so I am able to access the REST APIs from a browser client.

But my AWS service calls use an access key / secret key pair for authentication when the application context loads up for a given user.

Now I am working on removing the key-based authentication for AWS services and switching to IAM-role-based authentication, so I don't have to ship keys in source code or in EC2 instance config files.

What I understood about IAM roles is that I have to create an IAM role which will be consumed by the REST API's clients for AWS service authentication.

What services should I allow in the AWS IAM role to be able to call my REST API service?


Answer:

  1. created an IAM role with the required permissions
  2. added EC2 in the trust relationship for the IAM role
  3. deployed my application on EC2

voila
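
With the role attached to the instance, the application code simply stops passing keys; the SDK's default credentials provider chain resolves the instance-profile credentials on its own. A minimal sketch (the region is an example value):

// No BasicAWSCredentials and no withCredentials(...): the default provider chain
// picks up the credentials supplied by the IAM role attached to the EC2 instance.
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)
        .build();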

Question:

I am using AWS S3 as a backup storage for data coming in to our Spark cluster. Data comes in every second and is processed when 10 seconds of data has been read. The RDD containing the 10 seconds of data is stored to S3 using

rdd.saveAsObjectFile(s3URL + dateFormat.format(new Date()));

This means that we get a lot of files added to S3 each day in the format of

S3URL/2017/07/23/12/00/10, S3URL/2017/07/23/12/00/20 etc

From here it is easy to restore the RDD which is a

JavaRDD<byte[]>

using either

sc.objectFile or the AmazonS3 API

The problem is that, to reduce the number of files that have to be iterated through, we run a daily cron job that goes through each file of a day, bunches the data together, and stores the new RDD to S3. This is done as follows:

List<byte[]> dataList = new ArrayList<>(); // A list of all read messages
    /* Get all messages from S3 and store them in the above list */
    try {
        final ListObjectsV2Request req = new ListObjectsV2Request().withBucketName("bucketname").withPrefix("logs/" + dateString);
        ListObjectsV2Result result;
        do {               
           result = s3Client.listObjectsV2(req);
           for (S3ObjectSummary objectSummary : 
               result.getObjectSummaries()) {
               System.out.println(" - " + objectSummary.getKey() + "  " +
                       "(size = " + objectSummary.getSize() + 
                       ")");
               if(objectSummary.getKey().contains("part-00000")){ // The messages are stored in files named "part-00000"
                   S3Object object = s3Client.getObject(
                           new GetObjectRequest(objectSummary.getBucketName(), objectSummary.getKey()));
                   InputStream objectData = object.getObjectContent();
                   byte[] byteData = new byte[(int) objectSummary.getSize()]; // The size of the messages differ
                   objectData.read(byteData);
                   dataList.add(byteData); // Add the message to the list
                   objectData.close();
               }
           }
           /* When iterating, messages are split into chunks called continuation tokens.
            * All tokens have to be iterated through to get all messages. */
           System.out.println("Next Continuation Token : " + result.getNextContinuationToken());
           req.setContinuationToken(result.getNextContinuationToken());
        } while(result.isTruncated() == true ); 
     } catch (AmazonServiceException ase) {
        System.out.println("Caught an AmazonServiceException, " +
                "which means your request made it " +
                "to Amazon S3, but was rejected with an error response " +
                "for some reason.");
        System.out.println("Error Message:    " + ase.getMessage());
        System.out.println("HTTP Status Code: " + ase.getStatusCode());
        System.out.println("AWS Error Code:   " + ase.getErrorCode());
        System.out.println("Error Type:       " + ase.getErrorType());
        System.out.println("Request ID:       " + ase.getRequestId());
    } catch (AmazonClientException ace) {
        System.out.println("Caught an AmazonClientException, " +
                "which means the client encountered " +
                "an internal error while trying to communicate" +
                " with S3, " +
                "such as not being able to access the network.");
        System.out.println("Error Message: " + ace.getMessage());
    } catch (IOException e) {
        e.printStackTrace();
    }
    JavaRDD<byte[]> messages = sc.parallelize(dataList); // Loads the messages into an RDD
    messages.saveAsObjectFile("S3URL/daily_logs/" + dateString);

This all works fine, but now I am not sure how to actually restore the data to a manageable state again. If I use

sc.objectFile

to restore the RDD, I end up with a JavaRDD<byte[]> where each byte[] is actually a JavaRDD<byte[]> in itself. How can I restore the nested JavaRDD from the byte[] elements located in the JavaRDD<byte[]>?

I hope this somehow makes sense, and I am grateful for any help. In a worst-case scenario I will have to come up with another way to back up the data.

Best regards Mathias


Answer:

I solved it by not storing a nested RDD: instead, I flat-mapped all the byte[] into a single JavaRDD and stored that one.
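
For reference, a rough sketch of one way to end up with a single flat JavaRDD<byte[]> for the day (not the exact code used): read each ten-second chunk back with sc.objectFile, as in the question, and union the chunks instead of nesting serialized RDDs. tenSecondPaths and dateString are placeholders here.

JavaRDD<byte[]> daily = sc.emptyRDD();
for (String path : tenSecondPaths) {                      // e.g. "S3URL/2017/07/23/12/00/10", ...
    JavaRDD<byte[]> chunk = sc.objectFile(path);          // restores the original byte[] records
    daily = daily.union(chunk);
}
daily.saveAsObjectFile("S3URL/daily_logs/" + dateString); // one flat RDD, restorable with sc.objectFile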