Hot questions for Using PDFBox in image

Question:

Hello again fellow programmers.

I can extract the coordinates and formatting of PDF text properly, but I can't do the same for images: I get the proper width and height, but the x and y are wrong.

I'm using Photoshop to check whether I'm getting the proper x, y, width and height, but only the width and height are correct.

Here is my code:

@Override
public void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    if ("cm".equals(operator.getName())) {
        float width = ((COSNumber)arguments.get(0)).floatValue();
        float height = ((COSNumber)arguments.get(3)).floatValue();
        float x = ((COSNumber)arguments.get(4)).floatValue();
        float y = ((COSNumber)arguments.get(5)).floatValue();
        System.out.println("w: " + width + " h: " + height + " x: " + x + " y: " + y);
        // process image coordinates
    }

    super.processOperator(operator, arguments);
}

And here is the example PDF I used:

http://persci.mit.edu/pub_pdfs/personal_photo_enhancement.pdf

and I'm using the page 2.

This is the output of the program:

w: 503.87997 h: 152.64 x: 71.5168 y: 561.056

I created a rectangle in Photoshop and overlaid it on the image, but only the width and height are correct.


Another problem

I used this PDF

http://www.ctex.org/documents/shredder/src/example.pdf

I used the page 17.

Why does the PDF show many coordinates, but the image in the PDF is only one?

w: 1.0 h: 1.0 x: 124.802 y: 776.998
w: 1.0 h: 1.0 x: 0.0 y: 3.587
w: 1.0 h: 1.0 x: 0.0 y: -3.985
w: 1.0 h: 1.0 x: 343.711 y: 0.398
w: 1.0 h: 1.0 x: -343.711 y: -24.906
w: 1.0 h: 1.0 x: 147.972 y: -106.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 0.1 h: 0.1 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 45.0 y: 0.0
w: 1.0 h: 1.0 x: -79.37 y: -21.918
w: 1.0 h: 1.0 x: 116.507 y: 0.0
w: 1.0 h: 1.0 x: -230.109 y: -2.145
w: 1.0 h: 1.0 x: 0.0 y: -20.324
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 179.886 y: -66.21
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -215.552 y: -17.195
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: -35.666 y: -76.173
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -41.843
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -51.806
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 175.592 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -185.554 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: -37.121
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 282.916 y: -18.389
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -318.582 y: -17.196
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 11.988 y: -11.216
w: 1.0 h: 1.0 x: 0.0 y: -14.833
w: 1.0 h: 1.0 x: 3.388 y: 4.926
w: 1.0 h: 1.0 x: 60.357 y: -4.926
w: 1.0 h: 1.0 x: -63.745 y: -0.399
w: 1.0 h: 1.0 x: 63.944 y: -3.985
w: 1.0 h: 1.0 x: -59.959 y: 0.0
w: 1.0 h: 1.0 x: 64.143 y: 0.0
w: 1.0 h: 1.0 x: -110.801 y: -13.101
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: 39.308 y: 2.241
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: -37.066 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.294
w: 1.0 h: 1.0 x: 1.145 y: -9.907
w: 1.0 h: 1.0 x: 39.641 y: 11.302
w: 1.0 h: 1.0 x: 0.0 y: -15.686
w: 1.0 h: 1.0 x: 1.693 y: 14.291
w: 1.0 h: 1.0 x: 0.0 y: -12.896
w: 1.0 h: 1.0 x: 3.288 y: 2.989
w: 1.0 h: 1.0 x: 47.544 y: -2.989
w: 1.0 h: 1.0 x: -50.832 y: -0.299
w: 1.0 h: 1.0 x: 52.227 y: -1.096
w: 1.0 h: 1.0 x: -53.92 y: -0.597
w: 1.0 h: 1.0 x: 57.838 y: 14.888
w: 1.0 h: 1.0 x: 0.0 y: -11.22
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: 42.751 y: 2.473
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: -40.278 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.693
w: 1.0 h: 1.0 x: 1.313 y: -9.907
w: 1.0 h: 1.0 x: -104.652 y: -78.762
w: 1.0 h: 1.0 x: 166.874 y: 0.0
w: 1.0 h: 1.0 x: 176.837 y: 0.0

Answer:

The cause of the problems

Your code does not really look for image positions and sizes; it merely happens to find them under favorable circumstances.

Your code only shows a single method without explicit context (which, I presume, is the reason why no one seriously analyzed that code and spotted the issue).

Considering the context (PDFBox, content stream analysis), though, I assume that you created an operator processor class in which you overrode the processOperator method according to the posted code. Furthermore, I assume that you registered your operator processor for the cm instruction with some PDF stream engine and ran that against your sample PDFs.

Given these assumptions it is pretty clear why the output from your operator processor only sometimes contains image size and position but often many unrelated data sets:

The effect of the instruction cm is merely to change the current transformation matrix; it is not immediately or exclusively related to drawing bitmap images!

Confer the PDF specification:

Operands:     a b c d e f
Operator:     cm
Description:  Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(Table 57 – Graphics State Operators – ISO 32000-1)

The only reason why the cm parameters every once in a while do contain image size and position information is that the bitmap drawing operators draw images into a 1x1 area (in user space units) whose lower left corner is the origin. To stretch and move the coordinate system so that this area eventually corresponds to the desired image size and position on the result page, PDF processors modify the current transformation matrix accordingly using the cm instruction before drawing the image, often right before.

If they do so in one step (as quoted above cm concatenates the specified matrix to the CTM, it does not replace it) and don't use rotations or similar niceties, a and d (the first and the fourth cm parameters) indeed contain the size of the image on the page (in default user space units) and e and f (the fifth and the sixth cm parameters) contain the coordinates of its lower left corner.

How to do it correctly

Thus, instead of merely looking at the cm parameters, one has to

  • parse the content stream in question,
  • calculate the concatenation of all matrices applied to the CTM (also keeping track of the effects of intermediary q and Q instructions), and
  • retrieve the values of the current transformation matrix when the Do instruction for a bitmap image resource occurs.
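
The three steps above can be sketched without PDFBox, using only java.awt.geom.AffineTransform. This is a minimal illustration, not the PDFBox implementation: the class CtmTracker and its methods save, restore, cm and imageBounds are hypothetical names, only the CTM part of the graphics state is modelled, and a real parser would still have to feed it the operators found in the content stream.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: tracks the CTM the way a content stream interpreter has to. */
final class CtmTracker {
    private AffineTransform ctm = new AffineTransform();             // identity at page start
    private final Deque<AffineTransform> saved = new ArrayDeque<>(); // graphics state stack

    /** q: save the current graphics state (only the CTM is modelled here). */
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    /** Q: restore the most recently saved graphics state. */
    void restore() {
        ctm = saved.pop();
    }

    /** cm: concatenate the given matrix to the CTM; it does not replace the CTM. */
    void cm(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Do (bitmap image): the unit square mapped through the CTM is the area the image covers. */
    Rectangle2D imageBounds() {
        return ctm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D();
    }

    public static void main(String[] args) {
        CtmTracker tracker = new CtmTracker();
        tracker.save();                        // q
        tracker.cm(1, 0, 0, 1, 100, 200);      // move the origin
        tracker.cm(50, 0, 0, 30, 0, 0);        // scale in a second, separate step
        // Neither cm call alone holds both size and position, but the CTM does:
        // lower left corner (100, 200), size 50 x 30.
        System.out.println(tracker.imageBounds());
        tracker.restore();                     // Q: back to the identity matrix
    }
}
```

The bounding box of the transformed unit square is used instead of reading a, d, e and f directly, so that rotated or mirrored images still give a sensible position and size.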

Fortunately PDFBox already does all the heavy lifting for you under the hood if you let it, cf. the PrintImageLocations example that ships with the PDFBox examples.

Concerning your questions

The coordinates you got for "personal_photo_enhancement.pdf" page 2 were correct as far as the PDF coordinate system is concerned. Probably Photoshop uses a different coordinate system or you inspected the wrong image corner.
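
If the mismatch comes from the coordinate system, the conversion is a one-liner. The following sketch assumes that the image editor measures y downwards from the top edge of the page, and it assumes a page height of 792 points (US Letter); neither is confirmed for the PDF in question, and the class and method names are made up for illustration.

```java
/** Hypothetical helper for comparing PDF coordinates with a top-left based image editor. */
final class PdfToTopLeft {
    /**
     * PDF measures y upwards from the bottom of the page and (x, y) is the lower left
     * corner of the image. Returns the distance from the top of the page to the top
     * edge of the image, which is what a top-left based tool shows as y.
     */
    static double topLeftY(double pageHeight, double y, double height) {
        return pageHeight - (y + height);
    }

    /** Converts user space units (1/72 inch by default) to pixels at the given resolution. */
    static double toPixels(double points, double dpi) {
        return points * dpi / 72.0;
    }

    public static void main(String[] args) {
        double pageHeight = 792;   // assumption: US Letter, use the real media box height instead
        double top = topLeftY(pageHeight, 561.056, 152.64);
        // roughly 78.3 points below the top edge of the page
        System.out.println("top edge at " + top + " pt = " + toPixels(top, 72) + " px at 72 dpi");
    }
}
```

If the page was rasterized at a resolution other than 72 dpi, x, width and height have to be converted with toPixels as well before comparing them with pixel positions.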

You got very many outputs for "example.pdf" page 17 because that PDF uses CTM manipulations not only for sizing and positioning images but for other effects, too, mostly for translating the coordinate system origin. Furthermore, the image on that page is not a bitmap. Thus, it does not have a simple position and size...

Question:

I'm playing around with the 2.0.0-SNAPSHOT, and I want to set the page to landscape and also rotate my picture. So I've done page.setRotation(90);

There seems to be a bug with using PDFBox and AffineTransform

This code doesn't do anything like I'd expect:

AffineTransform at = new AffineTransform(w, 0, 0, h, 20, 20);
at.translate(0.5, 1);
at.rotate(Math.toRadians(90));

Width and Height have to be tiny to keep the image on the page, rotate by itself squishes the image, and translate before rotate seems to scale the image huge.

Is this a bug, or am I just not understanding PDFBox?


Answer:

Don't do an extra translation; instead, include the translation when creating the AffineTransform. Remember that the rotation is around the bottom-left corner of the image, so add the width w to the x-position.

    PDPage page = new PDPage();
    document.addPage(page);
    page.setRotation(90);
    PDPageContentStream contentStream = new PDPageContentStream(document, page);

    int x = 150;
    int y = 300;

    // draw unrotated
    contentStream.drawXObject(ximage, x, y, ximage.getWidth() / 2, ximage.getHeight() / 2);

    // draw 90° rotated, placed on the right of the first image
    AffineTransform at = new AffineTransform(ximage.getHeight() / 2, 0, 0, ximage.getWidth() / 2, x + ximage.getWidth(), y);
    at.rotate(Math.toRadians(90));
    contentStream.drawXObject(ximage, at);

This will draw the image twice, once normally and once rotated 90°, and positioned to the right. "/2" is used to scale 50%, you can of course use another factor. Note that "/2" is not used for the initial x position, because the (scaled) width is needed twice. Once to position to the old position (because of the axis!), and once to move it to the right so that the images don't overlap.

Note also that getHeight() and getWidth() are reversed, for the page rotation.
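
To see why the order of operations matters, it helps to check which area of the page a given matrix maps the image to. The sketch below uses only java.awt.geom; coveredArea is a hypothetical helper, and w = 200, h = 100 and the position (20, 20) are made-up values. Calls such as translate() and rotate() on an AffineTransform are applied inside the already scaled system, which is why a small translation there moves the image by a multiple of its size.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper to inspect where an image matrix places the image. */
final class ImagePlacement {
    /** Images occupy the unit square; the matrix maps that square onto the page. */
    static Rectangle2D coveredArea(AffineTransform at) {
        return at.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D();
    }

    public static void main(String[] args) {
        double w = 200, h = 100;   // made-up display size of the unrotated image

        // The matrix from the question: translate(0.5, 1) moves by 0.5 * w and 1 * h,
        // and the rotated image is still stretched to w x h, i.e. distorted.
        AffineTransform question = new AffineTransform(w, 0, 0, h, 20, 20);
        question.translate(0.5, 1);
        question.rotate(Math.toRadians(90));
        System.out.println(coveredArea(question));   // covers x -80..120, y 120..220

        // Swapped scale factors and the shift put into the constructor:
        // the rotated image is h wide and w high, with its lower left corner at (20, 20).
        AffineTransform fixed = new AffineTransform(h, 0, 0, w, 20 + h, 20);
        fixed.rotate(Math.toRadians(90));
        System.out.println(coveredArea(fixed));      // covers x 20..120, y 20..220
    }
}
```

Here the x-position is shifted by the width of the rotated image (h), so that the covered area starts exactly at x = 20; shifting by more, as in the answer above, additionally moves the image to the right.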

Question:

I am using PDFBox 2.0. While parsing a PDF document, I also want to get the first page as an image and store it in HBase for use in search results (I am going to create a search list page like the search page of amazon.com).

HBase accepts byte[] variable to store(index) a value. I need to convert the image as byte[], then store it to HBase. I have implemented image render, but how can I convert it to byte[]?

        PDDocument document = PDDocument.load(file, "");
        BufferedImage image = null;
        try {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            if (document.isEncrypted()) {
                try {
                    System.out.println("Trying to decrypt...");
                    document.setAllSecurityToBeRemoved(true);
                    System.out.println("The file has been decrypted in .");
                }
                catch (Exception e) {
                    throw new Exception("cannot be decrypted. ", e);
                }
            }
            PDPage firstPage = (PDPage) document.getDocumentCatalog().getPages().get(0);
            pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB);
               // 0 means first page.

            image = pdfRenderer.renderImageWithDPI(0, 300, ImageType.RGB);                  
            document.close();

    } catch (Exception e) {
            e.printStackTrace();
    } 

If I write ImageIOUtil.writeImage(image, fileName + ".jpg", 300); right above document.close();, the program creates a jpg file in the project path. I need to put it in a byte[] array instead of creating a file. Is it possible?


Answer:

This can be done with ImageIO.write(RenderedImage, String, OutputStream), which can write to an arbitrary OutputStream rather than to disk. A ByteArrayOutputStream collects the output bytes in an array in memory.

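import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import javax.imageio.ImageIO;
...
// example image (RGB, because JPEG has no alpha channel and
// newer JDKs may refuse to write ARGB images as JPEG)
BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);

// to array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ImageIO.write(image, "jpg", bos);
byte[] output = bos.toByteArray();
System.out.println(Arrays.toString(output));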
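
ImageIO.write returns false instead of throwing when no writer can handle the image, which would silently leave you with an empty array. A slightly more defensive variant could look like the following sketch; ImageBytes, encode and decode are hypothetical names, and decode only shows how the stored bytes can be turned back into an image later.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Hypothetical helper for storing rendered pages as byte arrays. */
final class ImageBytes {
    /** Encodes the image in the given ImageIO format ("jpg", "png", ...) in memory. */
    static byte[] encode(BufferedImage image, String format) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // write() returns false if no appropriate writer was found
        if (!ImageIO.write(image, format, bos)) {
            throw new IOException("No ImageIO writer for format '" + format + "' and this image type");
        }
        return bos.toByteArray();
    }

    /** Decodes bytes produced by encode() back into an image. */
    static BufferedImage decode(byte[] bytes) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(bytes));
    }
}
```

With the rendering code from the question, byte[] value = ImageBytes.encode(image, "jpg"); would then be the value to store.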

Question:

My aim is it to draw an uploaded image of which I do not know the dimensions on a PDF file with one empty page (DIN A4). For horizontal images I have a PDF file with one horizontal empty page and for vertical images I have a PDF file with one vertical page.

This is my code so far:

File image = convertMultipartFileToFile(file); //I get a MultipartFile from my RequestParam (Spring) - converting works fine
BufferedImage awtImage = ImageIO.read(image);

String path = "";

if (awtImage.getWidth() > awtImage.getHeight()) {
    path = MyController.class.getResource("/pdf4ImageUploadHorizontal.pdf").getPath();
} else {
    path = MyController.class.getResource("/pdf4ImageUploadVertical.pdf").getPath();
}

pdf = new File(path);
PDDocument doc = PDDocument.load(pdf);
PDPage page = doc.getPage(0);
int actualPDFWidth = 0;
int actualPDFHeight = 0;
if (awtImage.getWidth() > awtImage.getHeight()) {

    actualPDFWidth = (int) PDRectangle.A4.getHeight();
    actualPDFHeight = (int) PDRectangle.A4.getWidth();
} else {
    actualPDFWidth = (int) PDRectangle.A4.getWidth();
    actualPDFHeight = (int) PDRectangle.A4.getHeight();
}

// Add image to page
PDImageXObject pdImage = PDImageXObject.createFromFileByContent(image, doc);

Dimension scaledDim = getScaledDimension(new Dimension(pdImage.getWidth(), pdImage.getHeight()), new Dimension(actualPDFWidth, actualPDFHeight)); // I'm using this function: https://stackoverflow.com/questions/23223716/scaled-image-blurry-in-pdfbox

PDPageContentStream contentStream = new PDPageContentStream(doc, page);

contentStream.drawImage(pdImage, 0, 0, scaledDim.width, scaledDim.height);
contentStream.close();
doc.save("c:\\xyz\\pdf.pdf");

For vertical images everything works fine (I would prefer the images to be centered on the page but that'd be the next step).

Problem is with horizontal images: instead of my uploaded horizontal image filling the complete horizontal pdf page I get a horizontal pdf page with my image on the left being rotated 90° to the right and fitting from top to bottom (scaling worked but not the way I hoped for):

My wish is to insert uploaded horizontal or vertical pictures correctly, without rotation, into the intended PDF page.


Answer:

I have now found a solution... it is surely not the most elegant one, but it works. The result I get is my vertical or horizontal image, scaled down to the A4 format if necessary and centered on the page. My code:

File image = convertMultipartFileToFile(file);
BufferedImage awtImage = ImageIO.read(image);

// check if horizontal or vertical
Boolean isHorizontal = false;
if (awtImage.getWidth() > awtImage.getHeight()) {
    isHorizontal = true;
}
String path = "";

// get actual height and width of pdf page 'cause pdfbox sees page always as vertical and only saves the rotation   
// ....-------------------
// ...|...................|
// ...|.........A4........|...0.x
// ...|......PDF.page.....|..0y-|----------------------------
// ...|......vertical.....|.....|............A4..............|
// ...|...._________......|.....|.........PDF.page...........|
// ...|...(.........).....|.....|........horizontal..........|
// ...|...(..image..).....|.....|...._______________.........|
// ...|...(.........).....|.....|...(...............)........|
// ...|...(.........).....|.....|...(....image......)........|
// ...|...(.........).....|.....|...(_______________)........|
// ...|...(_________).....|.....|----------------------------
// 0x-|-------------------
// ..0y
int actualPDFWidth = 0;
int actualPDFHeight = 0;
if (isHorizontal) {
    actualPDFWidth = (int) PDRectangle.A4.getHeight();
    actualPDFHeight = (int) PDRectangle.A4.getWidth();
    path = MyController.class.getResource("/pdf4ImageUploadHorizontal.pdf").getPath();
} else {
    actualPDFWidth = (int) PDRectangle.A4.getWidth();
    actualPDFHeight = (int) PDRectangle.A4.getHeight();
    path = MyController.class.getResource("/pdf4ImageUploadVertical.pdf").getPath();
}

pdf = new File(path);
PDDocument doc = PDDocument.load(pdf);
PDPage page = doc.getPage(0);

PDImageXObject pdImage = PDImageXObject.createFromFileByContent(image, doc);
PDPageContentStream contentStream = new PDPageContentStream(doc, page);

// scale image
Dimension scaledDim = getScaledDimension(new Dimension(pdImage.getWidth(), pdImage.getHeight()), new Dimension(actualPDFWidth, actualPDFHeight)); // I'm using this function: https://stackoverflow.com/questions/23223716/scaled-image-blurry-in-pdfbox

// if horizontal rotate 90°, calculate position and draw on page
if (isHorizontal) {
    int x = (int) PDRectangle.A4.getWidth() - (((int) PDRectangle.A4.getWidth() - scaledDim.height) /2);
    int y = ((int) PDRectangle.A4.getHeight() - scaledDim.width) / 2;
    AffineTransform at = new AffineTransform(scaledDim.getHeight(), 0, 0, scaledDim.getWidth(), x, y);
    at.rotate(Math.toRadians(90));
    Matrix m = new Matrix(at);
    contentStream.drawImage(pdImage, m);
} else {
    int x = ((int) PDRectangle.A4.getWidth() - scaledDim.width) / 2;
    int y = ((int) PDRectangle.A4.getHeight() - scaledDim.height) / 2;
    contentStream.drawImage(pdImage, x, y, scaledDim.width, scaledDim.height);
}

contentStream.close();
doc.save("c:\\xyz\\pdf.pdf");           
doc.close();

Please correct me if there's something wrong.
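
The scale-and-center arithmetic used above can be checked on its own, without PDFBox. The sketch below is only a stand-in for the linked getScaledDimension function, whose exact behaviour is not shown here: fit and center are hypothetical names, and this version never scales up and rounds to whole units.

```java
import java.awt.Dimension;
import java.awt.Point;

/** Hypothetical stand-in for the scaling and centering arithmetic. */
final class FitAndCenter {
    /** Scales the image down (never up) so that it fits inside bounds, keeping the aspect ratio. */
    static Dimension fit(Dimension image, Dimension bounds) {
        double scale = Math.min(1.0, Math.min(
                bounds.getWidth() / image.getWidth(),
                bounds.getHeight() / image.getHeight()));
        return new Dimension(
                (int) Math.round(image.getWidth() * scale),
                (int) Math.round(image.getHeight() * scale));
    }

    /** Lower left corner that centers an item of the given size inside bounds. */
    static Point center(Dimension item, Dimension bounds) {
        return new Point((bounds.width - item.width) / 2, (bounds.height - item.height) / 2);
    }

    public static void main(String[] args) {
        Dimension landscapeA4 = new Dimension(841, 595);   // A4 in points, truncated to int
        Dimension scaled = fit(new Dimension(1600, 1200), landscapeA4);
        Point origin = center(scaled, landscapeA4);
        System.out.println(scaled.width + " x " + scaled.height + " at " + origin.x + ", " + origin.y);
    }
}
```

For the rotated (horizontal) case the resulting width and height simply swap roles, as in the code above.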

Question:

I am using this code: https://www.tutorialspoint.com/pdfbox/pdfbox_inserting_image.htm

I am using it to add an image to an existing PDF. The problem is that the file it creates is a blank page with only the image on it.

Here is my code:

public void signPDF(PdfDTO pdfDTO) throws IOException{
        //Loading an existing document
        File file = new File(getAbsolutePdfPath(pdfDTO));
        PDDocument doc = PDDocument.load(file);

        //Retrieving the page
        PDPage page = doc.getPage(0);

        //a test to ensure the doc is loading correctly
        PDDocument testDoc = new PDDocument();
        testDoc.addPage(page);
        testDoc.save("C:" + File.separator + "Users" + File.separator + "kdotson" + File.separator + "Documents" + File.separator + "test.pdf");
        testDoc.close(); //this file is good so I know the doc is loading correctly

        //Creating PDImageXObject object
        PDImageXObject pdImage = PDImageXObject.createFromFile("C://test_images/signature.pdf", doc);

        //creating the PDPageContentStream object
        PDPageContentStream contents = new PDPageContentStream(doc, page);

        //Drawing the image in the PDF document
        contents.drawImage(pdImage, 0, 0);

        //Closing the PDPageContentStream object
        contents.close();

        //Saving the document
        doc.save(new File(getSignedPdfLocation(pdfDTO))); //the created file has the image on it, so I know the image is loading correctly

        //Closing the document
        doc.close();
    }

As far as I can tell, what I'm doing should work, and I don't get any errors, so what gives?


Answer:

Please also have a look at the JavaDocs and sources of the library you try to work with. You create a PDPageContentStream:

PDPageContentStream contents = new PDPageContentStream(doc, page);

This constructor is documented to overwrite all existing content streams of this page:

/**
 * Create a new PDPage content stream. This constructor overwrites all existing content streams
 * of this page.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage) throws IOException

Thus, you have to use a different constructor which keeps the current page contents, e.g.

/**
 * Create a new PDPage content stream.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @param resetContext Tell if the graphic context should be reset. This is only relevant when
 * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
 * appending to an existing stream, because the existing stream may have changed graphic
 * properties (e.g. scaling, rotation).
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress, boolean resetContext) throws IOException

Thus

PDPageContentStream contents = new PDPageContentStream(doc, page, AppendMode.APPEND, true, true);

should make your code work as desired.

Alternatively, if you want the image in the background, try

PDPageContentStream contents = new PDPageContentStream(doc, page, AppendMode.PREPEND, true, true);

Beware, though, in certain cases the image won't be visible in the background, e.g. if the existing content starts with an instruction to fill the whole page area in white. In such a case watermarks must be applied with some kind of transparency atop existing content.

Question:

I've added images to a document using code as supplied by Nick Russler in answer to another question here https://stackoverflow.com/a/20618152/4652269

/**
 * Draw an image to the specified coordinates onto a single page. <br>
 * Also scaled the image with the specified factor.
 * 
 * @author Nick Russler
 * @param document PDF document the image should be written to.
 * @param pdfpage Page number of the page in which the image should be written to.
 * @param x X coordinate on the page where the left bottom corner of the image should be located. Regard that 0 is the left bottom of the pdf page.
 * @param y Y coordinate on the page where the left bottom corner of the image should be located.
 * @param scale Factor used to resize the image.
 * @param imageFilePath Filepath of the image that is written to the PDF.
 * @throws IOException
 */
public static void addImageToPage(PDDocument document, int pdfpage, int x, int y, float scale, String imageFilePath) throws IOException {   
    // Convert the image to TYPE_4BYTE_ABGR so PDFBox won't throw exceptions (e.g. for transparent png's).
    BufferedImage tmp_image = ImageIO.read(new File(imageFilePath));
    BufferedImage image = new BufferedImage(tmp_image.getWidth(), tmp_image.getHeight(), BufferedImage.TYPE_4BYTE_ABGR);        
    image.createGraphics().drawRenderedImage(tmp_image, null);

    PDXObjectImage ximage = new PDPixelMap(document, image);

    PDPage page = (PDPage)document.getDocumentCatalog().getAllPages().get(pdfpage);

    PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
    contentStream.drawXObject(ximage, x, y, ximage.getWidth()*scale, ximage.getHeight()*scale);
    contentStream.close();
}

Basically the image is added to the PDF page via an XObjectImage; however, I am finding that the same code gets different results depending on the PDF being used. My guess is there seems to be some scale or transform in play but I can't work out where to find or correct this.

The page reports (from the MediaBox PDRectangle) that it is approximately 600x800 page units, but when I place my 500px image it displays differently depending on the PDF in use. In one PDF it comes out at the width of the page (this is a generated PDF, i.e. text and objects etc.). In another PDF the image is about half to a third of the width (this PDF is a scanned A4 TIF image on a PDF page; the image is about 1700x2300px, which lines up with the ratio of shrinking that is occurring to my image). Finally, in another PDF with a TIF image on the page, my added image also gets rotated through 90 degrees.

It appears obvious to me that I need to add or modify a transform - that the page has a default - or is remembering the last transform used, all I want is 1:1 ratio and 0 degrees rotation, but I don't know how to do this?

I've read about Matrix and AffineTransformations - but it's not making a lot of sense to me.

Is there a way to set the document or the drawXObject to be a very 1:1 scale with 0 degrees rotation?


Answer:

My guess is there seems to be some scale or transform in play but I can't work out where to find or correct this.

Yes, your code

PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);

adds a new content stream at the end of the list of content streams of the page as is. This implies that it starts with the graphics state the formerly last stream ended with.

Some tools create content streams which end in the same state as they start in but this is not a requirement imposed by the PDF specification.

To make sure that your additions start with the default graphics state, you have to envelop the existing content in a pair of operators q...Q which save and restore the graphics state.

Fortunately PDFBox already does this for you if you use a different PDPageContentStream constructor, the one with three boolean parameters, and use true as value for the additional parameter:

PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true, true);

Question:

I am using PDFBox to generate reports in Java. One of my requirements is to create a PDF document which contains the company logo at the top of the page. I am not able to find the way to accomplish that.

I have the following method in a Java class:

public void createPdf() {   

        PDDocument document = null;

        PDPage page = null;

        ServletContext servletContext = (ServletContext) FacesContext
                .getCurrentInstance().getExternalContext().getContext();

        try {

            File f = new File("Afiliado_2.pdf");

            if (f.exists() && !f.isDirectory()) {
                document = PDDocument.load(new File("Afiliado_2.pdf"));

                page = document.getPage(0);
            } else {

                document = new PDDocument();

                page = new PDPage();

                document.addPage(page);
            }

            PDImageXObject pdImage = PDImageXObject.createFromFile(
                    servletContext.getRealPath("/resources/images/logo.jpg"),
                    document);

            PDPageContentStream contentStream = new PDPageContentStream(
                    document, page, AppendMode.APPEND, true);


            contentStream.drawImage(pdImage, 0, 0);

            // Make sure that the content stream is closed:
            contentStream.close();

            // Save the results and ensure that the document is properly closed:
            document.save("Afiliado_2.pdf");
            document.close();

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

The image is currently appearing at the bottom of the PDF. I know the line I need to modify is contentStream.drawImage(pdImage, 0, 0); but what coordinates do I need to specify so that it appears at the top of the page?


Answer:

Typically the coordinate system for a page in PDF starts at the lower left corner. So with

contentStream.drawImage(pdImage, 0, 0);

you are drawing your image at that point. You can get the boundaries of your page using

page.getMediaBox();

and use that to position your image e.g.

PDRectangle mediaBox = page.getMediaBox();

// draw with the starting point (the lower left corner of the image)
// 1 inch from the left and 2 inches from the top of the page
contentStream.drawImage(pdImage, 72, mediaBox.getHeight() - 2 * 72);

where PDF files normally specify 72 points to 1 physical inch.
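
Note that the starting point is the lower left corner of the image, so the image extends upwards from there. To place the top edge of the logo at a given distance from the top of the page, the image height has to be subtracted as well. A minimal sketch of that arithmetic, with hypothetical names and a made-up page and image size:

```java
/** Hypothetical helper for positioning an image relative to the top of the page. */
final class TopAnchoredPosition {
    static final float POINTS_PER_INCH = 72f;

    /** y of the image's lower left corner so that its top edge is topMargin below the page top. */
    static float yForTopMargin(float pageHeight, float imageHeight, float topMargin) {
        return pageHeight - topMargin - imageHeight;
    }

    public static void main(String[] args) {
        float pageHeight = 792f;    // made-up value, use mediaBox.getHeight()
        float imageHeight = 100f;   // made-up value, the height the image is drawn with
        System.out.println(yForTopMargin(pageHeight, imageHeight, POINTS_PER_INCH));   // 620.0
    }
}
```

If the page's media box does not start at 0, its lower left y coordinate has to be added to the result.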

Question:

I am using PDFBox to extract the images from my pdf (which contains only jpg's).

Since I will save those images inside my database, I would like to directly convert each image to an InputStream object first without placing the file temporarily on my file system. I am facing difficulties with this, however. I think it has to do with the use of image.getPDStream().createInputStream() as I did in the following example:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().createInputStream(); //this gives me a corrupt file
    byte[] buffer = new byte[1024];
    while (is.read(buffer) > 0) {
        output.write(buffer);
    }
}

However this works:

while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map<String, PDXObject> images = resources.getXObjects();
        if (images != null) {
            Iterator<?> imageIter = images.keySet().iterator();
            while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            image.write2file(new File("C:\\Users\\Anton\\Documents\\lol\\test.jpg")); //this works however
        }
    }
}

Any idea how I can convert each PDXObjectImage (or any other object I can get) to an inputstream?


Answer:

In PDFBox 1.8, the easiest way is to use write2OutputStream(), so your first code block would now look like this:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    image.write2OutputStream(output);
    output.close();
}

A more advanced solution, as long as you're really sure you have only JPEGs that display properly, i.e. have no unusual colorspace:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    // DCT_FILTERS: the list of DCT filter names at which decoding stops (defined elsewhere)
    InputStream is = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
    byte[] buffer = new byte[1024];
    int n;
    // write only the bytes actually read, not the whole buffer
    while ((n = is.read(buffer)) != -1) {
        output.write(buffer, 0, n);
    }
    is.close();
    output.close();
}

The second solution removes all filters except the DCT (= JPEG) filter. Some older PDFs have several filters, e.g. ascii85 and DCT.
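
Independently of which stream is read, the copy loop in the question writes the complete buffer on every pass, even when read() filled only part of it, which appends stale bytes to the output. The sketch below shows a loop that honours the number of bytes read and how to get an InputStream without a temporary file; StreamCopy, copy and toInputStream are hypothetical names and only java.io is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper for copying image data in memory. */
final class StreamCopy {
    /** Copies everything from in to out, writing only the bytes each read() returned. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Wraps the bytes collected so far in an InputStream, e.g. for a database API. */
    static InputStream toInputStream(ByteArrayOutputStream collected) {
        return new ByteArrayInputStream(collected.toByteArray());
    }
}
```

Combined with the first solution, passing a ByteArrayOutputStream to image.write2OutputStream(...) and then calling StreamCopy.toInputStream on it should give the InputStream asked for, without touching the file system.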

Now even if you created the image with JPEGs, you don't know what your PDF creation software did. One way to find out what type of image it is, is to check what class it is (use instanceof):

- PDPixelMap => PNG
- PDJpeg => JPEG
- PDCcitt => TIF

Another way is to use image.getSuffix().

Question:

I have been using PDFBox to generate PDF files and was wondering if it is possible to add a border around images. If not, is there some algorithm that allows you to efficiently draw lines precisely around the image? I have the following code that allows me to add an image to a PDF page:

//image for page 2
public File processPDF()
{
    //creating pdf
    PDDocument document = new PDDocument();
    File file = new File("NWProofReference.pdf");

    //adding first page to pdf, blank
    PDPage page = new PDPage();
    document.addPage(page);
    PDPageContentStream contentStream;

    try {
            contentStream = new PDPageContentStream(document, page);
            BufferedImage awtImage = ImageIO.read(new File(PDFProcessing.image));
            PDXObjectImage ximage = new PDPixelMap(document, awtImage);
            float scale = 1.0f; // alter this value to set the image size
            contentStream.drawXObject(ximage, 100, 400,
                    ximage.getWidth()*scale, ximage.getHeight()*scale);
            contentStream.close();

            document.save(file);
            document.close();
        } catch (Exception e)
        {
            e.printStackTrace();
        }

    return file;
}

Using this or any code, is there any way to actually add a border around the image itself that is made available through the PDFBox API?


Answer:

Here's some code that adds a red border:

        BufferedImage awtImage = ImageIO.read(new File(PDFProcessing.image));
        PDXObjectImage ximage = new PDPixelMap(document, awtImage);
        float scale = 1.0f; // alter this value to set the image size
        contentStream.drawXObject(ximage,100,400,ximage.getWidth()*scale,ximage.getHeight()*scale);
        // these three lines are new
        contentStream.setStrokingColor(Color.red);
        contentStream.addRect(100-3, 400-3, ximage.getWidth()*scale+6, ximage.getHeight()*scale+6);
        contentStream.closeAndStroke();

        contentStream.close();

good luck! You can of course change the "3" to a smaller number.
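The rectangle arithmetic in the answer above (offset by 3 on each side, so 6 is added to each dimension) can be pulled out into a small helper. This is only a sketch in plain Java; the class name ImageBorder and the method borderRect are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

final class ImageBorder {
    /**
     * Hypothetical helper: returns the rectangle surrounding an image drawn at
     * (x, y) with the given displayed size, grown by 'padding' on every side.
     */
    static Rectangle2D.Float borderRect(float x, float y, float width, float height, float padding) {
        return new Rectangle2D.Float(x - padding, y - padding,
                width + 2 * padding, height + 2 * padding);
    }

    public static void main(String[] args) {
        // Sample values: an image drawn at (100, 400) with a displayed size of 200 x 150.
        Rectangle2D.Float r = borderRect(100f, 400f, 200f, 150f, 3f);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
    }
}
```

The four values of the resulting rectangle are what would be passed to addRect before stroking.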

Question:

I load my image to the pdf in the following way:

PDImageXObject image= PDImageXObject.createFromFile(<image_path>, doc);
contentStream.drawImage(image, 15, pdfData.getPageHeight() - 80,
image.getWidth(), image.getHeight());

I'm trying to make the image look transparent, like it would look in the header of a document (Google Docs, Word, etc.). Is there an easy way to do this?


Answer:

Use an extended graphics state:

contentStream.saveGraphicsState();
PDExtendedGraphicsState pdExtGfxState = new PDExtendedGraphicsState();
// pdExtGfxState.setBlendMode(BlendMode.MULTIPLY) doesn't work yet, maybe in a later version
pdExtGfxState.getCOSObject().setItem(COSName.BM, COSName.MULTIPLY);
pdExtGfxState.setNonStrokingAlphaConstant(0.5f);
contentStream.setGraphicsStateParameters(pdExtGfxState);
// do your stuff, e.g. the drawImage call
contentStream.restoreGraphicsState();

Question:

I'm building a tool to compress PDF files, and using pdfbox. I have some images with the DCTDecode + FlateDecode filter and I'd like to experiment with the JPXDecode filter to see if it occupies less space.

I've seen some code using iText, but how can I do it with pdfbox? I've found no documentation on how to do so.


Answer:

This code replaces the image stream without having to alter COSWriter (which sounds scary). However, with the PDF I tried, the encoded image was incorrect, i.e. there seems to be a bug in the JPEG 2000 encoder, so check your result PDFs.

public class SO57972743
{
    public static void main(String[] args) throws IOException
    {
        System.out.println("supported formats: " + Arrays.toString(ImageIO.getReaderFormatNames()));

        try (PDDocument doc = PDDocument.load(new File("test.pdf")))
        {
            // get 1st level images only here (there may be more in form XObjects!)
            PDResources res = doc.getPage(0).getResources();
            for (COSName name : res.getXObjectNames())
            {
                PDXObject xObject = res.getXObject(name);
                if (xObject instanceof PDImageXObject)
                {
                    replaceImageWithJPX(xObject);
                }
            }
            doc.save("test-result.pdf");
        }
    }

    private static void replaceImageWithJPX(PDXObject xObject) throws IOException
    {
        PDImageXObject img = (PDImageXObject) xObject;
        BufferedImage bim = img.getOpaqueImage(); // the mask (if there) won't be touched
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        boolean written = ImageIO.write(bim, "JPEG2000", baos);
        if (!written)
        {
            System.err.println("write failed");
            return;
        }
        // replace image stream
        try (OutputStream os = img.getCOSObject().createRawOutputStream())
        {
            os.write(baos.toByteArray());
        }
        img.getCOSObject().setItem(COSName.FILTER, COSName.JPX_DECODE); // replace filter
        img.getCOSObject().removeItem(COSName.COLORSPACE); // use the colorspace in the image itself
    }
}

Question:

As discussed in this question (Wrap image to Jframe), I need a JFrame to match the exact provided image (the image itself was originally a PDF that has been converted to an image).

The solution provided does indeed build a JFrame to my image dimensions, but I can't actually see all of the image. I need to be able to resize the JFrame, with the image dynamically adjusting to the new JFrame size. Failing that, if I could just scroll the JFrame or zoom in and out, I could at least get to the parts of the image that I currently cannot see.

The reason I need this is that, within my code, I have an option to draw a Rectangle2D against the image, and the code prints the coordinates as java.awt.geom.Rectangle2D$Float[x,y,w,h].

I will then use these coordinates to extract the region from the original PDF using the PDFTextStripperByArea class from Apache PDFBox. PDFTextStripperByArea takes its input as Rectangle2D measurements. Hence, the image and the JFrame must always be the same size in order to retrieve accurate coordinates.

Can anybody help?


Answer:

To wrap the label with a scroll pane you could implement the following changes:

    //frame.setLayout(new FlowLayout()); - comment out - use the default (BorderLayout)
    JLabel lbl= new JLabel();
    lbl.setIcon(icon);
    JScrollPane jsp = new JScrollPane(lbl); //wrap the label with a scroll pane
    //frame.add(lbl); 
    frame.add(jsp); //add the scroll pane to the frame instead of lbl

You can find more information here.

Question:

I have this method which can extract text in a specific location in the pdf

public static void getTextByRectangle(PDDocument doc,Rectangle rect) throws IOException{
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    stripper.addRegion( "class1", rect );
    PDPage firstPage = doc.getPage(0);
    stripper.extractRegions( firstPage );
    System.out.println( "Text in the area:" + rect );
    System.out.println( stripper.getTextForRegion( "class1" ) );
}

Is it possible to do the same thing but for extracting images??


Answer:

Yes, you can extract all images and compare the position of each image with rect. PDFBox ships an example for this that reports image locations.

  1. Create a class that extends PDFStreamEngine, like this:

    public class PrintImageLocations extends PDFStreamEngine

  2. Override processOperator. From ctmNew you can get the image location; compare it with your rect to find the right image.

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands)  throws IOException {
        String operation = operator.getName();
        if ("Do".equals(operation)) {
            COSName objectName = (COSName) operands.get(0);
            PDXObject xobject = getResources().getXObject(objectName);
            if (xobject instanceof PDImageXObject) {
                 PDImageXObject image = (PDImageXObject) xobject;
                 Matrix ctmNew = getGraphicsState().getCurrentTransformationMatrix();
                 float imageXScale = ctmNew.getScalingFactorX();
                 float imageYScale = ctmNew.getScalingFactorY();
                 // position in user space units. 1 unit = 1/72 inch at 72 dpi
                 System.out.println("position in PDF = " + ctmNew.getTranslateX() + ", " + ctmNew.getTranslateY() + " in user space units");
                 // displayed size in user space units
                 System.out.println("displayed size  = " + imageXScale + ", " + imageYScale + " in user space units");
            } else if (xobject instanceof PDFormXObject) {
                 PDFormXObject form = (PDFormXObject) xobject;
                 showForm(form);
            }
        } else {
            super.processOperator(operator, operands);
        }
    }
    

Thanks to mkl and FiReTiTi for their advice.
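The comparison step itself can be sketched in plain Java. The sketch below assumes an unrotated image with positive scaling, a page whose lower left corner is at (0, 0), and a region expressed with a top-left origin (y growing downwards), which is how I understand PDFTextStripperByArea regions to work. The class and method names are hypothetical, and the numbers in main are sample values, including the page height of 792.

```java
import java.awt.geom.Rectangle2D;

final class ImageRegionMatch {
    /**
     * Converts an image placement reported in PDF user space (origin at the
     * bottom left, y up) into a rectangle with a top-left origin (y down).
     * translateX/translateY and the displayed size are the values printed
     * from the current transformation matrix.
     */
    static Rectangle2D.Float toTopLeftRect(float translateX, float translateY,
                                           float displayedWidth, float displayedHeight,
                                           float pageHeight) {
        float top = pageHeight - (translateY + displayedHeight);
        return new Rectangle2D.Float(translateX, top, displayedWidth, displayedHeight);
    }

    /** True if the image rectangle overlaps the region of interest. */
    static boolean overlaps(Rectangle2D region, Rectangle2D image) {
        return region.intersects(image);
    }

    public static void main(String[] args) {
        Rectangle2D.Float image = toTopLeftRect(71.5168f, 561.056f, 503.88f, 152.64f, 792f);
        Rectangle2D.Float region = new Rectangle2D.Float(70f, 70f, 100f, 100f);
        System.out.println("image rect = " + image + ", overlaps region: " + overlaps(region, image));
    }
}
```

For rotated or skewed images the four transformed corners would have to be considered instead of just translation and scale.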

Question:

I have a JPG file of 2480 x 3508 pixels, which is the suitable size for A4. I need to put this file in a PDF.

PDDocument doc = new PDDocument();
ByteArrayOutputStream os = new ByteArrayOutputStream();
InputStream certificate = getClass().getResourceAsStream("certificate.jpg");
BufferedImage bi = ImageIO.read(certificate);
PDPage page = new PDPage(PDRectangle.A4);//<<---- A4
doc.addPage(page);

PDImageXObject pdImageXObject = LosslessFactory.createFromImage(doc, bi);
PDPageContentStream contentStream = new PDPageContentStream(doc, page, 
PDPageContentStream.AppendMode.APPEND, false);
contentStream.drawImage(pdImageXObject, 0, -10);
contentStream.close();
doc.save( "c://appfiles//PDF_image.pdf" );
doc.close();

The problem is that the generated file is totally off and not fitting the A4 size in the PDF.

The source file: https://i.stack.imgur.com/a0ZHG.jpg The generated File: https://www.dropbox.com/s/ufo3246b6eoz3f5/PDF_image.pdf?dl=1

I know I can play with the width and height but then the printing quality drops and I think the PDRectangle.A4 was intended to prevent these kind of manipulations.

How can I make the 2480 x 3508 pixels fit to PDRectangle.A4 pdf page?

Thanks


Answer:

PDF dimensions are in units of 1/72 inch, i.e. 72 dpi:

System.out.println(PDRectangle.A4); // output is [0.0,0.0,595.27563,841.8898]

Your image is 300 dpi, so you have to scale:

contentStream.drawImage(pdImageXObject, 0f, -10f, 
        pdImageXObject.getWidth() / 300f * 72, 
        pdImageXObject.getHeight() / 300f * 72);

I also recommend using JPEGFactory.createFromStream(), which is faster, produces smaller files, and uses the JPEG stream directly. Your result PDF file is then 580 KB instead of 2555 KB.
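The scaling arithmetic can be checked on its own. The following is a minimal sketch in plain Java; the class name DpiScaling and the method pixelsToPoints are made up for illustration, and the 300 dpi figure is taken from the answer above.

```java
final class DpiScaling {
    static final float POINTS_PER_INCH = 72f;

    /** Converts a pixel count at the given resolution into PDF user space units (1/72 inch). */
    static float pixelsToPoints(int pixels, float dpi) {
        return pixels / dpi * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // 2480 x 3508 pixels at 300 dpi comes out at roughly 595.2 x 841.9 units,
        // which is close to the A4 rectangle printed above.
        System.out.println(pixelsToPoints(2480, 300f) + " x " + pixelsToPoints(3508, 300f));
    }
}
```

The two results are the width and height to pass to drawImage, so the image keeps all of its pixels and is only displayed smaller.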

Question:

I want to convert some images to a PDDocument object without saving it to disk. How can I get an input stream from this PDDocument object? I wrote the code below and got the error "Create InputStream called without data being written before to stream."

The relevant part of the source is:

public ByteArrayOutputStream imagesToPdf(final List<ImageEntity> images) throws IOException {

    final PDDocument doc = new PDDocument();

    final int count = images.size();
    InputStream in = null;
    PDPageContentStream contentStream = null;
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {

        for(int i = 0; i < count; i ++) {

            final ImageEntity image = images.get(i);

            byte[] byteCode = image.getByteCode();

            in = new ByteArrayInputStream(byteCode);

            BufferedImage bi = ImageIO.read(in);
            float width = bi.getWidth();
            float height = bi.getHeight();
            PDPage page = new PDPage(new PDRectangle(width, height));
            doc.addPage(page); 

            PDImageXObject pdImage = PDImageXObject.createFromByteArray(doc, byteCode, null);
            contentStream = new PDPageContentStream(doc, page, AppendMode.APPEND, true, true);

            float scale = 1f;
            contentStream.drawImage(pdImage, 0, 0, pdImage.getWidth()*scale, pdImage.getHeight()*scale);

            IOUtils.closeQuietly(contentStream);
            IOUtils.closeQuietly(in);
        }
        PDStream ps = new PDStream(doc);
        InputStream is = ps.createInputStream();
        IOUtils.copy(is, baos);
        return baos;

    } finally {

        IOUtils.closeQuietly(contentStream);
        IOUtils.closeQuietly(in);
    }
}

Answer:

new PDStream(doc)

does not create an object from which the doc can be retrieved in serialized form as you assume. What it actually does is create a PDF stream object belonging to the document doc.

What you want to do is simply

doc.save(baos);

Question:

I want only the images, at their respective positions as in the PDF and with its exact layout, but I don't want the text to be rendered. Is there any way to do that? Currently I am working as follows, but the text is rendered too:

             File sourceFile=new File(pdfFile);
             String fileName = sourceFile.getName().replace(".pdf", "");             
             int pageNumber = 1;
             for (PDPage page : li) 
             {
                 BufferedImage image = page.convertToImage();
                 File outputfile = new File(imgDes + fileName +"_"+ pageNumber +".png");
                 System.out.println("Image Created -> "+ outputfile.getName());
                 ImageIO.write(image, "png", outputfile);
                 pageNumber++;
             }

Answer:

Derive a class from PageDrawer, override all methods that don't deal with images with empty implementations, and then call drawPage(). I only overrode processTextPosition() and didn't bother with lines, shapes, etc., but I think it is clear what I mean.

public class MyPageDrawer extends PageDrawer
{

    public MyPageDrawer() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
    }

    // taken from PDPage.convertToImage, with extra parameter and one modification
    static BufferedImage convertToImage(PDPage page, int imageType, int resolution) throws IOException
    {
        final Color TRANSPARENT_WHITE = new Color(255, 255, 255, 0);
        final int DEFAULT_USER_SPACE_UNIT_DPI = 72;

        PDRectangle cropBox = page.findCropBox();
        float widthPt = cropBox.getWidth();
        float heightPt = cropBox.getHeight();
        float scaling = resolution / (float) DEFAULT_USER_SPACE_UNIT_DPI;
        int widthPx = Math.round(widthPt * scaling);
        int heightPx = Math.round(heightPt * scaling);
        Dimension pageDimension = new Dimension((int) widthPt, (int) heightPt);
        int rotationAngle = page.findRotation();
        // normalize the rotation angle
        if (rotationAngle < 0)
        {
            rotationAngle += 360;
        }
        else if (rotationAngle >= 360)
        {
            rotationAngle -= 360;
        }
        // swap width and height
        BufferedImage retval;
        if (rotationAngle == 90 || rotationAngle == 270)
        {
            retval = new BufferedImage(heightPx, widthPx, imageType);
        }
        else
        {
            retval = new BufferedImage(widthPx, heightPx, imageType);
        }
        Graphics2D graphics = (Graphics2D) retval.getGraphics();
        graphics.setBackground(TRANSPARENT_WHITE);
        graphics.clearRect(0, 0, retval.getWidth(), retval.getHeight());
        if (rotationAngle != 0)
        {
            int translateX = 0;
            int translateY = 0;
            switch (rotationAngle)
            {
                case 90:
                    translateX = retval.getWidth();
                    break;
                case 270:
                    translateY = retval.getHeight();
                    break;
                case 180:
                    translateX = retval.getWidth();
                    translateY = retval.getHeight();
                    break;
                default:
                    break;
            }
            graphics.translate(translateX, translateY);
            graphics.rotate((float) Math.toRadians(rotationAngle));
        }
        graphics.scale(scaling, scaling);
        PageDrawer drawer = new MyPageDrawer(); // MyPageDrawer instead of PageDrawer
        drawer.drawPage(graphics, page, pageDimension);
        drawer.dispose();
        graphics.dispose();
        return retval;
    }

    public static void main(String[] args) throws IOException
    {
        String filename = "......./blah.pdf";

        // open the document
        PDDocument doc = PDDocument.loadNonSeq(new File(filename), null);
        List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
        for (int p = 0; p < pages.size(); ++p)
        {
            PDPage page = pages.get(p);
            BufferedImage bim = convertToImage(page, BufferedImage.TYPE_INT_RGB, 300);

            boolean b = ImageIOUtil.writeImage(bim, "page-" + (p + 1) + ".png", 300);
            if (!b)
            {
                // error handling
            }
        }
        doc.close();
    }

}

Question:

I'm working with PDFBox and trying to rotate an image and have it position correctly on the screen. The design editor I'm using outputs the following information about images that may be useful.

Image bounding box top-left coords (I'm using the bottom left coords to better suit PDFBox coord space.)

Image rotation in degrees

Image width & height

The translation appears to be off.

// Rotation
AffineTransform rotation = new AffineTransform();
rotation.rotate(Math.toRadians(360 - element.getAngle()),
    element.getLeft() + scaledWidth/2,
    adjustedYPos + scaledHeight/2);
stream.transform(new Matrix(rotation));

// Position & scale
AffineTransform mat = new AffineTransform(scaledWidth,
        0,
        0,
        scaledHeight,
        element.getLeft(),
        adjustedYPos);

// Draw the final image
stream.drawImage(pdfImage, new Matrix(mat));

Rotations are based on the center of the image as an anchor point.


Answer:

You can correctly position images using code like this:

void placeImage(PDDocument document, PDPage page, PDImageXObject image, float bbLowerLeftX, float bbLowerLeftY, float width, float height, float angle) throws IOException {
    try (   PDPageContentStream contentStream = new PDPageContentStream(document, page, AppendMode.APPEND, true, true)   ) {
        float bbWidth = (float)(Math.abs(Math.sin(angle))*height + Math.abs(Math.cos(angle))*width);
        float bbHeight = (float)(Math.abs(Math.sin(angle))*width + Math.abs(Math.cos(angle))*height);
        contentStream.transform(Matrix.getTranslateInstance((bbLowerLeftX + .5f*bbWidth), (bbLowerLeftY + .5f*bbHeight)));
        contentStream.transform(Matrix.getRotateInstance(angle, 0, 0));
        contentStream.drawImage(image, -.5f*width, -.5f*height, width, height);
    }
}

(PlaceRotatedImage utility method)

This method accepts coordinates as they are meaningful in the context of PDF, i.e. coordinate values and dimensions according to the default user space coordinate system of the given page (y values increasing upwards, the origin arbitrary but fairly often in the lower left), the (bounding) box given by its lower left corner, and angles as in math, in counterclockwise radians.

If you need the parameters differently, you can fairly easily adapt the method, though. If you e.g. get the upper left corner of the bounding box instead of the lower left, you can simply subtract the bounding box height determined in the method as bbHeight to calculate the lower left y coordinate used here.

You can use this method like this:

PDPage page = ...;

PDRectangle mediaBox = page.getMediaBox();
float bbLowerLeftX = 50;
float bbLowerLeftY = 100;

try (   PDPageContentStream contentStream = new PDPageContentStream(document, page)   ) {
    contentStream.moveTo(bbLowerLeftX, mediaBox.getLowerLeftY());
    contentStream.lineTo(bbLowerLeftX, mediaBox.getUpperRightY());
    contentStream.moveTo(mediaBox.getLowerLeftX(), bbLowerLeftY);
    contentStream.lineTo(mediaBox.getUpperRightX(), bbLowerLeftY);
    contentStream.stroke();
}

PDImageXObject image = PDImageXObject.createFromByteArray(document, IOUtils.toByteArray(resource), "Image");
placeImage(document, page, image, bbLowerLeftX, bbLowerLeftY, image.getWidth(), image.getHeight(), (float)(Math.PI/4));
placeImage(document, page, image, bbLowerLeftX, bbLowerLeftY, .5f*image.getWidth(), .5f*image.getHeight(), 0);
placeImage(document, page, image, bbLowerLeftX, bbLowerLeftY, .25f*image.getWidth(), .25f*image.getHeight(), (float)(9*Math.PI/8));

(PlaceRotatedImage test testPlaceByBoundingBox)

This code draws the left and bottom lines corresponding to the left and bottom side of the given lower left bounding box coordinates and draws an image at different magnifications and angles with the constant given lower left bounding box corner.

The result looks like this:


You can find more information on the calculation of the bounding box sizes in these answers:

  • Calculate Bounding box coordinates from a rotated rectangle
  • How to get width and height of the bounding box of a rotated rectangle
  • How to get size of a rotated rectangle
  • Find the Bounding Rectangle of Rotated Rectangle
  • ...
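The bounding-box arithmetic used in placeImage, together with the upper-left to lower-left adjustment mentioned earlier, can be tried out in isolation. This is a sketch in plain Java under the same conventions as the answer (angle in radians, y values increasing upwards); the class name RotatedBoundingBox and its methods are hypothetical.

```java
final class RotatedBoundingBox {
    /** Returns {bbWidth, bbHeight} of a width x height rectangle rotated by 'angle' radians. */
    static float[] size(float width, float height, double angle) {
        double s = Math.abs(Math.sin(angle));
        double c = Math.abs(Math.cos(angle));
        return new float[] { (float) (s * height + c * width), (float) (s * width + c * height) };
    }

    /** Lower left y of the bounding box, given its upper left y in PDF user space (y up). */
    static float lowerLeftY(float upperLeftY, float bbHeight) {
        return upperLeftY - bbHeight;
    }

    public static void main(String[] args) {
        // Sample values: a 100 x 50 image rotated by 45 degrees, upper left corner at y = 700.
        float[] bb = size(100f, 50f, Math.PI / 4);
        System.out.println("bbWidth=" + bb[0] + " bbHeight=" + bb[1]
                + " lowerLeftY=" + lowerLeftY(700f, bb[1]));
    }
}
```

If a design tool reports the upper left corner with y measured downwards from the top of the page, that y value would first have to be flipped against the page height before this adjustment applies.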

Question:

I have a problem getting an image from another package of my Eclipse project. I saw this post here. My code in CreateTableOnPDF.java is this:

288    ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
289    InputStream input = classLoader.getResourceAsStream("images/sun.png");
290    PDJpeg img = new PDJpeg(doc, input);

I getting this exception:

Exception in thread "main" java.lang.IllegalStateException: 
at org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg.setPropertiesFromAWT(PDJpeg.java:132)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg.<init>(PDJpeg.java:113)
at MainClasses.CreateTableOnPDF.main(CreateTableOnPDF.java:290)

I don't know what I am doing wrong. Maybe it is the many hours I have already spent in front of my monitor.

Here is my project structure:

Thanks for your attention and time!


Answer:

Please check the documentation: you have to pass JPEG data, not PNG data.

public PDJpeg(PDDocument doc, InputStream is) throws IOException

Construct from a stream.

Parameters:

doc - The document to create the image as part of.

is - The stream that contains the jpeg data.

Throws:

IOException - If there is an error reading the jpeg data.

Question:

I am trying to convert an image to PDF and then send it back as the response without saving it to disk, but I'm having trouble. Below is the code snippet.

PDDocument document = new PDDocument();
InputStream in = new FileInputStream(sourceFile);
BufferedImage bimg = ImageIO.read(in);
float width = bimg.getWidth();
float height = bimg.getHeight();
PDPage page = new PDPage(new PDRectangle(width, height));
document.addPage(page);
PDImageXObject img = PDImageXObject.createFromFile(sourceFile.getPath(), document);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawImage(img, 0, 0);
contentStream.close();
 in.close();       
 document.close();
 PDPage documentPage = document.getPage(0);
 InputStream pdfStream = documentPage.getContents();
 byte[] pdfData = new byte[pdfStream.available()];
 pdfStream.read(pdfData); 
 return Response.ok((Object) document).build();

Answer:

You are doing it the wrong way; use a ByteArrayOutputStream instead. I have edited this so it will be helpful for others as well. This is how I convert an image to PDF using PDFBox; below is the working code:

@GET
@Path("/endpoint/{resourceName}")
@Produces("application/pdf")
public Response downloadPdfFile(@PathParam("resourceName") String res) throws IOException {
    File sourceFile = new File("directoryPath/" + res + ".png");
    if (!sourceFile.exists()) {
        return Response.status(400).entity("resource not exist").build();
    }
    PDDocument document = new PDDocument();
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    InputStream in = new FileInputStream(sourceFile);
    BufferedImage bimg = ImageIO.read(in);
    float width = bimg.getWidth();
    float height = bimg.getHeight();
    // page sized to the image, one user space unit per pixel
    PDPage page = new PDPage(new PDRectangle(width, height));
    document.addPage(page);
    PDImageXObject img = PDImageXObject.createFromFile(sourceFile.getPath(), document);
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    contentStream.drawImage(img, 0, 0);
    contentStream.close();
    in.close();
    // serialize the document into memory before closing it
    document.save(outputStream);
    document.close();
    return Response.ok(outputStream.toByteArray()).build();
}

Question:

Friends, I am using PDFBox 2.0.6. I have been successful in extracting images from the PDF file, but right now it creates a single image per PDF page. There can be any number of images on a PDF page, and I want each embedded image to be extracted as a separate image.

Here is the code,

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class DemoPdf {

    public static void main(String args[]) throws Exception {
        //Loading an existing PDF document
        File file = new File("C:/Users/ADMIN/Downloads/Vehicle_Photographs.pdf");
        PDDocument document = PDDocument.load(file);
        //Instantiating the PDFRenderer class
        PDFRenderer renderer = new PDFRenderer(document);
        File imageFolder = new File("C:/Users/ADMIN/Desktop/image");

        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            //Rendering an image from the PDF document
            BufferedImage image = renderer.renderImage(page);
            //Writing the image to a file
            ImageIO.write(image, "JPEG", new File(imageFolder+"/" + page +".jpg"));
            System.out.println("Image created"+ page);
        }
        //Closing the document
        document.close();
    }

}   

Is it possible in PDFBox to extract all embedded images as separate images? Thanks.


Answer:

Yes, it is possible to extract all images from all the pages in a PDF.

You may refer to this link: extract images from pdf using PDFBox.

The basic idea is to extend PDFStreamEngine and override the processOperator method. Call PDFStreamEngine.processPage for each page. If the object passed to processOperator is an image object, get a BufferedImage from it and save it.

Question:

I have a large image and I want to print it inside my PDF document. I am writing the following:

InputStream in = new FileInputStream(new File("C:/mylargeimage.jpg"));
PDJpeg img = new PDJpeg(doc, in);
img.setHeight(100);
img.setWidth(100);
contentStream.drawImage(img,50,pageYaxis);

The image gets printed, but it is blurred and the full image is not shown. I need the image to be resized to 100x100. How can I overcome this? I have read a lot about it, but nothing is clear.


Answer:

Have you tried doing it like this?

InputStream in = new FileInputStream(new File("C:/mylargeimage.jpg"));
PDJpeg img = new PDJpeg(doc, in);
contentStream.drawXObject(img, 50, pageYaxis, 100, 100);

The Javadoc for drawImage says that it will draw at the default size of the image. The Javadoc for drawXObject says:

Draw an xobject(form or image) at the x,y coordinates and a certain width and height.

Calling setWidth and setHeight on the image itself will, I believe, only change what PDFBox believes to be the real size of the image - so it takes only 100x100 pixels of the image as the source.

Question:

Is there a way to find out the size of the image (in bytes, in whatever compressed image format they're stored in) in PDImageXObject without extracting it into BufferedImage?


Answer:

Call img.getCOSObject().getLength(), this will give you the length of the COSStream on which the image is based. If the image has a mask, you'll have to do the same with it. Call img.getMask() to check whether there is one.

Question:

I've some very basic code which inserts an image into an existing PDF:

public class InsertImg
{
    public static void main (final String[] args) throws IOException
    {
        PDDocument document = PDDocument.load (new File ("original.pdf"));
        PDPage page = document.getPage (0);

        byte[] imgBytes = Files.readAllBytes (Paths.get ("signature.png"));
        PDImageXObject pdImage = PDImageXObject.createFromByteArray (document, imgBytes, "name_of_image");
        PDPageContentStream content = new PDPageContentStream (document, page, AppendMode.APPEND, true, true);
        content.drawImage (pdImage, 50.0f, 350.0f, 100.0f, 25.0f);
        content.close (); 

        document.save (new File ("result.pdf"));
        document.close ();
    }
}

While this code worked fine in PDFBox 2.0.8 for all image files, under version 2.0.12 it works only for some images.

(Background: We would like to insert an image of a signature into an existing and already generated letter. The signatures are all generated with the same software. In version 2.0.12 not all signatures can be inserted anymore. In version 2.0.8 all signatures could be inserted.)

The generated PDF file "result.pdf" cannot be opened properly in Acrobat Reader. Acrobat Reader shows only the original PDF "original.pdf", but does not display the signature image. It says "error in page. please contact the creator of the pdf".

However, most images can be inserted, so it is likely that the problem depends on the very image used.

The images are all OK; they are PNGs and were checked and verified with various imaging programs, e.g. GIMP or IrfanView.

Furthermore, the code above has always worked fine with PDFBox 2.0.8. After an update of PDFBox to version 2.0.12, the problem showed up, and the newest version 2.0.16 still produces the error, still on the same image files, and still not on all.

NB: When I put the following line into comment, then no error shows up in Acrobat Reader, so the problem must be somewhere within drawImage.

    // content.drawImage (pdImage, 50.0f, 350.0f, 100.0f, 25.0f);

and the rest of the code seems to be fine.

Also, I've just tried starting with an empty PDF and not loading an already generated one.

    PDDocument document = new PDDocument ();
    PDPage page = new PDPage ();
    document.addPage (page);
    [...]

The problem here is still the same, so the issue does not depend on the underlying PDF.


Answer:

It is a bug since 2.0.12 (wrong alternate colorspace for gray images created with the LosslessFactory) that has been fixed in PDFBOX-4607 and will be in release 2.0.17. Display works in all viewers I have tested except Adobe Reader, even though the alternate colorspace shouldn't be used when an ICC colorspace is available. Here's some code to fix PDFs (this assumes that images are only on the top level of a page, i.e. images in other structures are not considered):

for (PDPage page : doc.getPages())
{
    PDResources resources = page.getResources();
    if (resources == null)
    {
        continue;
    }
    for (COSName name : resources.getXObjectNames())
    {
        PDXObject xObject = resources.getXObject(name);
        if (xObject instanceof PDImageXObject)
        {
            PDImageXObject img = (PDImageXObject) xObject;
            if (img.getColorSpace() instanceof PDICCBased)
            {
                PDICCBased icc = (PDICCBased) img.getColorSpace();
                if (icc.getNumberOfComponents() == 1 && PDDeviceRGB.INSTANCE.equals(icc.getAlternateColorSpace()))
                {
                    List<PDColorSpace> list = new ArrayList<>();
                    list.add(PDDeviceGray.INSTANCE);
                    icc.setAlternateColorSpaces(list);
                }
            }
        }
    }
}

Question:

PDFBox offers functions to render an entire page, but no way to render only a specific rectangle of the page.

It seems the only way to achieve that would be to use PDFRenderer.renderPageToGraphics and configure the Graphics2D object so only the region of interest is rendered, but I can't figure out how to do that.

Another way would be to render the whole page, then extract a sub-image, but I would like to avoid this.


Answer:

So, it was a bit easier than I initially thought.

Here is Groovy code to do that.

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.rendering.PDFRenderer

import javax.imageio.ImageIO
import java.awt.*
import java.awt.geom.AffineTransform
import java.awt.geom.Rectangle2D
import java.awt.image.BufferedImage
import java.awt.image.RenderedImage

class RegionPDFRenderer {

    private static final int POINTS_IN_INCH = 72

    private final PDDocument document
    private final PDFRenderer renderer
    private final int resolutionDotPerInch

    RegionPDFRenderer(PDDocument document, int resolutionDotPerInch) {
        this.document = document
        this.renderer = new PDFRenderer(document)
        this.resolutionDotPerInch = resolutionDotPerInch
    }

    RenderedImage renderRect(int pageIndex, Rectangle2D rect) {
        BufferedImage image = createImage(rect)
        Graphics2D graphics = createGraphics(image, rect)
        renderer.renderPageToGraphics(pageIndex, graphics)
        graphics.dispose()
        image
    }

    private BufferedImage createImage(Rectangle2D rect) {
        double scale = resolutionDotPerInch / POINTS_IN_INCH
        int bitmapWidth  = rect.width  * scale
        int bitmapHeight = rect.height * scale
        new BufferedImage(bitmapWidth, bitmapHeight, BufferedImage.TYPE_INT_RGB)
    }

    private Graphics2D createGraphics(BufferedImage image, Rectangle2D rect) {
        double scale = resolutionDotPerInch / POINTS_IN_INCH
        AffineTransform transform = AffineTransform.getScaleInstance(scale, scale)
        transform.concatenate(AffineTransform.getTranslateInstance(-rect.x, -rect.y))

        Graphics2D graphics = image.createGraphics()
        graphics.setBackground(Color.WHITE)
        graphics.setTransform(transform)
        graphics
    }

    static void main(String[] args) {
        String filePath = './input.pdf'
        def pageIndex = 0
        def region = new Rectangle(70, 472, 498, 289)
        def resolutionForHiDPIScreenRendering = 220 /* dpi */

        PDDocument doc = PDDocument.load(new File(filePath))
        try {
            def renderer = new RegionPDFRenderer(doc, resolutionForHiDPIScreenRendering)
            def image = renderer.renderRect(pageIndex, region)
            ImageIO.write(image, "png", new File("./output/image.png"))
        } finally {
            doc.close()
        }
    }

}
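For readers who prefer plain Java, the coordinate math in createImage and createGraphics above can be sketched with nothing but java.awt. This is only an illustration: the class and method names are made up, it assumes the rectangle is given in points in the same top-left-origin coordinates the Graphics2D uses, and it rounds the bitmap size up where the Groovy code truncates.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: only the transform and size math.
class RegionTransformSketch {
    static final double POINTS_PER_INCH = 72.0;

    /** Scales points to pixels, then shifts the region's corner to the image origin. */
    static AffineTransform regionTransform(Rectangle2D region, double dpi) {
        double scale = dpi / POINTS_PER_INCH;
        AffineTransform t = AffineTransform.getScaleInstance(scale, scale);
        // translate() concatenates, so a point p ends up at scale * (p - corner)
        t.translate(-region.getX(), -region.getY());
        return t;
    }

    /** Bitmap extent in pixels for a length in points, rounded up. */
    static int pixels(double points, double dpi) {
        return (int) Math.ceil(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        Rectangle2D region = new Rectangle2D.Double(70, 472, 498, 289);
        double dpi = 220;
        AffineTransform t = regionTransform(region, dpi);
        Point2D corner = t.transform(new Point2D.Double(region.getX(), region.getY()), null);
        System.out.println("region corner maps to " + corner);
        System.out.println("bitmap size: " + pixels(region.getWidth(), dpi)
                + " x " + pixels(region.getHeight(), dpi));
    }
}
```

The resulting transform is what would be handed to Graphics2D.setTransform before calling renderPageToGraphics, exactly as in the Groovy version.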

Question:

This works fine so far, but how can I add transparency to the generated images?

for (img <- 0 until f.length) {
    val inputPdf = PDDocument.load(f(img).getPath).getDocumentCatalog.getAllPages.get(0).asInstanceOf[PDPage]

    val outputfile = new File(f(img).getName + ".png")
    ImageIO.write(inputPdf.convertToImage(), "png", outputfile)
}

Best regards Torsten


Answer:

Try using convertToImage(type, resolution) with TYPE_INT_ARGB.

You can look at the source of convertToImage: http://codenav.org/code.html?project=/org/apache/pdfbox/pdfbox/1.8.4&path=/Source%20Packages/org.apache.pdfbox.pdmodel/PDPage.java (1.8.4) or https://svn.apache.org/repos/asf/pdfbox/tags/1.8.8/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDPage.java (1.8.8-current)

public BufferedImage convertToImage() throws IOException
{
    //note we are doing twice as many pixels because
    //the default size is not really good resolution,
    //so create an image that is twice the size
    //and let the client scale it down.
    return convertToImage(8, 2 * DEFAULT_USER_SPACE_UNIT_DPI);
}

You probably want to use:

convertToImage(BufferedImage.TYPE_INT_ARGB, 2 * DEFAULT_USER_SPACE_UNIT_DPI);

NOTE: PDF supports transparent objects. However, as stated by @mkl, this is not compatible with the PDF reference.
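To see why the image type matters, here is a small JDK-only sketch (the class name is made up). The literal 8 passed by the default convertToImage() is the value of BufferedImage.TYPE_USHORT_565_RGB, which has no alpha channel, whereas TYPE_INT_ARGB starts out fully transparent. Whether the rendered page background actually stays transparent still depends on how PDFBox fills the page, which this sketch does not cover.

```java
import java.awt.image.BufferedImage;

// Illustration only: compares the initial alpha of freshly created images.
class AlphaChannelSketch {
    /** Alpha (0..255) of the top-left pixel of a new 1x1 image of the given type. */
    static int initialAlpha(int imageType) {
        BufferedImage image = new BufferedImage(1, 1, imageType);
        return image.getRGB(0, 0) >>> 24;
    }

    public static void main(String[] args) {
        System.out.println("TYPE_USHORT_565_RGB constant: " + BufferedImage.TYPE_USHORT_565_RGB);
        System.out.println("alpha for TYPE_USHORT_565_RGB: "
                + initialAlpha(BufferedImage.TYPE_USHORT_565_RGB));
        System.out.println("alpha for TYPE_INT_ARGB: "
                + initialAlpha(BufferedImage.TYPE_INT_ARGB));
    }
}
```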

Question:


Answer:

Try using append mode:

//creating the PDPageContentStream object
PDPageContentStream contents = new PDPageContentStream(doc, page, AppendMode.APPEND, true);

Edit

TilmanHausherr mentioned

new PDPageContentStream(doc, page, AppendMode.APPEND, true, true);

That's why

Question:

I've taken a sample Jpeg2000 from the FNordware examples page.

However, when I try to add that image to the PDF:

PDDocument document = new PDDocument();
PDImageXObject pdImage = PDImageXObject.createFromFileByContent(
   new File("samples/relax.jp2"), document);
PDPage page = new PDPage(new PDRectangle(pageWidth, pageHeight));
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawImage(pdImage, matrix);
contentStream.close();

I get the exception:

Caused by: java.lang.IllegalArgumentException: Image type UNKNOWN not supported: relax.jp2 at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.createFromFileByContent(PDImageXObject.java:313)

The pdfbox dependencies that I have in Maven:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>2.0.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jempbox</artifactId>
        <version>1.8.16</version>
    </dependency>       
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.2</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-jpeg2000</artifactId>
        <version>1.3.0</version>
    </dependency>

Am I doing something wrong here? Or is there some problem with PDFBox and/or the samples that I'm using?

Another Apache library, Tika, detects this sample file's MIME type as "image/jp2":

TikaConfig tika = new TikaConfig();
Metadata metadata = new Metadata();
MediaType mimetype = tika.getDetector().detect(
     TikaInputStream.get(new FileInputStream("samples/relax.jp2")), metadata);

Answer:

From pdfbox's documentation: createFromFileByContent "The following file types are supported: jpg, jpeg, tif, tiff, gif, bmp and png."

Looking into the source code, createFromFileByContent runs PDFBox's own check for known file types, which is independent of the underlying image libraries. The detection code looks like this: https://jar-download.com/artifacts/org.apache.pdfbox/pdfbox/2.0.3/source-code/org/apache/pdfbox/util/filetypedetector/FileTypeDetector.java

This check does not recognize jpeg 2000.

Actually createFromFileByExtension might be a better bet:

if ("gif".equals(ext) || "bmp".equals(ext) || "png".equals(ext))
{
    BufferedImage bim = ImageIO.read(file);
    return LosslessFactory.createFromImage(doc, bim);
}

As long as you pretend you have a gif, bmp or png and your ImageIO supports j2k, this might somewhat work. (not tested)
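If you would rather detect JPEG 2000 yourself than rely on the file extension, a content check is short to write. The following is a sketch with made-up names, not PDFBox code: it tests for the JP2 signature box and for the start of a raw codestream. A file that passes could then be read with ImageIO.read and handed to LosslessFactory.createFromImage as above, still assuming your ImageIO has a JPEG 2000 reader installed.

```java
import java.util.Arrays;

// Hypothetical helper, independent of PDFBox's FileTypeDetector.
class Jpeg2000SniffSketch {
    // JP2 signature box: length 12, type "jP  ", content 0D 0A 87 0A
    private static final byte[] JP2_SIGNATURE = {
        0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, (byte) 0x87, 0x0A
    };
    // Raw codestream: SOC marker FF 4F immediately followed by SIZ marker FF 51
    private static final byte[] J2K_CODESTREAM = { (byte) 0xFF, 0x4F, (byte) 0xFF, 0x51 };

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** True if the first bytes of a file look like JP2 or a raw JPEG 2000 codestream. */
    static boolean looksLikeJpeg2000(byte[] head) {
        return startsWith(head, JP2_SIGNATURE) || startsWith(head, J2K_CODESTREAM);
    }

    public static void main(String[] args) {
        byte[] pngHead = { (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
        System.out.println("PNG header detected as JPEG 2000? " + looksLikeJpeg2000(pngHead));
        System.out.println("JP2 signature detected? " + looksLikeJpeg2000(JP2_SIGNATURE));
    }
}
```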

Question:

To find the actual size taken by an image on a PDF, I use PDFBox, and I followed what is described in this SO answer. So basically I call

 // Computes the image actual location and dimensions
 PrintImageLocations renderer = new PrintImageLocations();

 for (int i = 0; i < pageLimit; ++i) {
        PDPage page = pdf.getPage(i);

        renderer.processPage(page);
 }

and the PrintImageLocations() is taken from this PDFBox code example.

Yet with a PDF document that I use for test (generated by GPL Ghostscript 910 (ps2write) from an image found on Wikipedia), the image size reported is 0 x 0 (although the PDF can be imported into Gimp or Libre Office Draw).

So I'd like to know whether the code I am currently using is reliable for finding the image size, and what could make it fail to find the right size.

The PDF used for this test can be found here

==========

Edit: Following @Itai's comment, it appears that the condition if ("Do".equals(operation)) is never satisfied because no such operation is invoked. Consequently, only the processOperator from the superclass is invoked.

The only operations that are invoked are the following (I added System.err.println("Processing " + operation); before the condition in the overridden processOperator method):

Processing q
Processing cm
Processing gs
Processing q
Processing re
Processing W
Processing n
Processing rg
Processing re
Processing f
Processing cs
Processing scn
Processing re
Processing f
Processing Q
Processing Q

==========

Any hints appreciated,


Answer:

As you already have found out yourself, the reason for the 0x0 output is that the code from PrintImageLocations as-is cannot find the image at all.

PrintImageLocations does not find the image because it only looks for image uses in the page content and in form XObjects (also nested) used in the page content. In the file at hand, however, the image is drawn inside a tiling pattern's content stream, and that pattern is used to fill an area in the page content.

To allow PDFBox to find this image, we have to extend the PrintImageLocations class a bit to also descend into pattern content streams, e.g. like this:

class PrintImageLocationsImproved extends PrintImageLocations {
    public PrintImageLocationsImproved() throws IOException {
        super();

        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingColorSpace());
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if (fillOperations.contains(operation)) {
            PDColor color = getGraphicsState().getNonStrokingColor();
            PDAbstractPattern pattern = getResources().getPattern(color.getPatternName());
            if (pattern instanceof PDTilingPattern) {
                processTilingPattern((PDTilingPattern) pattern, null, null);
            }
        }
        super.processOperator(operator, operands);
    }

    final List<String> fillOperations = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
}

(ExtractImageLocations inner class PrintImageLocationsImproved)

The tiling pattern in the document at hand is used as a pattern color for filling, not stroking. Thus, PrintImageLocationsImproved has to register operator listeners for non-stroking color operators to have the fill color correctly updated in the graphics state.

processOperator before delegating to the PrintImageLocations implementation now first checks whether the operator is a fill operation. In that case it inspects the current fill color. If it is a pattern color, processOperator initiates the processTilingPattern handling defined in PDFStreamEngine which starts a nested analysis of the pattern content stream and so eventually lets the PrintImageLocationsImproved find the image.

Using PrintImageLocationsImproved like this

try (   PDDocument document = PDDocument.load(...)    )
{
    PrintImageLocations printer = new PrintImageLocationsImproved();
    int pageNum = 0;
    for( PDPage page : document.getPages() )
    {
        pageNum++;
        System.out.println( "Processing page: " + pageNum );
        printer.processPage(page);
    }
}

(ExtractImageLocations test testExtractLikeHelloWorldImprovedFromTopSecret)

for your PDF file, therefore, will find the image:

Processing page: 1
*******************************************************************
Found image [R8]
position in PDF = 39.0, 102.48 in user space units
raw image size  = 1209, 1640 in pixels
displayed size  = 516.3119, 700.3752 in user space units
displayed size  = 7.1709986, 9.727433 in inches at 72 dpi rendering
displayed size  = 182.14336, 247.0768 in millimeters at 72 dpi rendering
Beware, this is not a perfect fix but rather a proof of concept and workaround: it neither properly restricts the pattern to the area actually filled nor returns multiple finds for an area large enough to require multiple pattern tiles to fill. Nonetheless, it returns an image match for the file at hand.

Question:

I've read all similar questions and answers and I'm still stuck because the old questions were for old versions.

I want to replace all images in given PDF with external images.

Here is what I have done so far:

   for (int a = 0; a < doc.getNumberOfPages(); a++) {
        PDPage p = doc.getPage(a);
        PDResources resources = p.getResources();
        for (COSName xObjectName : resources.getXObjectNames()) {
            PDXObject xObject = resources.getXObject(xObjectName);
            if (xObject instanceof PDImageXObject) {
                PDImageXObject original_img = ((PDImageXObject) xObject);
                PDImageXObject replacement_img = PDImageXObject.createFromFile(f.getImages().get(a), doc);
            }        
        }
    }

So, I have two PDImageXObjects named original_img and replacement_img. replacement_img has to overwrite original_img.


Answer:

To replace the old image by the new image, one has to set the resource in question to the new image, i.e.

resources.put(xObjectName, replacement_img);

after the instantiation of replacement_img in the OP's code.

Question:

I need to extract images from a PDF and I am doing it via PDFBox (v 1.8.9). It works well in 90% of cases, but some images are saved with a black background (or are completely white) when extracted, even though they look perfectly good in the original PDF. I imagine it is something to do with those JPG files. What should I check in the JPGs? I am trying to see if I can upload an example PDF.

This is the relevant (quite standard) piece of code...

    String pdfFile = promptForPDFFile(jf, "Select PDF file");
    // Load pdf file
    PDDocument document=PDDocument.load(pdfFile);
    //Get the pdf pages
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    int pagetot = pages.size();

    int pagenum = 1;
    while( iter.hasNext() )
    {
        // Cycle on the pages for the images
        PDPage page = (PDPage)iter.next();

        PDResources resources = page.getResources();
        PDFTextStripper textStripper=new PDFTextStripper(); 
        textStripper.setStartPage(pagenum);
        textStripper.setEndPage(pagenum);
        Map images = resources.getImages();

        // Get page text content and use it as file name
        String pagecontent= textStripper.getText(document); 
        pagecontent = pagecontent.replaceAll("\n", "");
        pagecontent = pagecontent.replaceAll("\r", "");
        if( images != null )
        {
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() )
            {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                File tempdir = new File(tempPath+"/temp/");
                tempdir.mkdirs();

                String name = tempPath+"/temp/"+pagecontent;
                //System.out.println( "Writing image:" + name );


                //Write the image to file
                image.write2file( name );


            }
        }
        pagenum ++;
        if (pagenum % 10 ==0)
        {
            System.out.print("\n--- "+ pagenum +"/"+pagetot);
        }
    }

Thanks in advance


Answer:

I ran ExtractImages.java against the two files you sent me. The problem file has CMYK images, as PDFDebugger shows.

The problem is that the 1.8 version doesn't handle CMYK images properly.

But there's a trick: The images are encoded with the DCTDecode filter, which is JPEG. You have "real JPEGs" in the PDF.

I am able to extract your images properly by using the "-directJPEG" option of that tool, which bypasses the decoding mechanism of PDFBox, and just saves the JPEG files "as is".

Note that while this works nicely with your files, it doesn't work properly if the images have an external colorspace specified in the PDF.

Here's the full source code. See writeJpeg2file() for the raw extraction details.

public class ExtractImages
{
    private int imageCounter = 1;

    private static final String PASSWORD = "-password";
    private static final String PREFIX = "-prefix";
    private static final String ADDKEY = "-addkey";
    private static final String NONSEQ = "-nonSeq";
    private static final String DIRECTJPEG = "-directJPEG";

    private static final List<String> DCT_FILTERS = new ArrayList<String>();

    static
    {
        DCT_FILTERS.add( COSName.DCT_DECODE.getName() );
        DCT_FILTERS.add( COSName.DCT_DECODE_ABBREVIATION.getName() );
    }

    private ExtractImages()
    {
    }

    /**
     * This is the entry point for the application.
     *
     * @param args The command-line arguments.
     *
     * @throws Exception If there is an error decrypting the document.
     */
    public static void main( String[] args ) throws Exception
    {
        ExtractImages extractor = new ExtractImages();
        extractor.extractImages( args );
    }

    private void extractImages( String[] args ) throws Exception
    {
        if( args.length < 1 || args.length > 4 )
        {
            usage();
        }
        else
        {
            String pdfFile = null;
            String password = "";
            String prefix = null;
            boolean addKey = false;
            boolean useNonSeqParser = false;
            boolean directJPEG = false;
            for( int i=0; i<args.length; i++ )
            {
                if( args[i].equals( PASSWORD ) )
                {
                    i++;
                    if( i >= args.length )
                    {
                        usage();
                    }
                    password = args[i];
                }
                else if( args[i].equals( PREFIX ) )
                {
                    i++;
                    if( i >= args.length )
                    {
                        usage();
                    }
                    prefix = args[i];
                }
                else if( args[i].equals( ADDKEY ) )
                {
                    addKey = true;
                }
                else if( args[i].equals( NONSEQ ) )
                {
                    useNonSeqParser = true;
                }
                else if( args[i].equals( DIRECTJPEG ) )
                {
                    directJPEG = true;
                }
                else
                {
                    if( pdfFile == null )
                    {
                        pdfFile = args[i];
                    }
                }
            }
            if(pdfFile == null)
            {
                usage();
            }
            else
            {
                if( prefix == null && pdfFile.length() >4 )
                {
                    prefix = pdfFile.substring( 0, pdfFile.length() -4 );
                }

                PDDocument document = null;

                try
                {
                    if (useNonSeqParser)
                    {
                        document = PDDocument.loadNonSeq(new File(pdfFile), null, password);
                    }
                    else
                    {
                        document = PDDocument.load( pdfFile );

                        if( document.isEncrypted() )
                        {
                            StandardDecryptionMaterial spm = new StandardDecryptionMaterial(password);
                            document.openProtection(spm);
                        }
                    }
                    AccessPermission ap = document.getCurrentAccessPermission();
                    if( ! ap.canExtractContent() )
                    {
                        throw new IOException(
                            "Error: You do not have permission to extract images." );
                    }

                    List pages = document.getDocumentCatalog().getAllPages();
                    Iterator iter = pages.iterator();
                    while( iter.hasNext() )
                    {
                        PDPage page = (PDPage)iter.next();
                        PDResources resources = page.getResources();
                        // extract all XObjectImages which are part of the page resources
                        processResources(resources, prefix, addKey, directJPEG);
                    }
                }
                finally
                {
                    if( document != null )
                    {
                        document.close();
                    }
                }
            }
        }
    }

    public void writeJpeg2file(PDJpeg image, String filename) throws IOException
    {
        FileOutputStream out = null;

        try
        {
            out = new FileOutputStream(filename + ".jpg");
            InputStream data = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
            byte[] buf = new byte[1024];
            int amountRead;
            while ((amountRead = data.read(buf)) != -1)
            {
                out.write(buf, 0, amountRead);
            }
            IOUtils.closeQuietly(data);
            out.flush();
        }
        finally
        {
            if (out != null)
            {
                out.close();
            }
        }
    }

    private void processResources(PDResources resources, String prefix, 
            boolean addKey, boolean directJPEG) throws IOException
    {
        if (resources == null)
        {
            return;
        }
        Map<String, PDXObject> xobjects = resources.getXObjects();
        if( xobjects != null )
        {
            Iterator<String> xobjectIter = xobjects.keySet().iterator();
            while( xobjectIter.hasNext() )
            {
                String key = xobjectIter.next();
                PDXObject xobject = xobjects.get( key );
                // write the images
                if (xobject instanceof PDXObjectImage)
                {
                    PDXObjectImage image = (PDXObjectImage)xobject;
                    String name = null;
                    if (addKey) 
                    {
                        name = getUniqueFileName( prefix + "_" + key, image.getSuffix() );
                    }
                    else 
                    {
                        name = getUniqueFileName( prefix, image.getSuffix() );
                    }
                    System.out.println( "Writing image:" + name );
                    if (directJPEG && "jpg".equals(image.getSuffix()))
                    {
                        writeJpeg2file((PDJpeg) image, name);
                    }
                    else
                    {
                        image.write2file(name);
                    }
                    image.clear(); // PDFBOX-2101 get rid of cache ASAP
                }
                // maybe there are more images embedded in a form object
                else if (xobject instanceof PDXObjectForm)
                {
                    PDXObjectForm xObjectForm = (PDXObjectForm)xobject;
                    PDResources formResources = xObjectForm.getResources();
                    processResources(formResources, prefix, addKey, directJPEG);
                }
            }
        }
        resources.clear();
    }

    private String getUniqueFileName( String prefix, String suffix )
    {
        String uniqueName = null;
        File f = null;
        while( f == null || f.exists() )
        {
            uniqueName = prefix + "-" + imageCounter;
            f = new File( uniqueName + "." + suffix );
            imageCounter++;
        }
        return uniqueName;
    }

    /**
     * This will print the usage requirements and exit.
     */
    private static void usage()
    {
        System.err.println( "Usage: java org.apache.pdfbox.ExtractImages [OPTIONS] <PDF file>\n" +
            "  -password  <password>        Password to decrypt document\n" +
            "  -prefix  <image-prefix>      Image prefix(default to pdf name)\n" +
            "  -addkey                      add the internal image key to the file name\n" +
            "  -nonSeq                      Enables the new non-sequential parser\n" +
            "  -directJPEG                  Forces the direct extraction of JPEG images regardless of colorspace\n" +
            "  <PDF file>                   The PDF document to use\n"
            );
        System.exit( 1 );
    }

}

Question:

I'm trying to detect images in this pdf using PDFBox. The pdf has two blank images, one on the left side (below the text "Put this IN the box") and the other on the right side (below the text "Affix this OUTSIDE the box"). This is the code I'm using to detect the images:

PDPage page = (PDPage) catalog.getAllPages().get(0);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();

PDResources resources = page.getResources();
Map<String, PDXObjectImage> images = resources.getImages();
if(null != images){
        Iterator<String> it = images.keySet().iterator();
        while(it.hasNext()){
            String key = it.next();
            System.out.println("Key >>>>>>>>>>>>>> "+key);
        }
}

I'm able to detect the second image. However, the first image is not being detected. What is the problem? I'm sure the pdf is proper. I created it multiple times, and still I'm facing the same problem. I created the pdf using Sketch.

Thanks.


Answer:

In short

I'm able to detect the second image. However, the first image is not being detected. What is the problem?

Actually the same image resource is used for both on-page images, merely stretched to different dimensions.

In detail

If you look at the content stream of your page, you'll see this at the end:

q
720 0 0 970 832 126 cm
/Im1 Do
Q
q
512 0 0 128 144 968 cm
/Im1 Do
Q

The first four lines draw the image resource Im1 at position 832, 126 stretched to 720 x 970, and the last four lines draw the same image resource Im1 at position 144, 968 stretched to 512 x 128.
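The position and size follow directly from the six cm operands: an image always occupies the unit square, and the matrix maps that square onto the page. A JDK-only sketch of this mapping (made-up class name, and assuming the cm is the only transformation in effect, as inside the q ... Q pairs above):

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Illustration only: where does "a b c d e f cm" followed by "Do" place an image?
class ImagePlacementSketch {
    /** Bounding box of the unit square after applying the cm operands a b c d e f. */
    static Rectangle2D imageBounds(double a, double b, double c, double d, double e, double f) {
        // AffineTransform takes its arguments in the same order as the PDF matrix operands
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);
        return ctm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1)).getBounds2D();
    }

    public static void main(String[] args) {
        System.out.println(imageBounds(720, 0, 0, 970, 832, 126));
        System.out.println(imageBounds(512, 0, 0, 128, 144, 968));
    }
}
```

In a real page several cm operators may be nested, so their matrices have to be multiplied before the unit square is mapped. That bookkeeping is what PDFStreamEngine does for you.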

What to do

Your approach to merely look at the page resources to find on-page images is inappropriate because

  • as you have seen a single image resource may be used multiple times on page stretched to different dimensions,
  • an image resource may not be used at all on a page (e.g. some documents have one big resources dictionary referenced from all pages; for a given page many resources may not be used),
  • images can be inlined into the content stream; your approach would not see these images at all, and
  • form XObjects or patterns may be displayed on your page, each of which may have images in its own resources; as you only look at image resources contained in the immediate page resources, your approach will not find them either.

A better solution (only failing for inlined and probably patterned images) is presented in the PDFBox sample PrintImageLocations the output of which for your file is

*******************************************************************
Found image [Im1]
position = 832.0, 128.0
size = 360px, 462px
size = 720.0, 970.0
size = 10.0in, 13.472222in
size = 254.0mm, 342.19446mm

*******************************************************************
Found image [Im1]
position = 144.0, 128.0
size = 360px, 462px
size = 512.0, 128.0
size = 7.111111in, 1.7777778in
size = 180.62222mm, 45.155556mm

This sample makes use of the PDFBox PDFStreamEngine to parse the content processed to draw a page.
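The inch and millimeter figures in that output are plain unit conversions of the displayed size. A minimal sketch, assuming the default user space unit of 1/72 inch (a page can override this with a UserUnit entry, which is ignored here); the class name is made up:

```java
// Illustration only: converts default user space units to inches and millimeters.
class UserSpaceUnitsSketch {
    static final float UNITS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    static float toInches(float units) {
        return units / UNITS_PER_INCH;
    }

    static float toMillimeters(float units) {
        return toInches(units) * MM_PER_INCH;
    }

    public static void main(String[] args) {
        float width = 720f;
        float height = 970f;
        System.out.println("size = " + toInches(width) + "in, " + toInches(height) + "in");
        System.out.println("size = " + toMillimeters(width) + "mm, " + toMillimeters(height) + "mm");
    }
}
```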

Question:

I'm using PDFBox 1.7.0 (I do not have a choice of version because the production server runs an old one). I am trying to add an image to an existing PDF which already has a logo. When I add the new image, the old one disappears as if it had been replaced.

// Used to convert mm to dots
// ... 72 dots per inch
static final int DEFAULT_USER_SPACE_UNIT_DPI = 72;
// ... mm -> inch -> dots
static final float MM_TO_UNITS = 1 / (10 * 2.54f) * DEFAULT_USER_SPACE_UNIT_DPI;

/**
 * Add a given image to a specific page of a PDF
 * @param document PDF document to manipulate
 * @param input image inputStream
 * @param pdfpage page number to target
 * @param x image position (in mm)
 * @param y image position (in mm)
 * @param width max width of the image (in mm)
 * @param height max height of the image (in mm)
 * @param opacity opacity level of the image (fraction)
 */
void addImageToPage (PDDocument document, InputStream input, int pdfpage, int x, int y, int width, int height, float opacity) throws IOException {
    if (input != null) {
        // Convert inputstream to usable BufferedImage
        BufferedImage tmp_image = ImageIO.read (input);
        // Use TYPE_4BYTE_ABGR to fix PDFBox issue with transparent PNG
        BufferedImage image = new BufferedImage (tmp_image.getWidth(), tmp_image.getHeight(), BufferedImage.TYPE_4BYTE_ABGR);
        // Prepare the image
        image.createGraphics().drawRenderedImage (tmp_image, null);
        PDXObjectImage ximage = new PDPixelMap (document, image);
        // Resize the image
        int iWidth = ximage.getWidth();
        int iHeight = ximage.getHeight();
        if (width / height > iWidth / iHeight) {
            ximage.setWidth (Math.round (width * MM_TO_UNITS));
            ximage.setHeight (Math.round ((iHeight * width / iWidth) * MM_TO_UNITS));
        } else {
            ximage.setWidth (Math.round ((iWidth * height / iHeight) * MM_TO_UNITS));
            ximage.setHeight (Math.round (height * MM_TO_UNITS));
        }
        // Retrieve the page to update
        PDPage page = (PDPage)document.getDocumentCatalog().getAllPages().get (pdfpage);
        PDResources resources = page.findResources();
        // Get graphics states
        Map graphicsStates = resources.getGraphicsStates();
        if (graphicsStates == null) {
            graphicsStates = new HashMap();
        }
        // Set graphics states configurations
        PDExtendedGraphicsState extendedGraphicsState = new PDExtendedGraphicsState();
        // Set the opacity of the image
        extendedGraphicsState.setNonStrokingAlphaConstant (opacity);
        graphicsStates.put ("TransparentState", extendedGraphicsState);
        // Restore graphics states
        resources.setGraphicsStates (graphicsStates);
        // Retrieve the content stream
        PDPageContentStream contentStream = new PDPageContentStream (document, page, true, true);
        // Activate transparency options
        contentStream.appendRawCommands ("/TransparentState gs\n");
        contentStream.endMarkedContentSequence();
        // Insert image
        contentStream.drawImage (
            ximage,
            (float) x * MM_TO_UNITS,
            (float) y * MM_TO_UNITS
        );
        // close the stream
        contentStream.close();
    }
}

I expected the new image to be added to the page alongside the existing one, but instead the existing image has disappeared and only the new one is shown.

Example of used PDF : http://www.mediafire.com/folder/g6p7c2b5ob1c7/PDFBox_issue


Answer:

There are several bugs in 1.7... one I mentioned in a comment (it turns out it doesn't affect you), the other one is that the resources object does some caching that isn't managed properly… long story short, you need to save and restore your xobject resources like this:

Map<String, PDXObject> xObjectsMap = page.getResources().getXObjects(); // save xobjects
…
PDXObjectImage ximage = new PDPixelMap (document, image);
String imgName = page.getResources().addXObject(ximage, "Im");
cs.drawImage(ximage, 0, 0); // bug happens here, old xobjects gets lost
xObjectsMap.put(imgName, ximage);
page.getResources().setXObjects(xObjectsMap); // restore xobjects

This is really just a workaround… there may be more bad surprises coming. You should not use old versions. They no longer spark joy. You should thank them for their service and then let them go without guilt.

Question:

I've been generating PDFs and adding a logo at the top using a JPEG file. The image in the PDF showed up fine on my local machine (OS X). As soon as I sent the file anywhere else, though, the image would no longer appear. I'm using Java and PDFBox to generate the PDF.

PDPageContentStream contentStream = ...;
PDDocument document = ...;
PDImageXObject image = PDImageXObject.createFromFile("image.jpg", document);
contentStream.drawImage(image, 50, 700, 250, 67);

Answer:

After upgrading from 1.8.11 to 2.0.0, I've managed to resolve the issue by changing the way the image was added to the PDF.

PDPageContentStream contentStream = ...;
PDDocument document = ...;
InputStream imageStream = this.getClass().getClassLoader().getResourceAsStream("image.jpg");
PDImageXObject image = JPEGFactory.createFromStream(document,imageStream);
contentStream.drawImage(image, 50, 700, 250, 67);

The key seems to be the use of:

JPEGFactory.createFromStream()

instead of:

PDImageXObject.createFromFile()

Question:

I'm trying to add an image to the center of a PDF page using PDFBox. Below is my code, but I'm unable to get the correct position of the image in the PDF. I followed the question "In PDFBox, how to change the origin (0,0) point of a PDRectangle object?" to get the correct position, but the image is still off from the midpoint.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;

public class imageAppend {
     public static void main (String[] args){

            File file = new File("...pdf file location");
            PDDocument doc = null;
            try 
            {
                doc = PDDocument.load(file);
                PDImageXObject pdImage = PDImageXObject.createFromFile("image file location", doc);

                PDPage page = doc.getPage(0);
                PDPageContentStream contentStream = new PDPageContentStream(doc, page, PDPageContentStream.AppendMode.APPEND, true);

               float x_pos = page.getCropBox().getWidth();
               float y_pos = page.getCropBox().getHeight();

                float x_adjusted = ( x_pos - w ) / 2;
                float y_adjusted = ( y_pos - h ) / 2;

                Matrix mt = new Matrix(1f, 0f, 0f, -1f, page.getCropBox().getLowerLeftX(), page.getCropBox().getUpperRightY());
            contentStream.transform(mt);
            contentStream.drawImage(pdImage, x_adjusted, y_adjusted, w, h);

                doc.save("new pdf file location");
                doc.close();

            } catch (IOException e) 
            {
                e.printStackTrace();
            }
        }
}

Answer:

I reproduced your problem with my own test data (unfortunately you did not share yours): the image ends up away from the center of the page.

The fix is simple: I removed the two lines

Matrix mt = new Matrix(1f, 0f, 0f, -1f, page.getCropBox().getLowerLeftX(), page.getCropBox().getUpperRightY());
contentStream.transform(mt);

and the image is now drawn where expected.

For the general case you should also add the coordinates of the lower left corner of the crop box to your x_adjusted and y_adjusted

float x_adjusted = ( x_pos - w ) / 2 + page.getCropBox().getLowerLeftX();
float y_adjusted = ( y_pos - h ) / 2 + page.getCropBox().getLowerLeftY();

(AddImage test method testImageAppendNoMirror)
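As a rough illustration of the arithmetic only (no PDFBox calls), the sketch below centers a w by h image inside a crop box given by its lower-left corner and size. The class name, method name, and the numbers in main are made up for the example; in real code the values would come from page.getCropBox() and the image.

```java
// Hypothetical helper that mirrors the x_adjusted / y_adjusted formulas above.
final class CenterInCropBox {

    /** Returns {x, y}: the lower-left corner at which to draw the image. */
    static float[] center(float llx, float lly, float boxW, float boxH, float w, float h) {
        float x = llx + (boxW - w) / 2f; // horizontal offset plus half the free width
        float y = lly + (boxH - h) / 2f; // vertical offset plus half the free height
        return new float[] { x, y };
    }

    public static void main(String[] args) {
        // Assumed values: crop box starting at (10, 20), 600 x 800 units, image 200 x 100.
        float[] p = center(10f, 20f, 600f, 800f, 200f, 100f);
        System.out.println("draw at x=" + p[0] + " y=" + p[1]);
    }
}
```

The resulting pair is what would be passed as the x and y arguments of drawImage, with no mirroring transform applied.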

Question:

I'm trying to add multiple images to a PDF with PDFBox 2.0.8, but currently only one is added. I have two different images which should be attached to two different AcroForm fields, but only the last one in my list is added.

This is my test function:

@Test
public void attachBulkImageToField(){

    List<ImageData> data = new ArrayList<>();

    data.add(new ImageData(signatureAusstellerField,signatureAussteller.toPath()));
    data.add(new ImageData(signatureDienstleisterField, signatureDienstleister.toPath()));

    ImageToFieldDrawer imgDrawer = new ImageToFieldDrawer(pdf);
    assertTrue(imgDrawer.drawImageToField(data, Paths.get("d:\\imageBulk.pdf")));

}


public boolean drawImageToField(List<ImageData> data, final Path outPath) {
    try {
        for (ImageData element : data) {
            addImageForField(element.getImagePath(), getAcroFieldWithName(element.getFieldName()));
        }
        savePdf(outPath);
        return true;

    } catch (IOException e) {
        e.printStackTrace();
    } catch (PDFSizeException e) {
        e.printStackTrace();
    }
    return false;
}


private void savePdf(Path outPath) throws IOException {
    pdDocument.save(outPath.toFile());
    pdDocument.close();
}

private void addImageForField(Path signature, AcroField targetField) throws IOException {
    PDPage page = pdDocument.getPage(targetField.getPageNr() - 1);
    DrawImage image = new DrawImage(Files.readAllBytes(signature), 0, 0);
    PDImageXObject pdImage = PDImageXObject.createFromFile(signature.toAbsolutePath().toString(), pdDocument);

    try(PDPageContentStream contentStream = new PDPageContentStream(pdDocument, page)){
        contentStream.drawImage(pdImage, targetField.getX(), targetField.getY(), targetField.getWidth(), targetField.getHeight());
    }
}



public class ImageData {

private String fieldName;
private Path imagePath;

public ImageData(String fieldName, Path imagePath) {
    this.fieldName = fieldName;
    this.imagePath = imagePath;
}

public String getFieldName() {
    return fieldName;
}

public void setFieldName(String fieldName) {
    this.fieldName = fieldName;
}

public Path getImagePath() {
    return imagePath;
}

public void setImagePath(Path imagePath) {
    this.imagePath = imagePath;
}

}


Answer:

You create a content stream for the target page using

PDPageContentStream contentStream = new PDPageContentStream(pdDocument, page)

This constructor is documented as

/**
 * Create a new PDPage content stream. This constructor overwrites all existing content streams
 * of this page.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage) throws IOException

So using this constructor you overwrite all existing content streams of that page! In particular you overwrite any previously added instructions for drawing another image...

You should use a different constructor, e.g.

/**
 * Create a new PDPage content stream. If the appendContent parameter is set to
 * {@link AppendMode#APPEND}, you may want to use
 * {@link #PDPageContentStream(PDDocument, PDPage, PDPageContentStream.AppendMode, boolean, boolean)}
 * instead, with the fifth parameter set to true.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress) throws IOException

using AppendMode.APPEND or AppendMode.PREPEND depending on whether the new content should be drawn over or under previously drawn content.

Question:

Good day fellow programmer. I'm new to java, and I need to create an Image from a specific part of a pdf.

Currently, I'm using pdfbox.

Here is my code to create an image from a pdf (it's working but it creates an image of the whole pdf page):

PDDocument document = PDDocument.load(new File(PDFFILE));

            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); ++page)
            {

                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

                ImageIOUtil.writeImage(bim, path + "-" + (page+1) + ".png", 300);
            }
            document.close();

Problem:

I need to create an image based on the pdf in a specific position (like the rectangle).

Is there a way to do this using pdfbox, or is there another library that can do this? I tried searching but I can't find a solution. Thank you.


Answer:

Here is the solution (thanks to mkl for the idea):

    private void PdfToImage(String PDFFILE){
            try{

                PDDocument document = PDDocument.load(new File(PDFFILE));
                PDPage pd;

                PDFRenderer pdfRenderer = new PDFRenderer(document);
                for (int page = 0; page < document.getNumberOfPages(); ++page)
                {


                  pd = document.getPage(page);
                  pd.setCropBox(new PDRectangle(100, 100,100,100));
                  BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
                  ImageIOUtil.writeImage(bim, outputpath + (page+1) + ".png", 300);

                }
                document.close();
            }catch (Exception ex){
                JOptionPane.showMessageDialog(null, ex.getStackTrace());
            }
        }
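A note on the numbers: PDRectangle(100, 100, 100, 100) takes x, y, width and height in PDF units (1/72 inch), so the crop above is a 100 by 100 unit square. The sketch below only shows how such a size translates to pixels at a given DPI. The class and method names are hypothetical, and rounding down to whole pixels is an assumption; the exact rounding PDFBox applies may differ between versions.

```java
// Hypothetical helper: converts a length in PDF units (1/72 inch) to pixels.
final class DpiMath {

    /** Pixels for a length in PDF units at the given DPI, rounded down (assumed), at least 1. */
    static int pointsToPixels(double points, double dpi) {
        double scale = dpi / 72.0; // 72 PDF units per inch
        return (int) Math.max(Math.floor(points * scale), 1);
    }

    public static void main(String[] args) {
        // The 100 x 100 crop box from the answer, rendered at 300 DPI.
        System.out.println(pointsToPixels(100, 300) + " px per side");
    }
}
```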

Question:

I'm using PDFBox's PDPage::convertToImage to display PDF pages in Java. I'm trying to create click-able areas on the PDF page's image based on COSObjects in the page (namely, AcroForm fields). The problem is the PDF seems to use a completely different coordinate system:

System.out.println(field.getDictionary().getItem(COSName.RECT));

yields

COSArray{[COSFloat{149.04}, COSFloat{678.24}, COSInt{252}, COSFloat{697.68}]}

If I were to estimate the actual dimensions of the field's rectangle on the image, it would be 40,40,50,10 (x,y,width,height). There's no obvious correlation between the two and I can't seem to find any information about this with Google.

How can I determine the pixel position of a PDPage's COSObjects?


Answer:

The pdf coordinate system is not that different from the coordinate system used in images. The only differences are:

  • the y-axis points up, not down
  • the scale is most likely different.

You can convert from pdf coordinates to image coordinates using these formulae:

x_image = x_pdf * width_image / width_page
y_image = (height_page - y_pdf) * height_image / height_page

To get the page size, simply use the mediabox size of the page that contains the annotation:

PDRectangle pageBounds = page.getMediaBox();

You may have missed the correlation between the array from the pdf and your image coordinate estimates, since a rectangle in pdf is represented as array [x_left, y_bottom, x_right, y_top].

Fortunately PDFBox provides classes that operate on a higher level than the cos structure. Use this to your advantage and use e.g. PDRectangle you get from the PDAnnotation using getRectangle() instead of accessing the COSArray you extract from the field's dictionary.
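The conversion can be sketched in plain Java as follows. This is only an illustration under assumptions: the page's lower-left corner is at (0, 0), the page is not rotated, and the image shows the whole page. The class and method names are made up, and the page size used in main (612 x 792) is an assumed US Letter page, since the question does not state it.

```java
// Hypothetical helper: maps a PDF rectangle to an image rectangle.
final class PdfToImageCoords {

    /**
     * rect is [x_left, y_bottom, x_right, y_top] in PDF units.
     * Returns {x, y, width, height} in pixels with the origin at the top-left.
     */
    static double[] toImageRect(double[] rect, double pageW, double pageH,
                                double imgW, double imgH) {
        double sx = imgW / pageW;          // horizontal scale
        double sy = imgH / pageH;          // vertical scale
        double x = rect[0] * sx;           // left edge
        double y = (pageH - rect[3]) * sy; // top edge, with the y axis flipped
        double w = (rect[2] - rect[0]) * sx;
        double h = (rect[3] - rect[1]) * sy;
        return new double[] { x, y, w, h };
    }

    public static void main(String[] args) {
        double[] rect = { 149.04, 678.24, 252, 697.68 }; // the /Rect from the question
        double[] r = toImageRect(rect, 612, 792, 1224, 1584); // assumed page and image size
        System.out.printf("x=%.2f y=%.2f w=%.2f h=%.2f%n", r[0], r[1], r[2], r[3]);
    }
}
```

Note that the top edge of the PDF rectangle (y_top) becomes the y of the image rectangle, because image coordinates grow downwards.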

Question:

I have a list of RenderedImage objects, "List<RenderedImage> imagesList", and I want to check whether two objects from that list are the same. I have tried comparing two files, but not a list of RenderedImages. Does anyone have an idea how to compare RenderedImages? Do I have to use a library for this?


Answer:

I'd compare those two objects pixel by pixel. It will probably be slow, but it should work.

Related: Java Compare one BufferedImage to Another
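A minimal sketch of such a pixel-by-pixel comparison, using only the JDK. The class and method names are hypothetical. Both images are first copied into ARGB BufferedImages, so two images with different internal formats but identical colors compare as equal; that is a design choice of this sketch, not a requirement. It also assumes the images start at (0, 0), which holds for BufferedImage.

```java
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.util.Arrays;

// Hypothetical helper for comparing two RenderedImage instances.
final class RenderedImageCompare {

    /** Copies any RenderedImage into an ARGB BufferedImage. */
    static BufferedImage toArgb(RenderedImage src) {
        BufferedImage copy = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = copy.createGraphics();
        g.setComposite(AlphaComposite.Src); // copy pixels as-is, including alpha
        g.drawRenderedImage(src, new AffineTransform());
        g.dispose();
        return copy;
    }

    /** True if both images have the same size and the same ARGB value at every pixel. */
    static boolean samePixels(RenderedImage a, RenderedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        int w = a.getWidth();
        int h = a.getHeight();
        int[] pa = toArgb(a).getRGB(0, 0, w, h, null, 0, w);
        int[] pb = toArgb(b).getRGB(0, 0, w, h, null, 0, w);
        return Arrays.equals(pa, pb);
    }
}
```

For a list, this could be called for each pair, e.g. samePixels(imagesList.get(i), imagesList.get(j)). Checking the dimensions first avoids the expensive copy for images that cannot be equal anyway.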

Question:

Using the code below I am able to create a PDF document with an image, but I would like to place the image on top of a background colour. I tried a bit on my side but was not able to achieve it. Can anyone help me achieve this?

public class SimpleTable {

    public static void main(String args[]) throws Exception {
          //Loading an existing document

          PDDocument doc = new PDDocument();  

          PDPage my_page = new PDPage();
          //Retrieving the page
     doc.addPage(my_page);

          //Creating PDImageXObject object
          PDImageXObject pdImage = PDImageXObject.createFromFile("D:\\QRCode.png",doc);

          //creating the PDPageContentStream object
          PDPageContentStream contents = new PDPageContentStream(doc, my_page);
          //Drawing the image in the PDF document
          contents.drawImage(pdImage, 70, 250);
          contents.setNonStrokingColor(Color.RED);

          System.out.println("Image inserted");

          //Closing the PDPageContentStream object
          contents.close();     

          //Saving the document
          doc.save("D:\\QRCode.pdf");

          //Closing the document
          doc.close();

       }



}

Note: the background colour should cover the same area as the image.


Answer:

Put this before drawing your image and remove the color-setting call that follows drawImage in your current code:

contents.setNonStrokingColor(Color.RED);
contents.addRect(70, 250, pdImage.getWidth(), pdImage.getHeight());
contents.fill();

Be aware that a background color will make it more difficult to scan the QR code because there'll be less contrast.

Question:

I am using the very useful PDFBox to build a simple pdf stamping GUI.

I noticed a serious issue with a particular document however.

When I specify a particular scale factor for the rendering, the size of the output image differs from what I expect.

What is worse, the scaling factor of the resulting image along the horizontal axis differs from that along the vertical axis.

Here is the code I used:

/**
 * @param pdfPath The path to the pdf document
 * @param page The pdf page number(is zero based)
 */
public BufferedImage loadPdfImage(String pdfPath, int page) {
    File file = new File(pdfPath);

    try (PDDocument doc = PDDocument.load(file)) {

        pageCount = doc.getNumberOfPages();
        PDPage pDPage = doc.getPage(page);

       float w = pDPage.getCropBox().getWidth();
       float h = pDPage.getCropBox().getHeight();

       System.out.println("Pdf opening: width: "+w+", height: "+h);


        PDFRenderer renderer = new PDFRenderer(doc);

        float dpiRatio =  1.5f;

        BufferedImage img = renderer.renderImage(page, dpiRatio);

 float dpiXRatio = img.getWidth() / w;
 float dpiYRatio = img.getHeight()/ h;


       System.out.println("dpiXRatio: "+dpiXRatio+", dpiYRatio: "+dpiYRatio);

        return img;
    } catch (IOException ex) {
        System.out.println( "invalid pdf found. Please check");
    }

    return null;
}

The code above loads most pdf documents that I have tried it on and converts given pages within them to BufferedImage objects.

For the said document however, it seems to be unable to render the converted image at the supplied scale-factor.

Is there anything wrong with my code? or is it a known bug?

Thanks.

EDIT

I am using PDFBOX v2.0.15

And the page has no rotation.


Answer:

The error was mine; for the most part.

I had used the MediaBox to compute the scale factors and unfortunately the MediaBox and CropBox of the pdf file in question were not the same.

For example:

cropbox-rect: [8.50394,34.0157,586.496,807.984]
mediabox-rect: [0.0,0.0,595.0,842.0]

After making corrections for these, the scale-factors matched better along both axes, save for the errors due to the fact that the image sizes are integer numbers.

This is negligible enough for me to neglect, though.

When stamping, all I had to do was to make the necessary corrections for the cropbox. For example to draw the image(stamp) at P(x,y), I would do:

        x += cropBox.getLowerLeftX();
        y += cropBox.getLowerLeftY();

before calling the draw image functionality.

It all came out fine!
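For a stamping GUI the correction can be sketched as below: a click position on the rendered image is mapped back to PDF coordinates, including the crop box offset. This is only an illustration, assuming the image was rendered from the crop box at a uniform scale and the page is not rotated. The class and method names are hypothetical; the crop box values in main are the ones quoted above, and the image height of 1160 pixels is an assumed value for a 1.5 scale.

```java
// Hypothetical helper: maps image pixels back to PDF user space.
final class StampCoords {

    /**
     * px, py: pixel position on the rendered image (origin top-left).
     * imgH: image height in pixels; scale: the factor passed to the renderer.
     * cropLlx, cropLly: lower-left corner of the crop box.
     * Returns {x, y} in PDF user space.
     */
    static double[] imageToPdf(double px, double py, double imgH, double scale,
                               double cropLlx, double cropLly) {
        double x = px / scale + cropLlx;          // undo scale, then add crop box offset
        double y = (imgH - py) / scale + cropLly; // flip the y axis first
        return new double[] { x, y };
    }

    public static void main(String[] args) {
        // Crop box [8.50394, 34.0157, 586.496, 807.984] rendered at scale 1.5.
        double[] p = imageToPdf(150, 1010, 1160, 1.5, 8.50394, 34.0157);
        System.out.println("x=" + p[0] + " y=" + p[1]);
    }
}
```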

Question:

I'm having trouble with PDFBox. I have a blank page in PDF and I want to insert images into it. Because I also work with signed PDFs, all changes have to be saved as "saveIncremental".

When I insert only one image everything is fine (image has been inserted). When I try to insert another image in this PDF, it has not been inserted and when opened in Adobe Acrobat Reader it says "An error exists on this page. Adobe may not display the page correctly ...".

Weird thing - when PDF is not only blank page but e.g. blank page with image, everything is fine (first and also second image has been inserted correctly with saveIncremental).

Code of inserting and saving image:

PDImageXObject pdImage = PDImageXObject.createFromFile(tmpSig.getFileName(), doc);
PDPageContentStream contentStream = new PDPageContentStream(doc, tmpPage, PDPageContentStream.AppendMode.APPEND, true, true);
contentStream.drawImage(pdImage, finalX, (finalPageHeight - finalY - finalHeight), finalWidth, finalHeight);
contentStream.close();

// update before save
tmpPage.getCOSObject().setNeedToBeUpdated(true);
tmpPage.getResources().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);

// save
doc.saveIncremental(new FileOutputStream(pdfFile));

All files available here

Using PDFBox version 2.0.7 but I also tried the newest (2.0.15) but it didn't help.

Thanks for all ideas!


EDIT: I tried to update XObject and Resources as this (added this code under comment "update before save"):

pdImage.getCOSObject().setNeedToBeUpdated(true);
PDResources pdResources = tmpPage.getResources();
for (COSName name : pdResources.getXObjectNames()) {
    pdResources.getXObject(name).getCOSObject().setNeedToBeUpdated(true);
}

Problem still remains, nothing changed...


Answer:

In addition to the dictionaries you already marked as updated

tmpPage.getCOSObject().setNeedToBeUpdated(true);
tmpPage.getResources().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);

please also mark the XObject entry in the resources dictionary as updated:

tmpPage.getResources().getCOSObject().getCOSDictionary(COSName.XOBJECT).setNeedToBeUpdated(true);

You wonder why you didn't need to do so when adding the first image?

In the original PDF there is no XObject entry in the resources dictionary yet. Thus, it's generated anew and, therefore, implicitly marked updated.

You wonder why you didn't need to do so when adding to the file which already had images?

In that other file the XObject entry in the resources dictionary is a direct object, i.e. it is immediately contained in the resources dictionary.

4 0 obj
<<
  /Type /Page
  /Resources <<
    /ProcSets [/PDF /Text /ImageB /ImageC /ImageI]
    /ExtGState <</G3 5 0 R /gs2 6 0 R /gs3 7 0 R>>
    /XObject <</Im1 8 0 R /Im2 9 0 R>>
  >>
  /MediaBox [0 0 611.03998 864.95996]
  /Contents [10 0 R 11 0 R 12 0 R 13 0 R 14 0 R]
  /StructParents 0
  /Parent 2 0 R
>> 
endobj

Thus, whenever a new copy of the resources dictionary is written, implicitly a new copy of the XObject entry is written, too.

In the file in which PDFBox created the XObject entry in the resources dictionary, though, PDFBox created it as an indirect object, i.e. in the resources dictionary XObject only maps to a reference to an object number and in the object with that number the actual entry dictionary can be found.

2 0 obj
<<
  /Type /Page
  /Resources <<
    /ProcSets [/PDF /Text /ImageB /ImageC /ImageI]
    /ExtGState <</G3 3 0 R>>
    /XObject 7 0 R
  >>
  /MediaBox [0 0 611.03998 864.95996]
  /Contents [8 0 R 4 0 R 9 0 R]
  /StructParents 0
  /Parent 5 0 R
>>
endobj
7 0 obj
<<
  /Im1 10 0 R
>> 
endobj

So when a new copy of the resources dictionary is written, no implicit new copy of the XObject entry dictionary is written in this case.


As an aside, your current approach won't help you with your task

Because I also work with signed PDFs, all changes have to be saved as "saveIncremental".

Adding images to the page content is not an allowed change to a signed PDF, so Adobe Reader will still indicate your signature is invalid. For a summary of the allowed and disallowed changes after signing, have a look at this answer and documents referenced from it.

You should instead try adding images in annotations.

Question:

I'm trying to add an image of the user's choosing to my PDF generated through PDFBox in NetBeans. If I give the path directly it works, but getting the URL of the image and building the path from that doesn't work.

See the code below. The problem is with the URL and Path, because the input isn't being read.

 public static ByteArrayOutputStream PDFGenerator(........,Path imagespath)
  {
    ........
    if (finalpdf.Images != null)
    {
      Path imagepath = Paths.get(imagespath.toString(), "room.png");
      PDImageXObject Addedimage = PDImageXObject.createFromFile(imagepath.toString(), pdf);
      AddImages(content, Addedimage, 229.14f, 9.36f);
    }

    //AddImages method is following
  public static void AddImages(PDPageContentStream content, PDImageXObject image, float x, float y) throws IOException
  {

    content.drawImage(image, x, y);

  }
}

  //Following is snippet from my test method
  public void testClass()
  {
    ........
    finalpdf.Images = "room.png";
    URL imageurl = testclass.class.getResource("room.png");
    Path imagepath = Paths.get(imageurl.getPath().substring(1));
    ByteArrayOutputStream baos = PDFGenerator.generatefurtherpdf(finalpdf, "0000.00", "00.00", imagepath);

    writePDF(baos, "YourPdf.pdf");

  }

I expected it to work this way, but I'm sure there is some problem with the Path; I'm not using it correctly. I hope the code is explanatory enough, as I'm quite new. Also, for security reasons I can't post the whole code. Sorry for any mistakes.


Answer:

For resources (which are not necessarily files) there is a more general class: Path.

Path path = Paths.get(imageurl.toURI());

However, such a path (for instance with a URL "jar:file//... .jar!... ... .png") may not refer to a regular file, so using it as a File, as path.toString() suggests, can fail. In that case one can use an InputStream.

The second general class is InputStream, which is more low-level:

InputStream in = testclass.class.getResourceAsStream("room.png");

This is a shortcut for getResource(...).openStream(), except that it returns null when the resource path is incorrect, which usually shows up later as a NullPointerException.

The last resort is to use the actual byte[] with createFromByteArray:

byte[] bytes = Files.readAllBytes(path);
PDImageXObject Addedimage = PDImageXObject.createFromByteArray(doc, bytes, name);

Or use a temporary file (createTempFile already creates the file, so the copy has to replace it):

  Path imagepath2 = Files.createTempFile("room", ".png");
  Files.copy(imagepath, imagepath2, StandardCopyOption.REPLACE_EXISTING);
  PDImageXObject Addedimage = PDImageXObject.createFromFile(imagepath2.toString(), pdf);
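The InputStream route can be sketched with the JDK alone, as below. The class and method names are hypothetical. The sketch fails early with a clear message when the resource is missing, and it requires Java 9 or later for readAllBytes. The resulting byte[] is what would then be handed to PDImageXObject.createFromByteArray.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: reads a classpath resource fully into memory.
final class ResourceBytes {

    /** Reads the resource relative to the given class; works inside JAR files too. */
    static byte[] load(Class<?> anchor, String name) {
        try (InputStream in = anchor.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null for an unknown resource
                throw new IllegalArgumentException("Resource not found: " + name);
            }
            return in.readAllBytes(); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the test from the question this might be called as ResourceBytes.load(testclass.class, "room.png"), which avoids building a Path from the URL altogether.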

Question:

I have been using PDFBox to split pdf files into images for a while now, but after updating to 2.0.19 I have started running into unexpected exceptions.

This is the stack trace of the exception:

java.lang.ArrayIndexOutOfBoundsException: 3
    at java.awt.color.ICC_ColorSpace.toRGB(ICC_ColorSpace.java:191)
    at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.toRGB(PDICCBased.java:350)
    at org.apache.pdfbox.rendering.PageDrawer.getPaint(PageDrawer.java:335)
    at org.apache.pdfbox.rendering.PageDrawer.getNonStrokingPaint(PageDrawer.java:708)
    at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:808)
    at org.apache.pdfbox.contentstream.operator.graphics.FillEvenOddRule.process(FillEvenOddRule.java:37)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:203)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:190)

Here is the code that I have been using to split the file:

try (PDDocument document = PDDocument.load(new File("updated_test.pdf"))) {
    PDPageTree pdPages = document.getDocumentCatalog().getPages();
    PDFRenderer pdfRenderer = new PDFRenderer(document);

    int page = 0;
    for (PDPage pdPage : pdPages) {
        String fileName = "demo" + page + ".png";

        File tempImg = new File(fileName);

        BufferedImage bim = pdfRenderer.renderImage(page);
        ImageIOUtil.writeImage(bim, tempImg.getAbsolutePath(), 150);

        page++;
    }
} catch (Exception e) {
    e.printStackTrace();
}

And here is the actual file that causes the issue: https://stackoverflowuploads.s3-us-west-2.amazonaws.com/updated_test.pdf

All help, ideas and advice would be greatly appreciated, if you have ideas about other solutions/libraries that can achieve the same results those would be very useful as well. Thank you!


Answer:

This has been fixed in PDFBOX-4801 and a snapshot build is available here at the bottom.

It will be in 2.0.20, which is likely to be released in summer (hopefully).

The cause is an incorrect /N value (3) in the dictionary of a CMYK ICC profile. The correct value should have been 4. This results in the mentioned exception later. The corrected code checks the ICC profile and corrects the value of the PDICCBased object.
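To illustrate the kind of mismatch described (this is not the actual PDFBox fix, just a JDK-only sketch), the declared /N value can be compared with the number of components the ICC profile itself reports. The class and method names are hypothetical, and the built-in sRGB profile is used as a stand-in because the example has no CMYK profile at hand.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;

// Hypothetical check: does the declared /N agree with the embedded ICC profile?
final class IccComponentCheck {

    /** True if the profile's own component count equals the declared /N value. */
    static boolean matchesDeclared(ICC_Profile profile, int declaredN) {
        return profile.getNumComponents() == declaredN;
    }

    public static void main(String[] args) {
        // Stand-in profile: sRGB has 3 components, a CMYK profile would have 4.
        ICC_Profile srgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
        System.out.println("declared 3: " + matchesDeclared(srgb, 3));
        System.out.println("declared 4: " + matchesDeclared(srgb, 4));
    }
}
```

When the two disagree, color arrays sized from /N no longer fit the profile, which is consistent with the ArrayIndexOutOfBoundsException in the stack trace.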

Question:

I get an error when I read a page from a PDF document. The page contains a bar code that is drawn with a font (AAAAAC+Code3de9). The error appears only when I use the renderImage function. I use version 2.0.17 of pdfbox-app.

déc. 02, 2019 9:34:13 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
AVERTISSEMENT: Could not read embedded OTF for font AAAAAC+Code3de9
java.io.IOException: Illegal seek position: 2483278652
at org.apache.fontbox.ttf.MemoryTTFDataStream.seek(MemoryTTFDataStream.java:164)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:352)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:112)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:65)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:872)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:506)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:268)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:203)
at patrick.mart1.impose.ImposeKosmedias$1.run(ImposeKosmedias.java:370)
déc. 02, 2019 9:34:13 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
AVERTISSEMENT: Using fallback font LiberationSans for CID-keyed TrueType font AAAAAC+Code3de9

Many thanks for your help


Answer:

This is based on the RemoveAllText.java example from the source code download. It removes the selection of F2 in the content stream, and also removes the font in the resources. It makes the assumption that F2 is not really used, i.e. that there is no text related to F2. Compared to the official example, only "createTokensWithoutText" has been changed. I kept all the names even if the meaning is different, except for the class name.

So this code is really just for this file, or for files generated similarly.

public final class RemoveFontF2
{
    /**
     * Default constructor.
     */
    private RemoveFontF2()
    {
        // example class should not be instantiated
    }

    /**
     * This will remove all text from a PDF document.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 2)
        {
            usage();
        }
        else
        {
            PDDocument document = PDDocument.load(new File(args[0]));
            if (document.isEncrypted())
            {
                System.err.println(
                        "Error: Encrypted documents are not supported for this example.");
                System.exit(1);
            }
            for (PDPage page : document.getPages())
            {
                List<Object> newTokens = createTokensWithoutText(page);
                PDStream newContents = new PDStream(document);
                writeTokensToStream(newContents, newTokens);
                page.setContents(newContents);
                processResources(page.getResources());
            }
            document.save(args[1]);
            document.close();
        }
    }

    private static void processResources(PDResources resources) throws IOException
    {
        for (COSName name : resources.getXObjectNames())
        {
            PDXObject xobject = resources.getXObject(name);
            if (xobject instanceof PDFormXObject)
            {
                PDFormXObject formXObject = (PDFormXObject) xobject;
                writeTokensToStream(formXObject.getContentStream(),
                        createTokensWithoutText(formXObject));
                processResources(formXObject.getResources());
            }
        }
        for (COSName name : resources.getPatternNames())
        {
            PDAbstractPattern pattern = resources.getPattern(name);
            if (pattern instanceof PDTilingPattern)
            {
                PDTilingPattern tilingPattern = (PDTilingPattern) pattern;
                writeTokensToStream(tilingPattern.getContentStream(),
                        createTokensWithoutText(tilingPattern));
                processResources(tilingPattern.getResources());
            }
        }
    }

    private static void writeTokensToStream(PDStream newContents, List<Object> newTokens) throws IOException
    {
        OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter writer = new ContentStreamWriter(out);
        writer.writeTokens(newTokens);
        out.close();
    }

    private static List<Object> createTokensWithoutText(PDContentStream contentStream) throws IOException
    {
        PDFStreamParser parser = new PDFStreamParser(contentStream);
        Object token = parser.parseNextToken();
        List<Object> newTokens = new ArrayList<Object>();
        while (token != null)
        {
            if (token instanceof Operator)
            {
                Operator op = (Operator) token;
                String opName = op.getName();
                if (OperatorName.SET_FONT_AND_SIZE.equals(opName) &&
                    newTokens.get(newTokens.size() - 2).equals(COSName.getPDFName("F2")))
                {
                    // remove the 2 arguments to this operator
                    newTokens.remove(newTokens.size() - 1);
                    newTokens.remove(newTokens.size() - 1);

                    token = parser.parseNextToken();
                    continue;
                }
            }
            newTokens.add(token);
            token = parser.parseNextToken();
        }
        // remove F2
        COSBase fontBase = contentStream.getResources().getCOSObject().getItem(COSName.FONT);
        if (fontBase instanceof COSDictionary)
        {
            ((COSDictionary) fontBase).removeItem(COSName.getPDFName("F2"));
        }
        return newTokens;
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println("Usage: java " + RemoveFontF2.class.getName() + " <input-pdf> <output-pdf>");
    }

}

Question:

I created a PDF using PDFBOX. The entire PDF generates perfectly and even the images loaded while i was using

PDImageXObject ptabelle = PDImageXObject.createFromFile("src/main/resources/pdf/ptabelle.png", pdDocument);

But the project will need to go live sometime so I have to replace the static path with a class loader. After doing all that the PDF generates, the text is displayed, but not the image.

The interesting thing is that inside the PDF the "box" where the image should be is there, but not the image.

Here is the code for the stream generation.

ClassLoader classLoader = getClass().getClassLoader();
PDStream pdStream = new PDStream(pdDocument, classLoader.getResourceAsStream("pdf/ptabelle.png"));
PDResources pdResources = new PDResources();
PDImageXObject ptabelle = new PDImageXObject(pdStream, pdResources);

PDPageContentStream pdPageContentStream = new PDPageContentStream(pdDocument, page4);

And here is the call in the code, the length + width variables are defined in the code.

 pdPageContentStream.drawImage(ptabelle, TEXT_BEGIN, currentYCoord, 172, 107);

Answer:

Instead of new PDImageXObject(pdStream, pdResources), which is for PDFBox internal use and not meant for PDF generation, please use the appropriate LosslessFactory method. Your code would then look like this:

BufferedImage bim = ImageIO.read(classLoader.getResourceAsStream("pdf/ptabelle.png"));
PDImageXObject img = LosslessFactory.createFromImage(pdDocument, bim);

See also the javadoc of PDImageXObject.createFromFileByExtension, which explains what factory methods can be called instead.
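
A minimal sketch of the loading step with guards added, since `getResourceAsStream` returns `null` when the resource is not on the classpath and `ImageIO.read` returns `null` when no reader can decode the data. The class and method names (`ClasspathImages`, `decode`, `load`) are hypothetical and only JDK classes are used; the resulting `BufferedImage` is meant to be handed to `LosslessFactory.createFromImage(pdDocument, bim)` as in the snippet above.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class ClasspathImages {
    private ClasspathImages() {
    }

    /** Decodes an image from a stream, failing loudly on a null stream or undecodable data. */
    static BufferedImage decode(InputStream in, String description) throws IOException {
        if (in == null) {
            throw new IOException("Resource not found: " + description);
        }
        // ImageIO.read does not close the stream, so close it here.
        try (InputStream stream = in) {
            BufferedImage image = ImageIO.read(stream);
            if (image == null) {
                throw new IOException("No ImageIO reader could decode: " + description);
            }
            return image;
        }
    }

    /** Looks the resource up on the classpath, e.g. load(getClass().getClassLoader(), "pdf/ptabelle.png"). */
    static BufferedImage load(ClassLoader classLoader, String resourcePath) throws IOException {
        return decode(classLoader.getResourceAsStream(resourcePath), resourcePath);
    }
}
```

With this in place, a wrong resource path surfaces as an exception at load time instead of an empty box in the finished PDF.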

Question:

We occasionally encounter some extremely large PDFs filled with full page, high resolution images (the result of document scanning). For example, I have a 1.7GB PDF with 3500+ images. Loading the document takes about 50s but counting the images takes about 15 minutes.

I'm sure this is because the image bytes are read as a part of the API calls. Is there a way to extract the image count without actually reading the image bytes?

PDFBox version: 2.0.2

Example Code:

@Test
public void imageCountIsCorrect() throws Exception {
    PDDocument pdf = readPdf();
    try {
        assertEquals(3558, countImages(pdf));
        // assertEquals(3558, countImagesWithExtractor(pdf));
    } finally {
        if (pdf != null) {
            pdf.close();
        }
    }
}

protected PDDocument readPdf() throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    FileInputStream stream = new FileInputStream("large.pdf");
    PDDocument pdf;
    try {
        pdf = PDDocument.load(stream, MemoryUsageSetting.setupMixed(1024 * 1024 * 250));
    } finally {
        stream.close();
    }

    stopWatch.stop();
    log.info("PDF loaded: time={}s", stopWatch.getTime() / 1000);
    return pdf;
}


protected int countImages(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    int imageCount = 0;
    for (PDPage pdPage : pdf.getPages()) {
        PDResources pdResources = pdPage.getResources();
        for (COSName cosName : pdResources.getXObjectNames()) {
            PDXObject xobject = pdResources.getXObject(cosName);
            if (xobject instanceof PDImageXObject) {
                imageCount++;
                if (imageCount % 100 == 0) {
                    log.info("Found image: #" + imageCount);
                }
            }
        }
    }

    stopWatch.stop();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}

If I change the countImages method to rely on the COSName, the count completes in less than 1s, but I'm a little uncertain about relying on the prefix of the name. The prefix appears to be a byproduct of the PDF encoder and not PDFBox (I couldn't find any reference to it in their code):

if (cosName.getName().startsWith("QuickPDFIm")) {
    imageCount++;
}

Answer:

So the previous approach had some additional flaws (could miss inline images, etc.). Thanks mkl and Tilman Hausherr for the feedback!

TIL - PDF content streams contain useful operators!

My new approach extends PDFStreamEngine and increments an imageCount for every 'Do' (draw object) operator in the page content stream whose operand is an image XObject. The image count only takes a few hundred milliseconds with this method:

public class PdfImageCounter extends PDFStreamEngine {
    protected int documentImageCount = 0;

    public int getDocumentImageCount() {
        return documentImageCount;
    }

    public PdfImageCounter() {
        addOperator(new OperatorProcessor() {
            @Override
            public void process(Operator operator, List<COSBase> arguments) throws IOException {
                if (arguments.size() < 1) {
                    throw new MissingOperandException(operator, arguments);
                }
                if (isImage(arguments.get(0))) {
                    documentImageCount++;
                }
            }

            protected boolean isImage(COSBase base) {
                return (base instanceof COSName) &&
                        context.getResources().isImageXObject((COSName)base);
            }

            @Override
            public String getName() {
                return "Do";
            }
        });
    }
}

Invoke it for each page:

protected int countImagesWithProcessor(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();

    PdfImageCounter counter = new PdfImageCounter();
    for (PDPage pdPage : pdf.getPages()) {
        counter.processPage(pdPage);
    }

    stopWatch.stop();
    int imageCount = counter.getDocumentImageCount();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}
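
To make the idea concrete without PDFBox, here is a toy sketch of what the operator processor does: walk the tokens of a content stream and count each `Do` whose operand names an image XObject. The whitespace tokenizer, the `DoOperatorCounter` class, and the hard-coded set of image names are illustrative assumptions; a real content stream needs a proper parser such as the `PDFStreamEngine` used above, and this sketch ignores strings, comments, and inline images.

```java
import java.util.Collections;
import java.util.Set;

// Toy illustration only; not a PDF parser.
final class DoOperatorCounter {
    private DoOperatorCounter() {
    }

    /**
     * Counts "Do" operators whose preceding operand is a name listed in imageNames.
     * Operands come before the operator in a content stream, e.g. "/Im1 Do".
     */
    static int countImageDraws(String contentStream, Set<String> imageNames) {
        String[] tokens = contentStream.trim().split("\\s+");
        int count = 0;
        for (int i = 1; i < tokens.length; i++) {
            if (!"Do".equals(tokens[i])) {
                continue;
            }
            String operand = tokens[i - 1];
            // A name operand starts with '/'; strip it before the lookup.
            if (operand.startsWith("/") && imageNames.contains(operand.substring(1))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String stream = "q 200 0 0 100 50 600 cm /Im1 Do Q q /Fm1 Do Q q 1 0 0 1 0 0 cm /Im1 Do Q";
        // Im1 is treated as an image and Fm1 as a form XObject, so only the two /Im1 draws count.
        System.out.println(countImageDraws(stream, Collections.singleton("Im1"))); // prints 2
    }
}
```

Note that this counts draw operations rather than distinct images: an image drawn twice is counted twice, which is also how the PDFBox-based counter above behaves.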

Question:

I am using PDFBox to convert PDF documents into images. But on some pages, a small amount of content is not rendered consistently. It seems that a small area is shifted slightly (1-3 pixels). Here is the code:

document = PDDocument.load(pdfFile);
list = document.getDocumentCatalog().getAllPages();

for (int i = 0; i < list.size(); i++) {
    PDPage temp = list.get(i);
    BufferedImage image = temp.convertToImage(BufferedImage.TYPE_INT_RGB, 150);
    File outputfile = new File(fileName + "_" + (i + 1) + ".png");
    ImageIOUtil.writeImage(image, outputfile.getAbsolutePath(), 150);
    //ImageIO.write(image, "jpg", outputfile);
}

Can anyone advise on how to solve this?

I got the difference by comparing the created image with the expected image pixel by pixel. There are other contents on the page but only a small part is shifting.
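
A pixel-by-pixel comparison of this kind can be sketched with plain `BufferedImage` calls. The `PixelDiff` class and its members are hypothetical names, and the sketch assumes both images have the same dimensions; it reports how many pixels differ and the bounding box of the differences.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for comparing two renderings of the same page.
final class PixelDiff {
    /** Number of differing pixels and their bounding box (null when the images match). */
    static final class Result {
        final int differingPixels;
        final Rectangle bounds;

        Result(int differingPixels, Rectangle bounds) {
            this.differingPixels = differingPixels;
            this.bounds = bounds;
        }
    }

    private PixelDiff() {
    }

    static Result compare(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth() || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        int count = 0;
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                // getRGB returns the pixel as packed ARGB in the default sRGB color model.
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    count++;
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        Rectangle bounds = count == 0
                ? null
                : new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
        return new Result(count, bounds);
    }
}
```

The bounding box makes it easy to see whether the differences are confined to one small region of the page.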

The created, diff, and expected images are here: https://www.dropbox.com/sh/mii7lo3dsvi0kmx/AABNWZ7lbdgHkSQw4RTm1IDoa?dl=0

Please advise on how to solve it! Thank you!


Answer:

Comparing the images

The critical region for this issue contains two formulas. These regions on the OP's images look like this:

expected formula appearance:

and

created formula appearance

The difference is that the font sizes differ slightly: the glyphs in the expected image are larger than those in the created image.

Comparing the original PDFs in Adobe Acrobat

After the OP provided the source PDFs for those images, an analogous image export could be made using Adobe Acrobat. The result:

expected formula appearance, rendered by Acrobat

and

created formula appearance, rendered by Acrobat

Here, too, one can see the difference...

The letters and numbers differ substantially from those above because the fonts in question are not embedded, and Acrobat substitutes a different font program than PDFBox does.

But one font is embedded: the symbol font with the mathematical signs. So these symbols can be compared more strictly. Furthermore, the fraction line is drawn using vector graphics, so it is also independent of font substitution. Comparing the placement of these parts in relation to each other, one sees that Acrobat and PDFBox render them identically for the same source (with identical differences between the two source PDFs).

Comparing the original PDFs internally

Indeed, the source PDFs differ. The formulas are drawn using form XObjects. The page content streams of the two documents are otherwise very similar (some minor changes, but identical font sizes), whereas the font sizes inside the form XObjects differ.

So it looks like a post-processor applied to the PDF has some bugs when processing form XObjects.