Hot questions for Using PDFBox in document

Question:

With PDFBox 1.8.x, merging documents with mergePdf.mergeDocuments() works fine. In PDFBox 2.0.0 the method takes an argument: org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(MemoryUsageSetting arg0). What is MemoryUsageSetting, and how do I use it with mergeDocuments? The documentation only says "Merge the list of source documents, saving the result in the destination file." Please provide equivalent code for version 2.0.0.

public void combine()
{
    try
    {
        PDFMergerUtility mergePdf = new PDFMergerUtility();
        String folder = "pdf";
        File _folder = new File(folder);
        File[] filesInFolder;
        filesInFolder = _folder.listFiles();
        for (File string : filesInFolder)
        {
            mergePdf.addSource(string);
        }
        mergePdf.setDestinationFileName("Combined.pdf");
        mergePdf.mergeDocuments();
    }
    catch(Exception e)
    {

    }
}

Answer:

According to the javadoc, MemoryUsageSetting controls how memory/temporary files are used for buffering.

The two easiest usages are:

MemoryUsageSetting.setupMainMemoryOnly()

This sets buffering to use only main memory (no temporary file), with no size restriction.

MemoryUsageSetting.setupTempFileOnly()

This sets buffering to use only temporary file(s) (no main memory), with no size restriction.

So for you, the call would be

mergePdf.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

or

mergePdf.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());

Or just pass null, which defaults to main memory only. The javadoc says the same: memUsageSetting defines how memory is used for buffering PDF streams; in case of null, unrestricted main memory is used.

Question:

I have made a PDF form and I'm trying to use PDFBox to fill in the form and print the document. I got it working great for one-page print jobs, but I had to modify it for multiple pages. Basically it's a form with basic info up top and a list of contents. If the contents are larger than what the form has room for, I have to make it a multiple-page document. I end up with a document with a nice page one, and then all the remaining pages are the blank template. What am I doing wrong?

PDDocument finalDoc = new PDDocument();
File template = new File("path/to/template.pdf");

//Declare basic info to be put on every page
String name = "John Smith";
String phoneNum = "555-555-5555";
//Get list of contents for each page
List<List<Map<String, String>>> pageContents = methodThatReturnsMyInfo();

for (List<Map<String, String>> content : pageContents) {
    PDDocument doc = PDDocument.load(template);
    PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();

    acroForm.getField("name").setValue(name);
    acroForm.getField("phoneNum").setValue(phoneNum);

    for (int i=0; i<content.size(); i++) {
        acroForm.getField("qty"+i).setValue(content.get(i).get("qty"));
        acroForm.getField("desc"+i).setValue(content.get(i).get("desc"));
    }

    List<PDPage> pages = docCatalog.getAllPages();
    finalDoc.addPage(pages.get(0));
}

//Then prints/saves finalDoc

Answer:

There are two major issues in your code:

  • The AcroForm element of a PDF is a document level object. You only copy the filled-in template page into finalDoc. Thus, the form fields are added to finalDoc only as annotations of their respective page but they are not added to the AcroForm of finalDoc.

    This is not apparent in Adobe Reader but form filling services often identify available fields from the document level AcroForm entry and don't search the pages for additional form fields.

  • The actual show stopper: You add fields with identical names to the PDF. But PDF forms are document-wide entities, i.e. there can be only a single field entity with a given name in a PDF. (This field entity may have multiple visualizations, aka widgets, but this requires you to construct a single field object with multiple kid widgets. Furthermore, these widgets are expected to display the same value, which is not what you want...)

    Thus, you have to rename the fields uniquely before adding them to the finalDoc.

Here is a simplified example which works on a template with only one field, "SampleField":

byte[] template = generateSimpleTemplate();
Files.write(new File(RESULT_FOLDER,  "template.pdf").toPath(), template);

try (   PDDocument finalDoc = new PDDocument(); )
{
    List<PDField> fields = new ArrayList<PDField>();
    int i = 0;

    for (String value : new String[]{"eins", "zwei"})
    {
        PDDocument doc = PDDocument.load(new ByteArrayInputStream(template));
        PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        PDField field = acroForm.getField("SampleField");
        field.setValue(value);
        field.setPartialName("SampleField" + i++);
        List<PDPage> pages = docCatalog.getAllPages();
        finalDoc.addPage(pages.get(0));
        fields.add(field);
    }

    PDAcroForm finalForm = new PDAcroForm(finalDoc);
    finalDoc.getDocumentCatalog().setAcroForm(finalForm);
    finalForm.setFields(fields);

    finalDoc.save(new File(RESULT_FOLDER, "form-two-templates.pdf"));
}

As you can see, all fields are renamed before they are added to finalForm:

field.setPartialName("SampleField" + i++);

and they are collected in the list fields which finally is added to the finalForm AcroForm:

    fields.add(field);
}
...
finalForm.setFields(fields);

Question:

I have an input stream of a PDF document available to me. I would like to add subject metadata to the document and then save it. I'm not sure how to do this.

I came across a sample recipe here: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html

However, it is still fuzzy. Below is what I'm trying and places where I have questions

PDDocument doc = PDDocument.load(myInputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
InputStream newXMPData = ...; //what goes here? How can I add subject tag?
PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false );
catalog.setMetadata( newMetadata );
//does anything else need to happen to save the document??
//I would like an outputstream of the document (with metadata) so that I can save it to an S3 bucket

Answer:

The following code sets the title of a PDF document, but it should be adaptable to work with other properties as well:

public static byte[] insertTitlePdf(byte[] documentBytes, String title) {
    // try-with-resources closes the document even if saving fails
    try (PDDocument document = PDDocument.load(documentBytes)) {
        PDDocumentInformation info = document.getDocumentInformation();
        info.setTitle(title);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        document.save(baos);
        return baos.toByteArray();
    } catch (IOException e) {
        e.printStackTrace();
    }

    return null;
}

Apache PDFBox is needed, so add it as a dependency, e.g. with Maven:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.6</version>
</dependency>

Add a title with:

byte[] documentBytesWithTitle = insertTitlePdf(documentBytes, "Some fancy title");

Display it in the browser with (JSF example):

<object class="pdf" data="data:application/pdf;base64,#{myBean.getDocumentBytesWithTitleAsBase64()}" type="application/pdf">Document could not be loaded</object>

Question:

First time poster, bear with me...

I have two questions. First, I want to know how to add an image to a PDFBox 2.0 document using a BufferedImage. The question has been asked here: Add BufferedImage to PDFBox document

PDFBox has since excluded the PDJpeg class and the xobject section as a whole.

Second, if someone has already asked this question and it has been answered, but the answer is deprecated; what's the best way to update/the best way to connect these two questions? (I don't have any points so I can't comment).


Answer:

"PDFBox has since excluded the PDJpeg class and the xobject section as a whole."

There indeed has been a lot of refactoring (and re-refactoring and re-re-refactoring etc) during the development of version 2, and this refactoring often goes beyond mere package changes. And quite often it is not obvious where some functionality is now.

But a basic functionality like adding a BufferedImage to a document can be counted on not being lost.

There now is the JPEGFactory which provides methods to create image XObjects from a BufferedImage, in particular:

/**
 * Creates a new JPEG Image XObject from a Buffered Image.
 * @param document the document where the image will be created
 * @param image the buffered image to embed
 * @return a new Image XObject
 * @throws IOException if the JPEG data cannot be written
 */
public static PDImageXObject createFromImage(PDDocument document, BufferedImage image)

/**
 * Creates a new JPEG Image XObject from a Buffered Image and a given quality.
 * The image will be created at 72 DPI.
 * @param document the document where the image will be created
 * @param image the buffered image to embed
 * @param quality the desired JPEG compression quality
 * @return a new Image XObject
 * @throws IOException if the JPEG data cannot be written
 */
public static PDImageXObject createFromImage(PDDocument document, BufferedImage image, float quality)

/**
 * Creates a new JPEG Image XObject from a Buffered Image, a given quality and DPI.
 * @param document the document where the image will be created
 * @param image the buffered image to embed
 * @param quality the desired JPEG compression quality
 * @param dpi the desired DPI (resolution) of the JPEG
 * @return a new Image XObject
 * @throws IOException if the JPEG data cannot be written
 */
public static PDImageXObject createFromImage(PDDocument document, BufferedImage image, float quality, int dpi)

Question:

I want to print a specific page of a PDF file. For example, I have a PDF with 4 pages and I want to print the third page. I am using the Apache PDFBox library. I tried to remove all pages except the one I want to print, but now it prints all the other pages and not the one I want... Any help?

There is my code of function I wrote:

void printPDFS(String fileName, int i) throws PrinterException, IOException{
    PrinterJob printJob = PrinterJob.getPrinterJob();
    printJob.getPrintService();          
   // String test = "\\\\192.168.5.232\\failai\\BENDRAS\\DHL\\test2.pdf";
    PrinterJob job = PrinterJob.getPrinterJob();
    job.setPrintService(printJob.getPrintService());
    PDDocument doc = PDDocument.load(fileName);

    for(int j=1;j<=doc.getNumberOfPages();j++){
        if(i!=j)
        {
            doc.removePage(j);                
        }
     }
   doc.silentPrint(job);
}

I've added this line to code: System.out.println(doc.getPageMap());

Console gives me: {13,0=4, 1,0=2, 7,0=3, 27,0=1} what does it mean?


Answer:

Your code doesn't work since you don't take into account that removing pages also changes the indices of the pages at higher indices and decreases the number of pages. Also page indices are 0-based. Remove the pages like this and it should work:

i = Math.max(-1, Math.min(i, doc.getNumberOfPages()));

// remove all pages with indices higher than i
for (int j = doc.getNumberOfPages()-1; j > i; j--) {
    doc.removePage(j);
}

// remove all pages with indices lower than i
for (int j = i-1; j >= 0; j--) {
    doc.removePage(j);
}

or alternatively a bit closer to your implementation:

for(int j=doc.getNumberOfPages()-1; j >= 0; j--){
    if(i!=j)
    {
        doc.removePage(j);                
    }
}
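To see why the removal order matters, the effect can be reproduced without PDFBox. The sketch below is only an illustration: it models the document as a plain java.util.List of page labels (PageRemovalDemo, keepForward and keepBackward are made-up names for this example), where List.remove(int) shifts all later elements down by one. It assumes that removing a page from a PDDocument renumbers the remaining pages in the same way, as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a PDF document: each list element is one page.
final class PageRemovalDemo {

    // Forward loop (0-based variant of the question's code):
    // after each removal the later pages move down, so some are skipped.
    static List<String> keepForward(List<String> pages, int keep) {
        List<String> doc = new ArrayList<>(pages);
        for (int j = 0; j < doc.size(); j++) {
            if (j != keep) {
                doc.remove(j); // every page after j shifts down by one index
            }
        }
        return doc;
    }

    // Backward loop, as in the answer: removing index j never moves a lower index.
    static List<String> keepBackward(List<String> pages, int keep) {
        List<String> doc = new ArrayList<>(pages);
        for (int j = doc.size() - 1; j >= 0; j--) {
            if (j != keep) {
                doc.remove(j);
            }
        }
        return doc;
    }
}
```

With pages p1 to p4 and keep = 2 (the third page), the forward loop leaves [p2, p4], while the backward loop leaves only [p3].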

Question:

I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}" and I want to know the coordinates of this string. I tried a couple of examples but didn't get the result I expected: the output shows the coordinates of each character.

Here is the Code

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for(TextPosition text : textPositions) {


        System.out.println( "String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());

    }
}

I am using pdfbox 2.0


Answer:

The last method in which PDFBox' PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException

One should intercept here because this method receives pre-processed, in particular sorted TextPosition objects (if one requested sorting to start with).

(Actually I would have preferred to intercept in the calling method writeLine which according to the names of its parameters and local variables has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, PDFBox developers have declared this method private... well, maybe this changes until the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)

Furthermore it is helpful to use a helper class to wrap sequences of TextPosition instances in a String-like class to make code clearer.

With this in mind one can search for the variables like this

List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };

    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}

with this helper class

public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        TextPosition first = textPositions.get(start);
        // end is exclusive (length() is end - start), so the last element is at end - 1
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}

To merely output their positions, widths, final letters, and final letter positions, you can then use this

void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}

For tests I created a small test file using MS Word:

The output of this test

@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}

is

Variables.pdf
-------------

* Looking for '${var1}'
  Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
  Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
  Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
  Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18

* Looking for '${var 2}'
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

I was a bit surprised because ${var 2} has been found if on a single line; after all, PDFBox code made me assume the method writeString I overrode only retrieves words; it looks as if it retrieves longer parts of the line than mere words...

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
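The search loop inside the overridden writeString above is ordinary String.indexOf iteration and can be tried on its own. The following is a minimal sketch with plain strings instead of TextPosition lists; OccurrenceFinder and findAll are names invented for this illustration. Advancing fromIndex by one rather than by the term length means overlapping matches are reported too, just as in the loop above.

```java
import java.util.ArrayList;
import java.util.List;

final class OccurrenceFinder {

    // Returns the start index of every occurrence of term in text,
    // including overlapping occurrences.
    static List<Integer> findAll(String text, String term) {
        List<Integer> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits; // an empty term would match at every position
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            hits.add(index);
            fromIndex = index + 1; // continue right after the start of this hit
        }
        return hits;
    }
}
```

In the PDFBox version, each index found this way is turned into a sub-sequence of TextPosition objects, which is where the coordinates come from.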

Question:

I already have several PDF documents that have been created, and I am trying to modify them using PDFBox. I need to put text into several places on these documents, but I do NOT want to modify the text that is already within those areas. For instance, there may be a section as follows -

NAME: ______________________________

I will put text into that area, but I need the underline to remain the same length. I believe the best solution would be to just create a textbox or similar that goes above the area so the line remains the same length.

In other words, I do not want to edit the text inline, so that it remains the same length. I have no code for this as I am just attempting to understand the PDFBox package. I have been looking for examples online, but most of them just show how to create a document and not how to update an existing document. How do I do this?


Answer:

I found the answer and wanted to share.

In the pdfbox package there is a class called Overlay.

    // firstDoc and otherDoc are PDDocument instances that have already been loaded
    Overlay overlay = new Overlay();
    PDDocument result = overlay.overlay(firstDoc, otherDoc);

firstDoc will be overlaid onto otherDoc. Easy peasy. I just didn't know where to look.

Question:

I have a service which signs the data and provides me with the signed hash, it correctly generates PKCS#7 DigestInfo as stated in rfc2315#section-9.4

Something like this

The code for the above system is : https://pastebin.com/b3qZH6xW

            //prepare signature
        PDSignature signature = new PDSignature();
        signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
        signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
        signature.setName("Ankit");
        signature.setLocation("Bhopal, IN");
        signature.setReason("Testing");
        // TODO extract the above details from the signing certificate? Reason as a parameter?

        // the signing date, needed for valid signature
        signature.setSignDate(Calendar.getInstance());

        if (accessPermissions == 0)
        {
            setMDPPermission(document, signature, 3);
        }

        FileOutputStream fos = new FileOutputStream(new File("signed_file.pdf"));

        DetachedPkcs7 detachedPkcs7 = new DetachedPkcs7();
        //populate signature options for visible signature. if any.
        SignatureOptions signatureOptions = null;
        document.addSignature(signature);
        ExternalSigningSupport externalSigning = document.saveIncrementalForExternalSigning(fos);
        InputStream dataToSign = externalSigning.getContent();
        byte[] cmsSignature = detachedPkcs7.sign(dataToSign);
        externalSigning.setSignature(cmsSignature);  

The workflow is something like this:

  • Grab the original PDF.
  • Add the signature dictionary and get the hash.
  • Send the hash to the client.
  • Wait for data on standard input.
  • Wait for the client to send the signed hash back; this data is then fed to the paused program via its standard input.
  • Add the CMS. :)

I have no clue why the PDF generated using this process has the signature shown as invalid.


Answer:

There are at least two problems in the client or the communication with it:

Wrong assumed hash algorithm in DigestInfo structure

The signature value returned by the client, when decrypted using the public key of the signer certificate, contains this DigestInfo structure:

  0  81: SEQUENCE {
  2  13:   SEQUENCE {
  4   9:     OBJECT IDENTIFIER sha-512 (2 16 840 1 101 3 4 2 3)
 15   0:     NULL
       :     }
 17  64:   OCTET STRING
       :     '413140d54372f9baf481d4c54e2d5c7bcf28fd6087000280'
       :     'e07976121dd54af2'
       :   }

In particular it claims that SHA512 has been used to calculate the hash. Nonetheless it contains a digest value which is 32 bytes in length, which therefore cannot be a SHA512 digest value!

So your claim

"I have a service which signs the data and provides me with the signed hash, it correctly generates PKCS#7 DigestInfo as stated in rfc2315#section-9.4"

either is incorrect or your code communicating with the service feeds it incorrect data.

Thus, please fix your client or client communication component to make them introduce the correct digest algorithm OID into the signed DigestInfo structure.
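The length argument can be checked with the JDK alone. The sketch below is only an illustration (DigestLengthDemo and digestLength are hypothetical names, not part of the signing code in question): it asks java.security.MessageDigest for the size of the digest each algorithm produces.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestLengthDemo {

    // Returns the number of bytes the named algorithm produces for the given data.
    static int digestLength(String algorithm, byte[] data) {
        try {
            return MessageDigest.getInstance(algorithm).digest(data).length;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 and SHA-512 are expected to be available on standard JDKs
            throw new IllegalStateException(e);
        }
    }
}
```

digestLength("SHA-256", data) is 32 and digestLength("SHA-512", data) is 64 for any input, so a 32-byte value cannot be paired with the sha-512 OID.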

Wrong hash value

Even if the above OID is corrected, the hash value in it is wrong. The correct SHA256 hash value of the signed ranges of your PDF is

9a75434965d5cf2635eb963752494b408a480effabfca1d87b82e619040dfb4b

Thus, please debug your tool chain to find out where the wrong hash value came from.

Appendix: The structure of the CMS container

Another shortcoming of your solution is that the structure of the generated CMS container is very simple. In particular it does not contain signed attributes at all. While this is allowed by the CMS specification, this is extremely insecure against numerous possible forging attacks. Therefore, hardly any CMS container profile in current specifications considers this kind of signature container valid.

Thus, unless your signed documents are used only in a very controlled environment with organisational measures preventing those forging attacks, their value is effectively nil.

Question:

I have created a Document with iText, and I would like to convert this document (which is saved as a PDF file) to an image. For this I use PDFBox, which wants a PDDocument as input. I use the following code:

@SuppressWarnings("unchecked")
public static Image convertPDFtoImage(String filename) {

    Image convertedImage = null;

    try {

        File sourceFile = new File(filename);
        if (sourceFile.exists()) {

            PDDocument document = PDDocument.load(filename);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            PDPage page = list.get(0);

            BufferedImage image = page.convertToImage();

            //Part where image gets scaled to a smaller one
            int width = image.getWidth()*2/3;
            int height = image.getHeight()*2/3;
            BufferedImage scaledImage = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
            Graphics2D graphics2D = scaledImage.createGraphics();
            graphics2D.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            graphics2D.drawImage(image, 0, 0, width, height, null);
            graphics2D.dispose();

            convertedImage = SwingFXUtils.toFXImage(scaledImage, null);

            document.close();

        } else {
            System.err.println(sourceFile.getName() +" File not exists");
        }

    } 
    catch (Exception e) {
        e.printStackTrace();
    }

    return convertedImage;
}

At the moment, I load the document from the file which has been saved. But I would like to perform this internally from within Java.

So my question is: how can I convert a Document to a PDDocument?

Any help is greatly appreciated!


Answer:

What you could do is write the iText document into a ByteArrayOutputStream and then wrap its bytes in a ByteArrayInputStream, which PDFBox can load:

Document document = new Document();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfWriter writer = PdfWriter.getInstance(document, baos);
document.open();
document.add(new Paragraph("Hello World!"));
document.close();
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
// a different variable name is needed: "document" is already the iText Document
PDDocument pdDocument = PDDocument.load(bais);

Of course the file shouldn't be too big, or you'll have memory problems.
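The in-memory hand-over itself does not depend on iText or PDFBox. As a rough sketch using only java.io (InMemoryRoundTrip, toInputStream and readAll are names made up for this example), whatever a writer puts into a ByteArrayOutputStream can be read back through a ByteArrayInputStream over the same bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class InMemoryRoundTrip {

    // Writes the content to memory and returns a stream over a copy of those bytes.
    static InputStream toInputStream(byte[] content) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(content, 0, content.length); // this overload declares no IOException
        return new ByteArrayInputStream(baos.toByteArray());
    }

    // Reads the whole stream and decodes it byte-for-byte as ISO-8859-1.
    static String readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

Note that toByteArray() copies the buffer, so the whole file is briefly held in memory twice, which is the reason for the size warning above.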

Question:

I want to print this in a PDF created by PDFBox. It won't let me insert tabs and spaces because the font does not support them. Why is this a problem, and more importantly, how can I fix it?

StudentData student = listOfDebtors.get(j);

contentStream.beginText();
contentStream.setFont(font, 8);
contentStream.newLineAtOffset(xPosition, yPosition);
contentStream.showText("Member #:"+ student.getMembershipNumber() + "\t"
            + "Grade:" + getStudentGradeInSchool(student.getYearGraduate()) + "\t"
            + "Year Joined" + student.getYearJoined() + "\n" 
            + "Name:" + student.getFirstName() + " " + student.getLastName() + "\n"
            + "Amount Owed : $" + student.getAmountOwed()); 

Error shown:

Caused by: java.lang.IllegalArgumentException: No glyph for U+0009 in font Courier
    at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:353)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:283)
    at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:341)
    at fbla.rahulshah.database.dataManipulation.PDFCreator.createDebtorPDF(PDFCreator.java:61)
    at fbla.rahulshah.database.view.MainScreenController.generateDebtReport(MainScreenController.java:114)
    ... 62 more

Answer:

A font is a set of glyphs. There is no such thing as "a TAB glyph". Just imagine yourself typesetting with metal glyphs 100 years ago and some guy (who owns a typewriter) asks you about "the tab glyph".

In a typewriter, hitting TAB means "jump to the next tab position". A font does not know its own position, it only knows the look and the size of its glyphs. Nor does PDF or PDFBox have a concept of "tab positions". PDF or PDFBox aren't text editors.

And even with an editor, blindly hitting TAB won't always make you happy, depending on the length of the text you just wrote. You would have to check your own position first, then think about hitting TAB, or maybe hitting it twice.

What you should do instead is, after writing a data column, position yourself at the appropriate X position of the next column. With a Courier font (fixed width), you can also do this by calculating the length of a string and adding an appropriate number of spaces.

Which brings us to the next part, the missing space. Well, use a different font that has spaces, because there is a space glyph: it looks invisible, but it has a fixed size.

And finally, there's also no such thing as a "newline glyph". Newline is a command. You already use "newLineAtOffset" which should work fine to position yourself. See the answer by mkl on how to do it.
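For the fixed-width case, the padding approach can be sketched with plain string handling. This is only an illustration under the assumption that a monospaced font such as Courier is used, so that equal character counts mean equal widths; ColumnPadding, pad and row are hypothetical helper names. Each resulting line would be passed to showText on its own, with newLineAtOffset between lines instead of "\n".

```java
final class ColumnPadding {

    // Pads text with trailing spaces up to the given width in characters.
    // Text that is already longer is returned unchanged.
    static String pad(String text, int width) {
        StringBuilder sb = new StringBuilder(text);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Joins cells into one line; every cell except the last is padded to width.
    static String row(int width, String... cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            sb.append(i < cells.length - 1 ? pad(cells[i], width) : cells[i]);
        }
        return sb.toString();
    }
}
```

For example, row(12, "Member #:7", "Grade:11") yields "Member #:7  Grade:11", with the second column starting at character 12 regardless of how long the first value is (as long as it fits). With a proportional font this does not work, and positioning each column by its X coordinate is the way to go.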

Question:

I am using PDFBox and java to generate a pdf document. The document has several pages with text and images. Every page has the same images in the header and footer. I am currently creating a new PDImageXObject and calling drawImage() with the new object every time I add a new page. The resulting document is very heavy and I suppose it is so because it contains repeated copies of the same image.

What would be the most effective way to do this? Most probably, PDFBox has a much better way of managing document-wide resources. I am new to PDFBox and frankly I could not find documentation or examples about this specific use case.

Many thanks


Answer:

You answered the question yourself. You don't have to create a new PDImageXObject every time; once per file is enough. However, you'll still have to call drawImage on every page. (You could save slightly more space if the header and footer are 100% identical by using a form XObject, but you won't save very much unless the header/footer is very complex.)

Question:

I am trying to use PDFBox to create a link I can click to go to another page in the same document.

From this question (How to use PDFBox to create a link that goes to *previous view*?) I see that this should be easy to do, but when I try to do it I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object

I am using this code:

//Loading an existing document consisting of 3 empty pages.
    File file = new File("C:\\Users\\Student\\Documents\\MyPDF\\Test_doc.pdf");
    PDDocument document = PDDocument.load(file);
    PDPage page = document.getPage(1);

    PDAnnotationLink link         = new PDAnnotationLink();
    PDPageDestination destination = new PDPageFitWidthDestination();
    PDActionGoTo action           = new PDActionGoTo();

    destination.setPageNumber(2);
    action.setDestination(destination);
    link.setAction(action);
    link.setPage(page);

I am using PDFBox 2.0.13, can anyone give me some guidance on what I'm doing wrong?

Appreciate all answers.


Answer:

First of all, for a local link ("a link i can click to go to another page in the same document"), destination.setPageNumber is the wrong method to use, cf. its JavaDocs:

/**
 * Set the page number for a remote destination. For an internal destination, call 
 * {@link #setPage(PDPage) setPage(PDPage page)}.
 *
 * @param pageNumber The page for a remote destination.
 */
public void setPageNumber( int pageNumber )

Thus, replace

destination.setPageNumber(2);

by

destination.setPage(document.getPage(2));

Furthermore, you forgot to set a rectangle area for the link and you forgot to add the link to the page annotations.

All together:

PDPage page = document.getPage(1);

PDAnnotationLink link         = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action           = new PDActionGoTo();

destination.setPage(document.getPage(2));
action.setDestination(destination);
link.setAction(action);
link.setPage(page);

link.setRectangle(page.getMediaBox());
page.getAnnotations().add(link);

(AddLink test testAddLinkToMwb_I_201711)

Question:

I use PDFBox (1.8.12) to print PDF documents from Java:

PDDocument pdf = PDDocument.load(new File(args[0]));
PrinterJob job = PrinterJob.getPrinterJob();

PrintRequestAttributeSet attr_set = new HashPrintRequestAttributeSet();
attr_set.add(MediaSizeName.ISO_A4); // <<< supposedly prints in A4 format
attr_set.add(Sides.ONE_SIDED);

PDPageable p = new PDPageable(pdf);


job.setPageable(p);

PrintService ps = null;
for  (PrintService i : PrintServiceLookup.lookupPrintServices(null,null)) {
    if (i.getName().equals(args[1])) {
        ps = i;
    }
}

if (ps == null) {
    try {
        throw new SystemException(ErrorCode.NO_PRINTER_FOUND);
    } catch (SystemException e) {
        e.printStackTrace();
    }
}
else
{

    job.setPrintService(ps);
    job.print(attr_set);

}

The problem is that the printed document has a margin which is cut off, and I don't know why. I ran tests by looping the PDF back into the PDF virtual printer, and the results seem to be the same, which, I think, means that PDFBox does not process the PDF incorrectly.

After further research, the text on the printed paper appears magnified: it begins higher and finishes lower than in the original (when I print directly from the actual printer).

The paper on which I print is A4 format, so I tried to set the format to A4 as shown above, but the problem persists.


Answer:

I found this solution:

PDFPageable p = new PDFPageable(pdf);
PDFPrintable printable = new PDFPrintable(pdf,Scaling.SCALE_TO_FIT);

job.setPageable(p);
job.setPrintable(printable);

to apply the right format. I will edit this if I must remove it, but for now I'll just post the whole thing on GitHub: GPierre-Antoine:Print_A4_APACHE_PDFBOX

If I made a mistake, feel free to fork / comment / help.

Here is an extract of said code:

package com.pierreantoineguillaume;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.printing.PDFPageable;
import org.apache.pdfbox.printing.PDFPrintable;
import org.apache.pdfbox.printing.Scaling;

import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;
import javax.print.attribute.standard.MediaSizeName;
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;
import java.io.File;
import java.io.IOException;
import java.util.NoSuchElementException;
import java.util.Optional;

public class BasicPrinter {

    public static void main(String[] args) {
        BasicPrinter basicPrinter = new BasicPrinter();
        String printerName = "name of my printer";
        String filename = "my file to print";

        try {
            Optional<PrintService> printService = basicPrinter.getMatchingPrintService(printerName);
            basicPrinter.printA4(printService.get(), PDDocument.load(new File(filename)));
        } catch (NoSuchElementException e) {
            System.err.println("Could not locate printer " + printerName);
        } catch (PrinterException e) {
            System.err.println("Could not print file because some error occurred during the print job or a compatibility error with the service");
        } catch (IOException e) {
            System.err.println("Could not find file to print");
        }
    }

    public void printA4(PrintService printer, PDDocument documentToPrint) throws PrinterException {

        PrinterJob job = PrinterJob.getPrinterJob();

        job.setPageable(new PDFPageable(documentToPrint));
        job.setPrintable(new PDFPrintable(documentToPrint, Scaling.SCALE_TO_FIT));


        job.setPrintService(printer);
        job.print(getA4Attributes());
    }

    public Optional<PrintService> getMatchingPrintService(String printerName) {
        for (PrintService i : PrintServiceLookup.lookupPrintServices(null, getA4Attributes())) {
            if (i.getName().equals(printerName)) {
                return Optional.of(i);
            }
        }

        return Optional.empty();
    }

    private PrintRequestAttributeSet getA4Attributes() {
        PrintRequestAttributeSet attr_set = new HashPrintRequestAttributeSet();
        attr_set.add(MediaSizeName.ISO_A4);
        return attr_set;
    }
}

Question:

I'm trying to add a page to an existing PDF-Document that I'm performing multiple different actions on before and after the page is supposed to be added.

Currently I open the document at the beginning and write content on its first and second pages. On the second page I add some images as well. The content written to the PDFs differs per PDF, and sometimes there is so much of it that two pages (or sometimes even three) aren't enough. Now I'm trying to add a third or even a fourth page once a certain amount of written text or printed images is on the second page.

Somehow no matter what I do, the third page I want to add doesn't show up in the final document. Here's my code to add the page:

if(doc.getNumberOfPages() < p+1){
    PDDocument emptyDoc = PDDocument.load("./data/EmptyPage.pdf");
    PDPage emptyPage = (PDPage)emptyDoc.getDocumentCatalog().getAllPages().get(0);
    doc.addPage(emptyPage);
    emptyDoc.close();
}

When I check doc.getNumberOfPages() before, it says 2. Afterwards it says 3. The final document still just has 2 pages. The code after the if-clause contains multiple contentStreams that are supposed to write text on the new page (and on the first and second page).

 contentStream = new PDPageContentStream(doc, (PDPage) allPages.get((int)p), true, true);

In the end, I save the document via

doc.save(tarFolder+nr.get(i)+".pdf");
doc.close();

I've created a whole new project with a class that's supposed to do the exact same thing - add a page to another PDF. This code works perfectly fine and the third page shows up - so it seems like I'm just missing something. My code works perfectly fine for page 1 + 2, we just had the new case that we need a third/fourth page sometimes lately, so I want to integrate that into my main project.

Here's the new project that works:

PDDocument doc = PDDocument.load("D:\\test.pdf");
PDDocument doc2 = PDDocument.load("D:\\EmptyPage.pdf");

List<PDPage> allPages = doc2.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) allPages.get(pageNumber);

doc.addPage(page); 
doc.save("D:\\testoutput.pdf");

What's weird in my main project is that the third page I add gets counted by

"getNumberOfPages()"

but doesn't show up in the final product. The program throws an error if I don't add the page because it tries to write content on the third page.

Any idea what I'm doing wrong?

Thanks in advance!

Edit:

If I add the page at the beginning, when my document is loaded the first time, the page gets added and exists in my final document - like this:

doc = PDDocument.load(config.getFolder("template"));
PDDocument emptyDoc = PDDocument.load("./data/EmptyPage.pdf");
PDPage emptyPage = (PDPage)emptyDoc.getDocumentCatalog().getAllPages().get(0);
doc.addPage(emptyPage);

However, since some documents don't need that extra page, it gets unnecessarily complicated - and I feel like removing the page if it isn't needed isn't really pretty, since I'd like to avoid adding it in the first place. Maybe somebody has an idea now?


Answer:

I found an answer thanks to Tilman Hausherr.

If I move the

emptyDoc.close()

to the end of my code, right after:

doc.save(tarFolder+nr.get(i)+".pdf");
doc.close();

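the page shows up in the final document without any issues. Presumably the added page still refers to data held by emptyDoc, so emptyDoc has to stay open until doc has been saved.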

Question:

I am trying to sign an encrypted PDF document for which signing is allowed (this document: Encrypted PDF document) with the PDFBox 2.0.0 sample code CreateSignature.java.

But I get this exception:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.computeRevisionNumber(StandardSecurityHandler.java:131)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.prepareDocumentForEncryption(StandardSecurityHandler.java:335)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
    at org.apache.pdfbox.pdmodel.PDDocument.saveIncremental(PDDocument.java:966)
    at principal.CreateSignature.signDetached(CreateSignature.java:179)
    at principal.CreateSignature.signDetached(CreateSignature.java:154)
    at principal.CreateSignature.main(CreateSignature.java:334)

I don't know the password of the document, but signing is allowed.

What should I do to sign this document?


Answer:

This has been fixed in PDFBox 2.0 RC3 (released today), please try it.

While the answer by Gleb is well-meant, it is not correct, due to the reasons I've written in the issue: https://issues.apache.org/jira/browse/PDFBOX-2729

The solution has the problem that it reads and saves the file first, so the file is no longer the same by the time it is signed. Another problem is that we must encrypt with the same method as was used initially, and with the same encryption key. For AES256, this (internal) encryption key has a random component, even if the user provides the same keys to the API.

Question:

I am trying to populate repeated forms with PDFBox. I am using a TreeMap and populating the forms with individual records. The format of the PDF form is such that there are six records listed on page one and a static page inserted on page two. (For a TreeMap larger than six records, the process repeats.) The error I'm getting is specific to the size of the TreeMap, and therein lies my problem: I can't figure out why I get this warning when I populate the TreeMap with more than 35 entries:

Apr 23, 2018 2:36:25 AM org.apache.pdfbox.cos.COSDocument finalize WARNING: Warning: You did not close a PDF Document

public class test {
    public static void main(String[] args) throws IOException {
    // TODO Auto-generated method stub
    File dataFile = new File("dataFile.csv");
    File fi = new File("form.pdf");
    Scanner fileScanner = new Scanner(dataFile);
    fileScanner.nextLine();
    TreeMap<String, String[]> assetTable = new TreeMap<String, String[]>();
    int x = 0;
    while (x <= 36) {
        String lineIn = fileScanner.nextLine();
        String[] elements = lineIn.split(",");
        elements[0] = elements[0].toUpperCase().replaceAll(" ", "");
        String key = elements[0];
        key = key.replaceAll(" ", "");
        assetTable.put(key, elements);
        x++;
    }
    PDDocument newDoc = new PDDocument();
    int control = 1;
    PDDocument doc = PDDocument.load(fi);
    PDDocumentCatalog cat = doc.getDocumentCatalog();
    PDAcroForm form = cat.getAcroForm();
    for (String s : assetTable.keySet()) {
        if (control <= 6) {
            PDField IDno1 = (form.getField("IDno" + control));
            PDField Locno1 = (form.getField("locNo" + control));
            PDField serno1 = (form.getField("serNo" + control));
            PDField typeno1 = (form.getField("typeNo" + control));
            PDField maintno1 = (form.getField("maintNo" + control));
            String IDnoOne = assetTable.get(s)[1];
            //System.out.println(IDnoOne);
            IDno1.setValue(assetTable.get(s)[0]);
            IDno1.setReadOnly(true);
            Locno1.setValue(assetTable.get(s)[1]);
            Locno1.setReadOnly(true);
            serno1.setValue(assetTable.get(s)[2]);
            serno1.setReadOnly(true);
            typeno1.setValue(assetTable.get(s)[3]);
            typeno1.setReadOnly(true);
            String type = "";
            if (assetTable.get(s)[5].equals("1"))
                type += "Hydrotest";
            if (assetTable.get(s)[5].equals("6"))
                type += "6 Year Maintenance";
            String maint = assetTable.get(s)[4] + " - " + type;
            maintno1.setValue(maint);
            maintno1.setReadOnly(true);
            control++;
        } else {
            PDField dateIn = form.getField("dateIn");
            dateIn.setValue("1/2019 Yearlies");
            dateIn.setReadOnly(true);
            PDField tagDate = form.getField("tagDate");
            tagDate.setValue("2019 / 2020");
            tagDate.setReadOnly(true);
            newDoc.addPage(doc.getPage(0));
            newDoc.addPage(doc.getPage(1));
            control = 1;
            doc = PDDocument.load(fi);
            cat = doc.getDocumentCatalog();
            form = cat.getAcroForm();
        }
    }
    PDField dateIn = form.getField("dateIn");
    dateIn.setValue("1/2019 Yearlies");
    dateIn.setReadOnly(true);
    PDField tagDate = form.getField("tagDate");
    tagDate.setValue("2019 / 2020");
    tagDate.setReadOnly(true);
    newDoc.addPage(doc.getPage(0));
    newDoc.addPage(doc.getPage(1));
    newDoc.save("PDFtest.pdf");
    Desktop.getDesktop().open(new File("PDFtest.pdf"));

}
}

I can't figure out for the life of me what I'm doing wrong. This is the first week I've been working with PDFBox, so I'm hoping it's something simple.

Updated Error Message

WARNING: Warning: You did not close a PDF Document
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
    at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
    at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:125)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1200)
    at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:383)
    at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:158)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
    at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1204)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1192)
    at test.test.main(test.java:87)

Answer:

The warning by itself

You appear to get the warning wrong. It says:

Warning: You did not close a PDF Document

So in contrast to what you think, "PDFbox saying PDDocument closed when its not", PDFBox says that you did not close a document!

After your edit one sees that it actually says that a COSStream has been closed and that a possible cause is that the enclosing PDDocument already has been closed. This is a mere possibility!

The warning in your case

That being said, by adding pages from one document to another you probably end up having references to those pages from both documents. In that case, in the course of closing both documents (e.g. automatically via garbage collection), the second one to be closed may indeed stumble across some already closed COSStream instances.

So my first advice, to simply close the documents at the end by

doc.close();
newDoc.close();

probably won't remove the warnings, merely change their timing.

Actually you don't merely create two documents doc and newDoc, you even create new PDDocument instances and assign them to doc again and again, in the process setting the former document objects in that variable free for garbage collection. So you eventually have a big bunch of documents to be closed as soon as not referenced anymore.

I don't think it would be a good idea to close all those documents in doc early, in particular not before saving newDoc.

But if your code will eventually be run as part of a larger application instead of as a small, one-shot test application, you should collect all those PDDocument instances in some Collection and explicitly close them right after saving newDoc and then clear the collection.

Actually your exception looks like one of those lost PDDocument instances has already been closed by garbage collection, so you should collect the documents even in case of a simple one-shot utility to keep them from being GC disposed.
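
The collect-then-close advice above can be sketched without any PDFBox-specific calls, because PDDocument implements java.io.Closeable. This is only a minimal sketch: the class DocumentCloser and its methods track, closeAll and size are hypothetical names introduced here for illustration, not part of PDFBox.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps source documents reachable until the result has been saved. */
class DocumentCloser {
    private final List<Closeable> open = new ArrayList<>();

    /** Registers a resource (for example a loaded PDDocument) and returns it for chaining. */
    <T extends Closeable> T track(T resource) {
        open.add(resource);
        return resource;
    }

    /** Closes every tracked resource; returns how many of them failed to close. */
    int closeAll() {
        int failures = 0;
        for (Closeable c : open) {
            try {
                c.close();
            } catch (IOException e) {
                failures++; // keep going so one bad document does not leak the others
            }
        }
        open.clear();
        return failures;
    }

    /** Number of resources currently tracked. */
    int size() {
        return open.size();
    }
}
```

In the question's loop this would mean wrapping each PDDocument.load(fi) in closer.track(...) and calling closer.closeAll() only after newDoc.save(...) has returned. Holding the references in the list also keeps the documents from being garbage collected early.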

(@Tilman, please correct me if I'm wrong...)

Importing pages

To prevent problems with different documents sharing pages, you can try and import the pages to the target document and thereafter add the imported page to the target document page tree. I.e. replace

newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));

by

newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));

This should allow you to close each PDDocument instance in doc before losing it. There are certain drawbacks to this, though, cf. the method JavaDoc and this answer here.

An actual issue in your code

In your combined document you will have many fields with the same name (at least in case of a sufficiently high number of entries in your CSV file) which you initially set to different values. And you access the fields from the PDAcroForm of the respective original document but don't add them to the PDAcroForm of the combined result document.

This is asking for trouble! The PDF format does consider forms to be document-wide with all fields referenced (directly or indirectly) from the AcroForm dictionary of the document, and it expects fields with the same name to effectively be different visualizations of the same field and therefore to all have the same value.

Thus, PDF processors might handle your document fields in unexpected ways, e.g.

  • by showing the same value in all fields with the same name (as they are expected to have the same value) or
  • by ignoring your fields (as they are not in the document AcroForm structure).

In particular, programmatic reading of your PDF field values will fail because in that context the form is definitely considered document-wide and based in AcroForm. PDF viewers, on the other hand, might at first show the values you set and make things look OK.

To prevent this you should rename the fields before merging. You might consider using the PDFMergerUtility which does such a renaming under the hood. For an example usage of that utility class have a look at the PDFMergerExample.

Question:

Essentially, I am attempting to create a small tool in Java where I take the text from some kind of user input, think of an ordinary text box, and create a PDF file with it.

So far, I have managed to scrape something together really quickly with my barebones knowledge of PDFBox.

In my application, I am instantiating this class (the one shown below) in another one with GUI elements. If I input text in, let's say, a text box and run this PDFLetter script once, it works like a charm, but when I run it a second time it crashes and gives me this annoying error:

COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?

I do not really see any way that I could trigger this error in my code. I thought it had something to do with my rudimentary 'jump to next page' solution but it works in its current state, so I do not know what to believe anymore.

The way I am instantiating the class, in case you need to know, is like this:

PDFLetter.PDFLetterGenerate(textInput.getValue().toString());   

Additionally, I thought it had to be some kind of a problem with garbage collection that triggered the problem but I no longer think that this is the case.

public class PDFLetter {
    private static final int PAGE_MARGIN = 80;

    static float TABLE_HEIGHT;  

    static Boolean newPage = false;

public static String text = // null;
        "Ding Dong ding dong Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et "
        + "Imperdiet dui accumsan sit amet. Risus in hendrerit gravida rutrum quisque non tellus orci ac.";

static List<String> textList = new ArrayList<String>();

PDDocument document = new PDDocument();
static PDPage main_page = new PDPage();
static PDPage new_page = new PDPage(); 

static File file = new File("C:/PDFTests/temp.pdf"); 

public void PDFLetterGenerate (String args) throws Exception {
    text = args;
    text = text.replace("\n", "").replace("\r", "");

    if(file.exists()) file.delete();
    file.createNewFile();
    //Creating PDF document object 
    PDDocument document = new PDDocument();       
    document.addPage(main_page);  
    mainBody(document, main_page);      
    document.addPage(new_page);                          
    if(!newPage) document.removePage(new_page);

    document.save(file);
    document.close(); 
}

public static void mainBody(PDDocument doc, PDPage page) throws Exception { 
        final float width = page.getMediaBox().getWidth()-(2*PAGE_MARGIN);
        int fontSize = 11;
        float leading = 1.5f * fontSize;
        final float max = 256;
        PDFont pdfFont = PDType1Font.HELVETICA;

        @SuppressWarnings("deprecation")
        PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);

        int lastSpace = -1;
        while (text.length() > 0){
            int spaceIndex = text.indexOf(' ', lastSpace + 1);
            if (spaceIndex < 0) spaceIndex = text.length();
            String subString = text.substring(0, spaceIndex);
            float size = fontSize * pdfFont.getStringWidth(subString) / 1000;

            if (size > width){
                if (lastSpace < 0) lastSpace = spaceIndex;
                subString = text.substring(0, lastSpace);
                textList.add(subString);
                text = text.substring(lastSpace).trim();
                lastSpace = -1;
            }

            else if (spaceIndex == text.length()){
                textList.add(text);
                text = "";
            }

            else{
                lastSpace = spaceIndex;
            }
        }

        contentStream.beginText();
        contentStream.setFont(pdfFont, fontSize);
        contentStream.newLineAtOffset(PAGE_MARGIN, TABLE_HEIGHT);


        @SuppressWarnings("deprecation")
        PDPageContentStream newStream = new PDPageContentStream(doc, new_page, true, true);

        int nextPage_i = 0;

        for (int i=0; i<textList.size(); i++){ //String line: textList
            System.out.println("HEIGHT: "+ TABLE_HEIGHT);
            nextPage_i = i;
            String line = textList.get(i);
            float charSpacing = 0;

            if (line.length() > 1){         
                float size = fontSize * pdfFont.getStringWidth(line) / 1000;
                float free = width - size;
                if (free > 0){
                    charSpacing = free / (line.length() - 1);
                }
                TABLE_HEIGHT = TABLE_HEIGHT - 10;
            }

            contentStream.setCharacterSpacing(charSpacing); 
            contentStream.showText(line);
            contentStream.newLineAtOffset(0, -leading);

            if(TABLE_HEIGHT <= 280){  
                contentStream.endText();  
                contentStream.close(); 
                newPage = true;
                break; 
            } 
        } 

        if(!newPage){
            contentStream.endText();  
            contentStream.close(); 
        }

        else if (newPage){          
            float NEW_HEIGHT = 600;             
            newStream.beginText();
            newStream.setFont(pdfFont, fontSize);
            newStream.newLineAtOffset(PAGE_MARGIN, NEW_HEIGHT);

            for (int j=nextPage_i; j<textList.size(); j++){ //String line: textList
                System.out.println("HEIGHT: "+ NEW_HEIGHT);
                nextPage_i = j;
                String line = textList.get(j);
                float charSpacing = 0;

                if (line.length() > 1){         
                    float size = fontSize * pdfFont.getStringWidth(line) / 1000;
                    float free = width - size;
                    if (free > 0)
                    {
                        charSpacing = free / (line.length() - 1);
                    }
                    NEW_HEIGHT = NEW_HEIGHT - 10; 
                }
                newStream.setCharacterSpacing(charSpacing); 
                newStream.showText(line);
                newStream.newLineAtOffset(0, -leading);
            }
            newStream.endText();  
            newStream.close();
        }
        lastSpace = -1;
    }
}

Answer:

Pull the PDPage instantiations into PDFLetterGenerate:

public void PDFLetterGenerate (String args) throws Exception {
    PDPage main_page = new PDPage();
    PDPage new_page = new PDPage(); 

    text = args;
    text = text.replace("\n", "").replace("\r", "");

    if(file.exists()) file.delete();
    file.createNewFile();
    //Creating PDF document object 
    PDDocument document = new PDDocument();       
    document.addPage(main_page);  
    mainBody(document, main_page);      
    document.addPage(new_page);                          
    if(!newPage) document.removePage(new_page);

    document.save(file);
    document.close(); 
}

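In your code, the pages are instantiated only once (as static fields), and their underlying streams are closed after the first run of PDFLetterGenerate, when the local PDDocument document is closed after the pages have been added to it. The second run then tries to reuse pages whose streams are already closed.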

Furthermore, also make new_page an argument of mainBody instead of counting on a static variable to hold it.

There are a number of other issues in your code, but the changes above should get you started.
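
One of those other issues appears to be the static textList, which is never cleared and therefore still holds the lines of earlier runs. In the same spirit as passing new_page as an argument, the line-breaking step could return its result instead of filling a static field. The following is only a sketch under assumptions: LineWrapper and wrap are hypothetical names, the greedy wrapping is a simplified stand-in for the loop in the question, and the width measurement is passed in as a function so that the sketch runs without PDFBox. In the question's code the measurement would be fontSize * pdfFont.getStringWidth(s) / 1000.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: greedy word wrapping with no static state. */
class LineWrapper {
    /**
     * Splits text into lines whose measured width does not exceed maxWidth.
     * A single word wider than maxWidth is kept on a line of its own.
     */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> measure) {
        List<String> lines = new ArrayList<>(); // fresh list on every call
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && measure.applyAsDouble(candidate) > maxWidth) {
                lines.add(current.toString()); // current line is full, start a new one
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

Because the list is created inside the method, calling it a second time with the same input gives the same lines rather than appending to the previous result.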

Question:

Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument? Somewhere in the application I have a list of PDDocuments. These documents get merged into one new PDDocument:

PDFMergerUtility pdfMerger = new PDFMergerUtility();

PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
    pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}

Then this PDDocument gets split into bundles of 10:

Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);

My question now is: if I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?

Also, if you have a PDPage object, can you get information from it, like its page number? Or can you get this in another way?


Answer:

  1. Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?

Unfortunately, the PDPage does not contain a reference to its parent PDDocument, but it does give access to a list of all the other pages in the document, which can be used to navigate between pages without a reference to the parent PDDocument.

  2. If you have a PDPage object, can you get information from it like its page number, or can you get this via another way?

There is a workaround to get information about the position of a PDPage in the document without the PDDocument available. Each PDPage has a dictionary with information about the size of the page, resources, fonts, content, etc. One of its entries is called Parent; it refers to a Pages dictionary whose Kids array holds everything needed to create a shallow clone of each PDPage using the constructor PDPage(COSDictionary). The pages are in the correct order, so the page number can be obtained from the position of the entry in the array.

  3. If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?

Once you merge the document list into a single document all references to the original documents will be lost. You can confirm this by looking at the Parent object inside the PDPage, go to Parent > Kids > COSObject[n] > Parent and see if the number for Parent is the same for all the elements in the array. In this example Parent is COSName {Parent} : 1781256139; for all pages.

COSName {Parent} : COSObject {
  COSDictionary {
    COSName {Kids} : COSArray {
      COSObject {
        COSDictionary {
          COSName {TrimBox} : COSArray {0; 0; 612; 792;};
          COSName {MediaBox} : COSArray {0; 0; 612; 792;};
          COSName {CropBox} : COSArray {0; 0; 612; 792;};
          COSName {Resources} : COSDictionary {
            ...
          };
          COSName {Contents} : COSObject {
            ...
          };
          COSName {Parent} : 1781256139;
          COSName {StructParents} : COSInt {68};
          COSName {ArtBox} : COSArray {0; 0; 612; 792; };
          COSName {BleedBox} : COSArray {0; 0; 612; 792; };
          COSName {Type} : COSName {Page};
        }
    }

    ...

    COSName {Count} : COSInt {4};
    COSName {Type} : COSName {Pages};
  }
};

Source code

I wrote the following code to show how the information from the PDPage dictionary can be used to navigate the pages back and forward and get the page number using the position in the array.

public class PDPageUtils {
    public static void main(String[] args) throws InvalidPasswordException, IOException {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");

        PDDocument document = null;
        try {
            String filename = "src/main/resources/pdf/us-017.pdf";
            document = PDDocument.load(new File(filename));

            System.out.println("listIterator(PDPage)");
            ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
            while (pageIterator.hasNext()) {
                PDPage page = pageIterator.next();
                System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Returns a <code>ListIterator</code> initialized with the list of pages from
     * the dictionary embedded in the specified <code>PDPage</code>. The current
     * position of this <code>ListIterator</code> is set to the position of the
     * specified <code>PDPage</code>.
     * 
     * @param page the specified <code>PDPage</code>
     * 
     * @see {@link java.util.ListIterator}
     * @see {@link org.apache.pdfbox.pdmodel.PDPage}
     */
    public static ListIterator<PDPage> listIterator(PDPage page) {
        List<PDPage> pages = new LinkedList<PDPage>();

        COSDictionary pageDictionary = page.getCOSObject();
        COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
        COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);

        List<? extends COSBase> kidList = kidsArray.toList();
        for (COSBase kid : kidList) {
            if (kid instanceof COSObject) {
                COSObject kidObject = (COSObject) kid;
                COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
                if (type == COSName.PAGE) {
                    COSBase kidPageBase = kidObject.getObject();
                    if (kidPageBase instanceof COSDictionary) {
                        COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
                        pages.add(new PDPage(kidPageDictionary));
                    }
                }
            }
        }
        int index = pages.indexOf(page);
        return pages.listIterator(index);
    }
}

Sample output

In this example the PDF document has 4 pages and the iterator was initialized with the first page. Notice that the page number is the previousIndex(), because next() has already moved the cursor past the page it returned.

System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
    PDPage page = pageIterator.next();
    System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71

You can also navigate backwards by starting from the last page. Notice now that the page number is the nextIndex().

ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
    PDPage page = pageIterator.previous();
    System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
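
The index behaviour described above comes from java.util.ListIterator itself and is independent of PDFBox. The following is a minimal sketch that uses plain strings as hypothetical stand-ins for pages (the class and method names are illustrative only). Unlike the backward example above, it positions the iterator directly at the end with listIterator(size) instead of calling next() once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

public class ListIteratorIndexDemo {

    // Walks forward from startIndex and records previousIndex() after each next().
    static List<Integer> forwardIndices(List<String> items, int startIndex) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(startIndex);
        while (it.hasNext()) {
            it.next();
            seen.add(it.previousIndex()); // index of the element just returned
        }
        return seen;
    }

    // Walks backward from the end and records nextIndex() after each previous().
    static List<Integer> backwardIndices(List<String> items) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(items.size());
        while (it.hasPrevious()) {
            it.previous();
            seen.add(it.nextIndex()); // index of the element just returned
        }
        return seen;
    }

    public static void main(String[] args) {
        // Strings stand in for PDPage objects here.
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(forwardIndices(pages, 0));
        System.out.println(backwardIndices(pages));
    }
}
```

In both directions the recorded value is the index of the element that was just returned, which is why the forward loop reads previousIndex() and the backward loop reads nextIndex().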

Question:

The Problem: I have a large folder with many subfolders containing many PDFs. Some of them already have OCR applied; some don't. So I want to write a Java program that filters out the non-OCR PDFs and copies them to a hot folder.

I tested about 20 documents, and what they all have in common is that if you open them in a text editor, you can find the word 'font' in the OCR ones and you can't find it in the non-OCR ones. My question now is: how do I implement this check using PDFBox 2.0.0? All the solutions I found seem to work only with older versions, and I have not been able to find a solution in the documentation (which is clearly my fault).

Thanks in advance.


Answer:

Here's how to find out if fonts are on the top level of a page:

    PDDocument doc = PDDocument.load(new File(...));
    PDPage page = doc.getPage(0); // 0 based
    PDResources resources = page.getResources();
    for (COSName fontName : resources.getFontNames())
    {
        System.out.println(fontName.getName());
    }
    doc.close();

Re: mkl suggestion, here's how to extract text:

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1); // 1 based
    stripper.setEndPage(1);
    String extractedText = stripper.getText(doc);
    System.out.println(extractedText);

Question:

I would like to validate a PDF that was created not as a file but as a ByteArrayOutputStream, which is downloaded to the browser. To avoid security issues, I would like to validate it using the PDFBox PreflightParser, but it only seems to offer an option for parsing a file, not a PDDocument.

ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
doc.save(byteArrayOutputStream);
PreflightParser parser = new PreflightParser(doc);

//this constructor accepts only file.

The expectation is to validate the PDF on the fly instead of loading it from the file system.


Answer:

You can also pass a DataSource. To facilitate this, use org.apache.pdfbox.preflight.utils.ByteArrayDataSource, whose constructor accepts an InputStream.

Question:

Do you only need to close a PDDocument after a load/save operation, or must every new PDDocument object be closed (e.g. when doing merge/split/... operations)?

Let's say, for example, that I have 3 PDDocuments which I loaded from bytearrays:

PDDocument pdf1 = PDDocument.load(bytearray1);
PDDocument pdf2 = PDDocument.load(bytearray2);
PDDocument pdf3 = PDDocument.load(bytearray3);

Say that I then merge these 3 pdfs into one PDDocument:

PDFMergerUtility merger = new PDFMergerUtility();
PDDocument mergedPdf = new PDDocument();
merger.appendDocument(mergedPdf, pdf1);
merger.appendDocument(mergedPdf, pdf2);
merger.appendDocument(mergedPdf, pdf3);

I close the 3 pdfs:

pdf1.close();
pdf2.close();
pdf3.close();

But must I now also close the mergedPdf or is this not needed?

mergedPdf.close(); // is this needed?

Answer:

You should close all PDDocument objects when you are done with them, including mergedPdf. This is good practice and avoids memory leaks. Where possible, use try-with-resources.
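
To illustrate what try-with-resources does with several documents, here is a minimal sketch that runs without PDFBox. FakeDocument is a hypothetical stand-in for PDDocument (which implements Closeable); it only records when it is closed. The point it shows is that every resource declared in the header is closed automatically, in reverse order of declaration, after the body finishes.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {

    // Hypothetical stand-in for PDDocument; it only logs its own close() call.
    static final class FakeDocument implements AutoCloseable {
        private final String name;
        private final List<String> log;

        FakeDocument(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    // Opens three "documents", does the work, and returns the event log.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (FakeDocument pdf1 = new FakeDocument("pdf1", log);
             FakeDocument pdf2 = new FakeDocument("pdf2", log);
             FakeDocument merged = new FakeDocument("merged", log)) {
            log.add("merge and save"); // real code would append and save here
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Declaring the merged document last means it is closed first, while the source documents stay open for the whole body, including the save.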

Question:

I use the following Java Code to print a PDF Document:

PrinterJob job = PrinterJob.getPrinterJob();
job.setPrintService(printer);
File file = new File(fileName);
PDDocument doc = PDDocument.load(file);
PDFPageable pageable = new PDFPageable(doc);
job.setPageable(pageable);
System.out.println("-- before printing --");
job.print();
System.out.println("-- after printing --");
doc.close();

the output on console is:

-- before printing --
Aug 03, 2018 12:05:09 PM org.apache.pdfbox.cos.COSDocument finalize
WARNUNG: Warning: You did not close a PDF Document
-- after printing --

Why do I get this warning?


Answer:

As discussed in the comments, the problem went away after updating the JDK, here from 1.8.0_131 to 1.8.0_181. I can only suspect that the older JDK had a bug that caused objects to be prematurely marked as "unused" and thus finalized.

Question:

I have seen how to do this in previous versions like below:

How to extract font styles of text contents using pdfbox?

But I think the getFonts() method has been removed now. I want to retrieve a map of texts to fonts (Map<String, PDFont>) in the new version of PDFBox but I have no idea how.

Thanks

Kabeer


Answer:

Do this:

// PDFBox 2.0 loads from a File (or stream / byte array), not from a path string
PDDocument doc = PDDocument.load(new File("C:/mydoc3.pdf"));
for (int i = 0; i < doc.getNumberOfPages(); ++i)
{
    PDPage page = doc.getPage(i);
    PDResources res = page.getResources();
    for (COSName fontName : res.getFontNames())
    {
        PDFont font = res.getFont(fontName);
        // do stuff with the font
    }
}
doc.close();

Question:

I'm a complete newbie to PDFBox and I'm having an issue that I can't find a way to solve at the moment.

I get from my database a list of folders and the documents located in those folders. I iterate over all this data to generate an index with active links to the correct folder/document path. (Imagine I have a root folder and I want to have a PDF index in that root with relative links to all folders and documents contained in it.)

The main code looks like follows:

    try {
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();

        document.addPage(page);

        PDFont font = PDType1Font.HELVETICA;


        PDPageContentStream contentStream = new PDPageContentStream(document, page);
        contentStream.beginText();
        contentStream.setFont(font, 12);
        contentStream.moveTextPositionByAmount(100, 700);
        contentStream.drawString("Índice " + expediente.getNombre());
        contentStream.endText();


        int i = 0;
        for (Folder currFol : root.getFolders()) {
            for (Document currDoc : currFol.getDocuments()) {
                i++;

                float x = 50;
                float y = 250;
                String text = currFol.getName() + "/" + currDoc.getName();
                float textWidth = font.getStringWidth(text) / 1000 * 12;

                PDAnnotationLink link = new PDAnnotationLink();
                PDGamma colourBlue = new PDGamma();
                colourBlue.setB(1);
                link.setColour(colourBlue);

                // add an action
                PDActionURI action = new PDActionURI();
                action.setURI(currFol.getName() + "/" + currDoc.getName());

                link.setAction(action);

                contentStream.beginText();
                contentStream.setFont(font, 12);
                contentStream.moveTextPositionByAmount(x, y);
                contentStream.drawString(text);
                contentStream.endText();

                PDRectangle position = new PDRectangle();
                position.setLowerLeftX(x);
                position.setLowerLeftY(y -(i* 5));
                position.setUpperRightX(x + textWidth);
                position.setUpperRightY(y + 50);
                link.setRectangle(position);

                page.getAnnotations().add(link);
            }
        }
        // Make sure that the content stream is closed:

        contentStream.close();

        document.save(output);

        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

My problem here is that all elements are printed overlapped; text and boxes are on top of each other, and at the moment I can't find out how to print all links correctly in a list-formatted style to create the index.

Any idea or suggestion would be very much appreciated.

I tried following some tutorials, with no success so far, like http://www.programcreek.com/java-api-examples/index.php?api=org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink, http://www.massapi.com/class/pd/PDAnnotationLink.html .

I tried without PDRectangle, just the text link (PDRectangle is present in all the examples I found, but I don't really need it).

Thank you,


Answer:

In your inner loop you always set x and y to the same values

float x = 50;
float y = 250;

which you don't change thereafter. Then you draw the respective text starting at x, y

contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.moveTextPositionByAmount(x, y);
contentStream.drawString(text);
contentStream.endText();

Thus, you draw each entry starting at the same coordinates. So it is not surprising all the text overlaps.

Furthermore you set position and size of the links like this:

PDRectangle position = new PDRectangle();
position.setLowerLeftX(x);
position.setLowerLeftY(y -(i* 5));
position.setUpperRightX(x + textWidth);
position.setUpperRightY(y + 50);
link.setRectangle(position);

Thus, the upper y coordinate of each link is always y + 50, i.e. 300, and the lower y coordinate of the links moves down by 5 per iteration. So the vertical extent of your first link is contained in that of the second which in turn is contained in that of the third etc etc etc. Again no surprise that these annotations overlap.

Thus, this has nothing to do with being a PDFBox newbie but merely with getting one's coordinates right... ;)


How about something like this instead:

float x = 50;
float y = 650;

for (Folder currFol : root.getFolders()) {
    for (Document currDoc : currFol.getDocuments()) {
        String text = currFol.getName() + "/" + currDoc.getName();
        // divide by a float so the result can be assigned to a float
        float textWidth = font.getStringWidth(text) / 1000f * 12;

        PDAnnotationLink link = new PDAnnotationLink();
        PDGamma colourBlue = new PDGamma();
        colourBlue.setB(1);
        link.setColour(colourBlue);

        // add an action
        PDActionURI action = new PDActionURI();
        action.setURI(currFol.getName() + "/" + currDoc.getName());

        link.setAction(action);

        contentStream.beginText();
        contentStream.setFont(font, 12);
        contentStream.moveTextPositionByAmount(x, y);
        contentStream.drawString(text);
        contentStream.endText();

        PDRectangle position = new PDRectangle();
        position.setLowerLeftX(x);
        position.setLowerLeftY(y - 3);
        position.setUpperRightX(x + textWidth);
        position.setUpperRightY(y + 12);
        link.setRectangle(position);

        page.getAnnotations().add(link);

        y -= 15;
    }
}
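
The coordinate arithmetic behind this fix can be checked without PDFBox. The sketch below uses hypothetical names (IndexRowLayout, linkBounds, overlaps) and takes the numbers from the answer: a line height of 15, with each link rectangle reaching 3 units below and 12 units above the baseline. It computes the vertical extent of each row's link and tests whether two extents overlap.

```java
public class IndexRowLayout {

    static final float LINE_HEIGHT = 15f; // baseline-to-baseline distance from the answer
    static final float BELOW = 3f;        // link extends this far below the baseline
    static final float ABOVE = 12f;       // link extends this far above the baseline

    // Returns {lowerLeftY, upperRightY} of the link for the given zero-based row.
    static float[] linkBounds(float firstBaselineY, int row) {
        float baseline = firstBaselineY - row * LINE_HEIGHT;
        return new float[] { baseline - BELOW, baseline + ABOVE };
    }

    // True if two vertical extents share more than a common edge.
    static boolean overlaps(float[] a, float[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    public static void main(String[] args) {
        for (int row = 0; row < 3; row++) {
            float[] bounds = linkBounds(650f, row);
            System.out.println("row " + row + ": " + bounds[0] + " .. " + bounds[1]);
        }
    }
}
```

With these values, consecutive rows only touch at a shared edge, whereas the extents from the question (245 to 300, then 240 to 300) are nested inside each other.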

Question:

I am trying to create a PDDocument and then add two pages to it, the first one containing the text "first page" and the second one being blank. I then split the PDDocument and put the parts into a list. When I access the first page (using the get method) and save it, I expect to see a PDF with the text "first page" on it, but all I get is a blank page. Any suggestions?

package split;

import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.util.Splitter;

public class pdfSplit {


    public static void main(String[] args) throws Exception {

        PDPage page1, page2;

        page1 = new PDPage();
        page2 = new PDPage();

        Splitter splitter = new Splitter();
        PDDocument document = new PDDocument();

        document.addPage(page1);
        document.addPage(page2);

        List<PDDocument> splittedPDF =  splitter.split(document);

        PDFont font = PDType1Font.HELVETICA_BOLD;

        PDPageContentStream contentStream = new PDPageContentStream(document, page1);

        contentStream.beginText();
        contentStream.setFont( font, 50 );
        contentStream.moveTextPositionByAmount( 100, 700 );
        contentStream.drawString( "First page" );
        contentStream.endText();

        contentStream.close();



        document = splittedPDF.get(0);      //No effect
        document.save("Random.pdf");
    }

}

Answer:

Your page is blank because you do the split before writing to the page content stream. Solution: move the splitting code to after closing your content stream. Correct code thus looks like this:

    PDPage page1, page2;

    page1 = new PDPage();
    page2 = new PDPage();

    Splitter splitter = new Splitter();
    PDDocument document = new PDDocument();

    document.addPage(page1);
    document.addPage(page2);

    PDFont font = PDType1Font.HELVETICA_BOLD;

    PDPageContentStream contentStream = new PDPageContentStream(document, page1);

    contentStream.beginText();
    contentStream.setFont(font, 50);
    contentStream.moveTextPositionByAmount(100, 700);
    contentStream.drawString("First page");
    contentStream.endText();

    contentStream.close();
    // now the page is filled!


    List<PDDocument> splittedPDF = splitter.split(document);

    document = splittedPDF.get(0);
    document.save("Random.pdf");

(This answer was done with version 1.8.10)

Question:

I have a problem with the PDFBox API. I'm trying to encrypt a merged PDF document with the following code:

This is the function to merge / create the doc

    public static void fillMultipleReportsInOne(List<report> reports) throws IOException {

        PDFMergerUtility PDFmerger = new PDFMergerUtility(); 
        PDDocument resultPDF = new PDDocument();

        for (report report : reports) {

            try 
            {
                PDDocument pdfDocument = PDDocument.load(new File(SRC + "test.pdf"));
                // get the document catalog
                PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

                // as there might not be an AcroForm entry a null check is necessary
                setFormFields(acroForm, report.getName(), report.getArea(), report.getOperatingActivities(), report.getVocationlaSchool(), report.getWeeklyTopics());
                // Save and close the filled out form.
                PDFmerger.appendDocument(resultPDF, pdfDocument);



        } catch (Exception e) {
            e.printStackTrace();
        }

    }


    encryptPDF(resultPDF, SRC + "merged.pdf");

}

And this is the function to encrypt:

 public static PDDocument encryptPDF(PDDocument pddocumentToEncrypt, String SRC) {

        // Define the length of the encryption key.
        // Possible values are 40 or 128 (256 will be available in PDFBox 2.0).
        int keyLength = 128;

        AccessPermission ap = new AccessPermission();

        // Disable printing, everything else is allowed
        ap.setCanModifyAnnotations(false);
        ap.setCanFillInForm(false);
        ap.setCanModify(false);


        // Owner password (to open the file with all permissions) is "12345"
        // User password (to open the file but with restricted permissions, is empty here) 
        StandardProtectionPolicy spp = new StandardProtectionPolicy("12334", "", ap);
        spp.setEncryptionKeyLength(keyLength);
        spp.setPermissions(ap);
        try {
            pddocumentToEncrypt.protect(spp);
            pddocumentToEncrypt.save(SRC);
            pddocumentToEncrypt.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return pddocumentToEncrypt;
    }

And finally the call of the function with all the sample data

report report1 = new report("TestUser1", "P&T", "operatingActivities", "weeklyTopics","vocationalSchool");
        report report2 = new report("TestUser2", "P&T2", "operatingActivities2", "weeklyTopics2","vocationalSchool2");
        report report3 = new report("TestUser3", "P&T3", "operatingActivities3", "weeklyTopics3","vocationalSchool3");
        report report4 = new report("TestUser4", "P&T4", "operatingActivities4", "weeklyTopics4","vocationalSchool4");

        List<report> reports = new ArrayList<>();
        reports.add(report1);
        reports.add(report2);
        reports.add(report3);
        reports.add(report4);

        fillMultipleReportsInOne(reports);

My outcome looks like this:

OUTCOME

Only the first field is filled with data, when all fields should have been. It's definitely an encryption problem, because when I delete the document.protect() line the data is filled in correctly. I also tried the acroForm.flatten() function, with no success.

Maybe someone had the same issue and is willing to help :) Thanks in advance - Alex

Here's the entire file pasted in pastebin: https://pastebin.com/L9auaTGH

With the code line

pddocumentToEncrypt.getDocumentCatalog().getAcroForm().refreshAppearances();

inside my encryption function, it worked


Answer:

Calling

pddocumentToEncrypt.getDocumentCatalog().getAcroForm().refreshAppearances();

fixes this for the 2.0 version. That version doesn't set the appearance (i.e. the visual representation of the form value) in the setValue() call when /NeedAppearances is set (see PDTerminalField.applyChange()). The /NeedAppearances setting (when true) means that the viewing application should create the appearance, so that processing applications in between do not need to do so; I suspect that one or several of the permission settings prevent Adobe Reader from changing it when displaying.

Another solution would be to call

pdfDocument.getDocumentCatalog().getAcroForm().setNeedAppearances(false);

before setting the form values.

The only unsolved mystery is why the first value was visible in the merged file.

Question:

I am trying to split a single PDF into multiple documents, e.g. a 10-page document into 10 single-page documents.

PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();

The problem here is that the new document's page size is different from the original document's, so some text is cropped or missing in the new document. I am using PDFBox 2.0; how can I avoid this?

UPDATE: Thanks @mkl.

Splitter did the magic. Here is the updated working part,

public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
      throws IOException {

    File file = new File(meta.getFilename());

    Splitter splitter = new Splitter();
    splitter.setStartPage(meta.getStart());
    splitter.setEndPage(meta.getEnd());
    splitter.setSplitAtPage(meta.getEnd());

    List<PDDocument> docs = splitter.split(source);
    if(docs.size() > 0){
      PDDocument output = docs.get(0);
      output.save(file);
      output.close();
    }
  }

public class SplitMeta {

  private String filename;
  private int start;
  private int end;

  public SplitMeta() {
  }
}

Answer:

Unfortunately the OP has not provided a sample document to reproduce the issue. Thus, I have to guess.

I assume that the issue is based in objects not immediately linked to the page object but inherited from its parents.

In that case using PDDocument.addPage is the wrong choice as this method only adds the given page object to the target document page tree without consideration of inherited stuff.

Instead one should use PDDocument.importPage which is documented as:

/**
 * This will import and copy the contents from another location. Currently the content stream is stored in a scratch
 * file. The scratch file is associated with the document. If you are adding a page to this document from another
 * document and want to copy the contents to this document's scratch file then use this method otherwise just use
 * the {@link #addPage} method.
 * 
 * Unlike {@link #addPage}, this method does a deep copy. If your page has annotations, and if
 * these link to pages not in the target document, then the target document might become huge.
 * What you need to do is to delete page references of such annotations. See
 * <a href="http://stackoverflow.com/a/35477351/535646">here</a> for how to do this.
 *
 * @param page The page to import.
 * @return The page that was imported.
 * 
 * @throws IOException If there is an error copying the page.
 */
public PDPage importPage(PDPage page) throws IOException

Actually even this method might not suffice as is as it does not consider all inherited attributes, but looking at the Splitter utility class one gets an impression what one has to do:

PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources 
processAnnotations(imported);

making use of the helper method

private void processAnnotations(PDPage imported) throws IOException
{
    List<PDAnnotation> annotations = imported.getAnnotations();
    for (PDAnnotation annotation : annotations)
    {
        if (annotation instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annotation;   
            PDDestination destination = link.getDestination();
            if (destination == null && link.getAction() != null)
            {
                PDAction action = link.getAction();
                if (action instanceof PDActionGoTo)
                {
                    destination = ((PDActionGoTo)action).getDestination();
                }
            }
            if (destination instanceof PDPageDestination)
            {
                // TODO preserve links to pages within the splitted result  
                ((PDPageDestination) destination).setPage(null);
            }
        }
        // TODO preserve links to pages within the splitted result  
        annotation.setPage(null);
    }
}

As you are trying to split the single PDF into multiple, like 10 page document into 10 single page document, you might want to use this Splitter utility class as is.

Tests

To test those methods I used the PDF Clown sample output AnnotationSample.Standard.pdf, because that library heavily depends on inheritance of page tree values. Thus, I copied the content of its only page to a new document using either PDDocument.addPage, PDDocument.importPage, or Splitter like this:

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithAddPage)

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithImportPage)

PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithSplitter)

Only the final test copied the page faithfully.

Question:

I'm trying to create temporary PDF files in Java using PDDocument. I'm employing the following method to create a temporary PDF file.

/* Create a temporary PDF file.*/
private File createPdf(String fileName) throws IOException {
    final PDDocument document = new PDDocument();
    final File file = File.createTempFile(fileName, ".pdf");
    //write it
    BufferedWriter bw = new BufferedWriter(new FileWriter(file));
    bw.write("This is the temporary pdf file content");
    bw.close();
    document.save(file);        
    document.close();
    return file;
}

This is the test.

@Test
public void testCreateAndMergePdfs() throws IOException {
    Collection<File> pdfs = new ArrayList<>(Arrays.asList(createPdf("File1"), createPdf("File2")));
    assertFalse(CollectionUtils.isEmpty(pdfs));
    PdfPrintPojo pdfPrintPojo = new PdfPrintPojo(pdfs);
    File mergedFile = service.createAndMergePDFs(pdfPrintPojo, "Merged");
    assertNotNull(mergedFile);
    List<File> list = new ArrayList<>(pdfs);
    File file1 = list.get(0);
    File file2 = list.get(1);
    assertTrue(FileUtils.contentEquals(file1, file2));
}

What I'm trying to do here is to create and merge two PDF files. When I run the test, it creates two PDF files in the temp folder, for example, \AppData\Local\Temp\File16375814641476797612.pdf and \AppData\Local\Temp\File24102718409195239661.pdf and the merged file at \AppData\Local\Temp\Merged_merged_3755858389884894769.pdf. But the test fails at assertTrue(FileUtils.contentEquals(file1, file2)); When I try to open the PDF files in the temp folder, it says that the PDF is corrupted. Also, I have no idea why the files are not being saved as File1 and File2. Can anyone help me with this?


Answer:

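Following the Apache PDFBox tutorial, I managed to create working PDF files. (The random digits in the file names come from File.createTempFile, which appends a unique suffix to the given prefix.) The method was changed as follows.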

/* Create a temporary PDF file.*/
private File createPdf(String fileName) throws IOException {
    // Create a document and add a page to it
    final PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage(page);

    // Create a new font object selecting one of the PDF base fonts
    PDFont font = PDType1Font.HELVETICA_BOLD;

    // Start a new content stream which will "hold" the to be created content
    PDPageContentStream contentStream = new PDPageContentStream(document, page);

    // Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
    contentStream.beginText();
    contentStream.setFont(font, 12);
    contentStream.newLineAtOffset(100, 700);
    contentStream.showText("Hello World");
    contentStream.endText();

    // Make sure that the content stream is closed:
    contentStream.close();

    // Save the results and ensure that the document is properly closed:
    File file = File.createTempFile(fileName, ".pdf");
    document.save(file);
    document.close();
    return file;
}

As for the test, I took the approach of using PDDocument to load the files, then extract data as String using PDFTextStripper and using assertions on those Strings.

 @Test
public void testCreateAndMergePdfs() throws IOException {
    Collection<File> pdfs = new ArrayList<>(Arrays.asList(createPdf("File1"), createPdf("File2")));
    assertFalse(CollectionUtils.isEmpty(pdfs));
    PdfPrintPojo pdfPrintPojo = new PdfPrintPojo(pdfs);
    File mergedFile = service.createAndMergePDFs(pdfPrintPojo, "Merged");
    assertNotNull(mergedFile);
    List<File> list = new ArrayList<>(pdfs);

    /* Load the PDF files and extract data as String. */
    PDDocument document1 = PDDocument.load(list.get(0));
    PDDocument document2 = PDDocument.load(list.get(1));
    PDDocument merged = PDDocument.load(mergedFile);

    PDFTextStripper stripper = new PDFTextStripper();
    String file1Data = stripper.getText(document1);
    String file2Data = stripper.getText(document2);
    String mergedData = stripper.getText(merged);

    /* Assert that data from file 1 and 2 are equal with each other and merged file. */
    assertEquals(file1Data, file2Data);
    assertEquals(file1Data + file2Data, mergedData);
}

Question:

I'm looking to get an accurate size of each page in a PDF as part of a Unit test of PDF's I'll be creating. As I'm dealing with PDFs that have many different page sizes in each document the code returns an ArrayList of dimensions.

AFAIK each page can have its own DPI setting too.

I've done quite a bit of Googling but I've only come up with this which only gives me part of the answer, as I still need to work out what DPI each page is.

PDFBox - find page dimensions

public static ArrayList<float[]> getDimentions(PDDocument document) {
    ArrayList<float[]> dimensions = new ArrayList<>();
    float[] dim = new float[2];
    //Loop Round Each Page
    //Get Dimensions of each page and DPI
    for (int i = 0; i < document.getNumberOfPages(); i++) {
        PDPage currentPage = document.getPage(i);
        PDRectangle mediaBox = currentPage.getMediaBox();
        float height = mediaBox.getHeight();
        float width = mediaBox.getWidth();
        // How do I get the DPI now????
    }
    //Calculate Size of Page in mm  (
    //Get Dimensions of each page and DPI ( https://stackoverflow.com/questions/20904191/pdfbox-find-page-dimensions/20905166  )
    //Add size of page to list of page sizes
    //Return list of page sizes.
    return dimensions;
}

Answer:

The page dimensions (media box / crop box) are given in default user space units, which in turn default to 1/72 inch. So simply divide the box width or height by 72 to get the page width or height in inches.

This does not correspond to a DPI value of 72, though: DPI is a resolution, and a PDF page has no resolution. The default user space unit is merely a different unit of measurement.
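
The conversion described above is simple arithmetic. Here is a minimal sketch with hypothetical helper names, assuming the page keeps the default unit of 1/72 inch (i.e. it does not scale it through a UserUnit entry):

```java
public class PageSizeUnits {

    static final double POINTS_PER_INCH = 72.0; // default user space unit is 1/72 inch
    static final double MM_PER_INCH = 25.4;

    // Converts a length in default user space units to inches.
    static double pointsToInches(double points) {
        return points / POINTS_PER_INCH;
    }

    // Converts a length in default user space units to millimetres.
    static double pointsToMillimetres(double points) {
        return pointsToInches(points) * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // A media box of 612 x 792 units is US Letter.
        System.out.println(pointsToInches(612) + " x " + pointsToInches(792) + " in");
        System.out.println(pointsToMillimetres(612) + " x " + pointsToMillimetres(792) + " mm");
    }
}
```

In the question's loop, the width and height returned by mediaBox.getWidth() and mediaBox.getHeight() could be passed to these helpers directly; no DPI value is involved.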

Question:

I am trying to split a large PDF of type document bundle. This PDF has an index page which links to different pages eg.

Index:

Topic 1: page 1-5

Topic 2: page 12-25

I am currently using PDFBox to load the PDF and get the page numbers, but I am looking for a way to get the metadata that would allow me to group the pages by their topics.

Is there a way of retrieving the document structure so I can break the document down into smaller PDFs? E.g. Topic 1 becomes a single PDF with pages 1-5 merged.

Here is the code:

PDDocumentOutline outline = pdocument.getDocumentCatalog().getDocumentOutline();

for (PDOutlineItem item : outline.children()) {

String pageTitle=item.getTitle(); //Topic 1

PDPage destinationPage=item.findDestinationPage(pdocument);

//How do I get actual pageNumber of Page?

//How do I get Destination reference string ie. pg 1-5


}

Answer:

You may want to have a look at section 12.3.3 "Document Outline" in the PDF 1.7 specification. The document outline is a tree structure providing links to various parts of the document. For example, if you convert a LibreOffice document to PDF, the headings would be used for the outline.

If your PDF has such an outline, you can use it to split it.

If it only has an index page, there may be PDF tags (see section 14.8 "Tagged PDF") available for easily getting the needed data.

If there are no PDF tags, you would probably need to parse the text and analyse it to get the needed information.
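
If the outline is usable, one way to turn it into page groups is sketched below. It is plain Java and rests on two assumptions that are not from the question: the outline items are ordered by page, and each topic runs up to the page before the next topic starts (which would not hold for an index with gaps such as "1-5" followed by "12-25"). The zero-based start index of each item is taken as given; in PDFBox 2.0 it should be obtainable by looking up the page returned by item.findDestinationPage(pdocument) with pdocument.getPages().indexOf(...). The names OutlineRanges and topicRanges are hypothetical.

```java
public class OutlineRanges {

    // Builds "title: first-last" strings (zero-based page indices) under the
    // ASSUMPTION that each topic ends on the page before the next topic starts.
    static String[] topicRanges(String[] titles, int[] startPages, int pageCount) {
        String[] ranges = new String[titles.length];
        for (int i = 0; i < titles.length; i++) {
            int first = startPages[i];
            // The last topic runs to the end of the document.
            int last = (i + 1 < titles.length ? startPages[i + 1] : pageCount) - 1;
            ranges[i] = titles[i] + ": " + first + "-" + last;
        }
        return ranges;
    }

    public static void main(String[] args) {
        String[] titles = { "Topic 1", "Topic 2" };
        int[] starts = { 0, 11 }; // illustrative start indices
        for (String range : topicRanges(titles, starts, 25)) {
            System.out.println(range);
        }
    }
}
```

Each resulting range could then be handed to a splitter as start and end page, keeping in mind that Splitter page numbers are 1-based.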

Question:

I am updating the values of an editable PDF using PDFBox. Instead of saving, I want to return stream. I saved it, it works all fine. Now I want to return byte[] instead of saving it.

public static void main(String[] args) throws IOException
{
    String formTemplate = "myFormPdf.pdf";

    try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
    {
        PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

        if (acroForm != null)
        {

            PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
            field.setValue("Text Entry");
        }

        pdfDocument.save("updatedPdf.pdf"); // instead of this I need STREAM
    }
}

I tried SerializationUtils.serialize but it fails to serialize it.

Failed to serialize object of type: class org.apache.pdfbox.pdmodel.PDDocument

Answer:

Use the overloaded save method which accepts an OutputStream and use ByteArrayOutputStream.

public static void main(String[] args) throws IOException
{
    String formTemplate = "myFormPdf.pdf";

    try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
    {
        PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

        if (acroForm != null)
        {

            PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
            field.setValue("Text Entry");
        }

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        pdfDocument.save(baos);
        byte[] pdfBytes = baos.toByteArray(); // PDF Bytes
    }
}

Question:

I am using PDFBox to extract text from PDF documents. Once the text is extracted, I insert it into a table in MySQL.

The code:

PDDocument document = PDDocument.load(new File(path1));

if (!document.isEncrypted()) {
    PDFTextStripper tStripper = new PDFTextStripper();
    String pdfFileInText = tStripper.getText(document);
    String lines[] = pdfFileInText.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");

        String sql="insert IGNORE into  test.indextable values (?,?);";

        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);

            /* preparedStatement.executeUpdate();
            System.out.print("Add ");*/

            preparedStatement.addBatch();

            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Thousand");
            }
        }

        if (i > 0) {
            preparedStatement.executeBatch();

            System.out.print("Add Remaining");
        }
    }
}

The code works fine but as you can see if the document is large and has like 10 million words inside it, the lines[] is not gonna do any justice and will throw out of memory exception.

I can't think of a solution to this. Is there any way I could just extract and insert the words directly to the db or it's not possible?

EDITED :

This is what I did:

processText method :

public void processText(String text) throws SQLException {

    String lines[] = text.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");


        String sql="insert IGNORE into  test.indextable values (?,?);";


        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {

            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);



            preparedStatement.addBatch();

            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Thousand");
            }




        }




        if (i > 0) {
            preparedStatement.executeBatch();

            System.out.print("Add Remaining");

        }

    }
    preparedStatement.close();
    System.out.println("Successfully commited changes to the database!");

}

index method (calling the above method):

public void index() throws Exception {
       // Connection con1 = con.connect();
        try {

            // Connection con1=con.connect();
           // Connection con1 = con.connect();
            Statement statement = con1.createStatement();

            ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active' LIMIT 5");


            while (rs.next()) {
                // get the filepath of the PDF document
                 path1 = rs.getString(2);
               int getNum = rs.getInt(1);
                // while running the process, update status : Processing
                //updateProcess_DB(getNum);
                Statement test = con1.createStatement();
                test.executeUpdate("update filequeue SET STATUS ='Processing' where UniqueID="+getNum);



                try {
                    // call the index function


                    /*Indexing process = new Indexing();

                    process.index(path1);*/

                    PDDocument document = PDDocument.load(new File(path1));

                    if (!document.isEncrypted()) {

                        PDFTextStripper tStripper = new PDFTextStripper();
                        for(int p=1; p<=document.getNumberOfPages();++p) {
                            tStripper.setStartPage(p);
                            tStripper.setEndPage(p);
                            String pdfFileInText = tStripper.getText(document);
                            processText(pdfFileInText);
                        }


                        }

Answer:

Your current code uses the string pdfFileInText, which is gathered from tStripper.getText(document); and holds the whole document at once. First refactor everything you do with this string (it starts with pdfFileInText.split) into a separate method, e.g. processText. Then change your code to this:

PDFTextStripper tStripper = new PDFTextStripper();
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
    tStripper.setStartPage(p); // 1-based
    tStripper.setEndPage(p); // 1-based
    String pdfFileInText = tStripper.getText(document);
    processText(pdfFileInText);
}

The new code processes each page separately. This way you'll be able to do the database inserts in smaller steps and you won't have to store all the words of the documents, only the words of one page.
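To illustrate the "smaller steps" idea without a database, here is a rough sketch of what a processText-style helper could look like. Everything in it is an assumption made for illustration: the class WordBatcher, the batch size parameter and the sink callback are hypothetical, and the sink merely stands in for the addBatch()/executeBatch() calls on a single reused PreparedStatement. The word-cleaning regex is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: cleans the words of one page and hands them over in fixed-size batches. */
final class WordBatcher {

    /** Returns the number of words passed to the sink. */
    static int processText(String pageText, int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int total = 0;
        for (String token : pageText.split("\\s+")) {
            // Strip non-word characters from the beginning and the end of the token.
            String word = token.replaceAll("(\\W+$)|(^\\W+)", "");
            if (word.isEmpty()) {
                continue;
            }
            batch.add(word);
            total++;
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // stand-in for executeBatch()
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the remainder of the page
        }
        return total;
    }

    public static void main(String[] args) {
        processText("Hello, world!\n(foo) bar", 3, batch -> System.out.println("flush " + batch));
    }
}
```

Only one page of text and at most one batch of words are held in memory at any time, which is the point of the per-page loop above.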

Question:

I am using the following code for extracting images from a PDF which is in PDF/A-1a format, but I am not able to get the images.

List<PDPage> list = document.getDocumentCatalog().getAllPages();

String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {

    PDResources pdResources = page.findResources();

    Map pageImages = pdResources.getImages();
    if (pageImages != null) {
        InputStream xmlInputStream = null;
        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);

            System.out.println(convertStreamToString(xmlInputStream));
            System.out.println(pdxObjectImage.hashCode());
            System.out.println(pdxObjectImage.getColorSpace().getJavaColorSpace().isCS_sRGB());

            pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
            totalImages++;

            break;
        }
    }
}

I am able to extract images from normal PDFs using the above code, but I am not able to extract them from PDF/A-1a PDFs. It seems the following line

PDResources pdResources = page.findResources(); 

is not returning images. I have even tried page.getResources() but still do not get any images. I have also tried iText, but it does not give me any images either.

If i try to convert the page of PDF to image using the following code

BufferedImage bufferedImage = page.convertToImage();
File outputfile = new File(destinationDir+"image1.JPEG");
ImageIO.write(bufferedImage, "JPEG", outputfile);

these images seem to have no metadata associated with them, so I still cannot tell their DPI or whether they are color or grey scale.

Currently I am using PDFBox for this. I have already spent 2 days searching on Google, but I still haven't found any code or documentation for doing this.

How to do this in java ??

Is it possible to get DPI or whether the pdf is color or black and white without extracting the images ??


Answer:

Your problems are a combination of two problems:

1) the "break;". Your file has two images. The first one is transparent or grey or whatever and JPEG encoded, but it isn't the one you want. The second one is the one you want but the break aborts after the first image. So I just changed a code segment of yours to this:

while (imageIter.hasNext())
{
     String key = (String) imageIter.next();
     PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
     System.out.println(totalImages);
     pdxObjectImage.write2file("C:\\SOMEPATH\\" + fileName + "_" + totalImages);
     totalImages++;

     //break;
 }

2) Your second image (the interesting one) is JBIG2 encoded. To decode this, you need to add the levigo JBIG2 ImageIO plugin to your class path. If you don't, you'll get this message in 1.8.8, unless you disabled logging:

ERROR [main] org.apache.pdfbox.filter.JBIG2Filter:69 - Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.

(You didn't get that error message because it is the second one that is JBIG2 encoded)

Three bonus hints:

3) if you created this image yourself, e.g. on a photocopy machine, find out how to get PDF images without JBIG2 compression, because JBIG2 compression is somewhat risky.

4) don't use pdResources.getImages(), the getImages call is deprecated. Instead, use getXObjects(), and then check the type of what you get when iterating.

 Iterator imageIter = pageImages.keySet().iterator();
 while (imageIter.hasNext())
 {
     String key = (String) imageIter.next();
     Object o = pageImages.get(key);
     if (o instanceof PDXObjectImage)
     {
         PDXObjectImage pdxObjectImage = (PDXObjectImage) o;

         // do stuff
     }
 }

5) use a foreach loop.

And if it wasn't already obvious: this has nothing to do with PDF/A :-)

6) I forgot you also asked how to see if it is a b/w image, here's some simple code (not optimized) that I mentioned in the comments:

BufferedImage bim = pdxObjectImage.getRGBImage();

boolean bwImage = true;

int w = bim.getWidth();
int h = bim.getHeight();
for (int y = 0; y < h; y++)
{
    for (int x = 0; x < w; x++)
    {
        Color c = new Color(bim.getRGB(x, y));
        int red = c.getRed();
        int green = c.getGreen();
        int blue = c.getBlue();
        if (red == 0 && green == 0 && blue == 0)
        {
            continue;
        }
        if (red == 255 && green == 255 && blue == 255)
        {
            continue;
        }
        bwImage = false;
        break;
    }
    if (!bwImage)
        break;
}
System.out.println(bwImage);
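Since you also asked about color versus grey scale, the same pixel scan can be extended to tell the three cases apart. This is a sketch only: the class ImageColorCheck and its enum are hypothetical names, it works on any decoded BufferedImage (for example the one returned by getRGBImage() above), and it looks at pixel values only, not at the color space declared in the PDF.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: classifies a decoded image by scanning its pixels. */
final class ImageColorCheck {

    enum Kind { BLACK_WHITE, GRAYSCALE, COLOR }

    static Kind classify(BufferedImage image) {
        Kind kind = Kind.BLACK_WHITE;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                if (red != green || green != blue) {
                    return Kind.COLOR; // one colored pixel is enough
                }
                if (red != 0 && red != 255) {
                    kind = Kind.GRAYSCALE; // a grey level other than pure black or white
                }
            }
        }
        return kind;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0x808080); // one mid-grey pixel, the rest stays black
        System.out.println(classify(image));
    }
}
```

Like the loop above, this is not optimized; for large scans it may be enough to sample a subset of the pixels.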

Question:

I have written the following small Java main method. It takes in a (hardcoded for testing purposes!) PDF document I know contains active elements in the form and need to flatten it.

public static void main(String [] args) {

    try {
        // for testing
        Tika tika = new Tika();
        String filePath = "<path-to>/<pdf-document-with-active-elements>.pdf";
        String fileName = filePath.substring(0, filePath.length() -4);
        File file = new File(filePath);
        if (tika.detect(file).equalsIgnoreCase("application/pdf")) {
            PDDocument pdDocument = PDDocument.load(file);
            PDAcroForm pdAcroForm = pdDocument.getDocumentCatalog().getAcroForm();
            if (pdAcroForm != null) {
                pdAcroForm.flatten();
                pdAcroForm.refreshAppearances();

                pdDocument.save(fileName + "-flattened.pdf");
            }
            pdDocument.close();
        }
    }
    catch (Exception e) {
        System.err.println("Exception: " + e.getLocalizedMessage());
    }
}

What kind of test would assert the File(<path-to>/<pdf-document-with-active-elements>-flattened.pdf) generated by this code would, in fact, be flat?


Answer:

What kind of test would assert that the file generated by this code would, in fact, be flat?

Load that document anew and check whether it has any form fields in its PDAcroForm (if there is a PDAcroForm at all).

If you want to be thorough, also iterate through the pages and ensure that there are no Widget annotations associated with them anymore.

And to really be thorough, additionally determine the field positions and contents before flattening and apply text extraction at those positions to the flattened pdf. This verifies that the form has not merely been dropped but indeed flattened.

Question:

I am using android pdf document library to convert image into pdf but it is generating very large size pdf.

PdfDocument document = new PdfDocument();
PdfDocument.PageInfo pageInfo =new 
PdfDocument.PageInfo.Builder(bitmap.getWidth(), bitmap.getHeight(), 1).create();                            
PdfDocument.Page  page = document.startPage(pageInfo);
Bitmap scaledBitmap = Bitmap.createScaledBitmap(bitmap, bitmap.getWidth(), 
bitmap.getHeight(),false);                        

Canvas canvas = page.getCanvas();
canvas.drawBitmap(scaledBitmap, 0f, 0f, null);
document.finishPage(page);

document.writeTo(new FileOutputStream(Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOCUMENTS)+"/"+newPDFNameSingle));
document.close();

Here is the Apache PDFBox implementation, but it cuts off the image in the output PDF:

PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(document, page);
InputStream inputStream = new FileInputStream(tempFile);
PDImageXObject ximage = JPEGFactory.createFromStream(document,inputStream);

contentStream.drawImage(ximage, 20, 20);
contentStream.close();

document.save(Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOCUMENTS)+"/"+newPDFNameSingle);
document.close();

How can I generate a PDF of regular size? My image is 100 kb in size, but the generated PDF is a 1 mb file.


Answer:

In your android pdf document library code you set the page size to the image height and width values

PdfDocument.PageInfo.Builder(bitmap.getWidth(), bitmap.getHeight(), 1).create();                            

and draw the image at the origin:

canvas.drawBitmap(scaledBitmap, 0f, 0f, null);

You can do the same in your PDFBox code:

PDDocument document = new PDDocument();

PDImageXObject ximage = JPEGFactory.createFromStream(document,imageResource);

PDPage page = new PDPage(new PDRectangle(ximage.getWidth(), ximage.getHeight()));
document.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawImage(ximage, 0, 0);
contentStream.close();

(DrawImage test testDrawImageToFitPage)

Alternatively, as discussed in comments, you can set the current transformation matrix before drawing the image to scale it down to fit the page.
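The scale factor for such a fit is simple arithmetic. Here is a minimal sketch, assuming a hypothetical helper class FitToPage; it only computes the target rectangle and leaves the drawing to you. As far as I know, PDFBox 2.x also has a drawImage(image, x, y, width, height) overload that could take these values instead of setting the transformation matrix by hand.

```java
/** Hypothetical helper: computes where to draw an image so that it fits on a page. */
final class FitToPage {

    /**
     * Returns {x, y, width, height} in user space units for an image that is
     * scaled down (never up) to fit the page and centred on it.
     */
    static float[] fit(float imageWidth, float imageHeight, float pageWidth, float pageHeight) {
        // The tighter of the two ratios decides, so the aspect ratio is preserved.
        float scale = Math.min(pageWidth / imageWidth, pageHeight / imageHeight);
        scale = Math.min(scale, 1f); // small images keep their size
        float width = imageWidth * scale;
        float height = imageHeight * scale;
        return new float[] {(pageWidth - width) / 2f, (pageHeight - height) / 2f, width, height};
    }

    public static void main(String[] args) {
        // A 1224 x 792 image on a 612 x 792 (US Letter) page is halved.
        float[] box = fit(1224f, 792f, 612f, 792f);
        System.out.println(box[0] + ", " + box[1] + ", " + box[2] + " x " + box[3]);
    }
}
```

Whether to centre the image or keep it at the origin is a design choice; the centring here is only one option.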

Question:

I'm currently working with PDFs on a Java application that makes some modifications to PDF Documents.

Currently, the signing of these PDFs is working, as I am using classes such as FileInputStream and FileOutputStream. Basically, I copy the original documents from a source folder and then put them in an output folder. I am using the PDDocument class with pdfbox 1.8.9.

However, I want to use the same file, meaning I don't intend to copy the PDFs anymore. I want to grab the document, sign it, and overwrite the original one.

Since I learned that having FileInputStream and FileOutputStream pointing at the same file is not a good idea, I simply tried to use the File class.

I tried the following:

File file = new File(locOriginal);
PDDocument doc = PDDocument.load(file);
PDSignature signature = new PDSignature();
Overlay overlay = new Overlay();

// The signature itself. It has not been modified
signature.setFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED); // default filter

signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);

if (msg.getAreaNegocio().startsWith("A")) {
    signature.setName(this.campoCertificadoAcquiring);
    signature.setLocation(this.localCertificadoAcquiring);
    signature.setReason(this.razaoCertificadoAcquiring);
} else {
    signature.setName(this.campoCertificadoIssuing);
    signature.setLocation(this.localCertificadoIssuing);
    signature.setReason(this.razaoCertificadoIssuing);
}

// register signature dictionary and sign interface
doc.addSignature(signature, this);
doc.saveIncremental(file.getAbsolutePath());

doc.close();

My PDF file does get overwritten as intended, yet the signature is not valid anymore when I open the file. I read these questions... Does it relate to any of these issues? What can I do to solve this?

PDFBox 1.8.10: Fill and Sign PDF produces invalid signatures

PDFBox - opening and saving a signed pdf invalidates my signature

Thanks for the help!


Answer:

The 1.8.* saveIncremental(filename) was buggy until PDFBox 1.8.16. This is described in PDFBOX-4312 but is confusing because the user deleted most of his own messages and had multiple other problems. If you insist on using an outdated version (that has a security issue), then try this code instead of calling saveIncremental(filename):

//BEWARE: do not "optimize" this method by using buffered streams,
// because COSStandardOutputStream only allows seeking
// if a FileOutputStream is passed, see PDFBOX-4312.
FileInputStream fis = new FileInputStream(fileName);
byte[] ba = IOUtils.toByteArray(fis);
fis.close();
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(ba);
fis = new FileInputStream(fileName);
saveIncremental(fis, fos);

And no, I don't think that the questions you linked to are related to your issue.

Btw I don't consider overwriting the original file to be a good idea. You are risking the loss of your file if there is an error or a power loss.
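If the original really must be replaced, one common way to reduce that risk is to write the new content to a temporary file in the same folder and only then move it over the original. This is a general file-handling sketch, not something PDFBox provides: the class SafeOverwrite is a hypothetical name, and the signed bytes are assumed to have been produced elsewhere already.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: replaces a file only after the new content has been fully written. */
final class SafeOverwrite {

    static void replace(Path target, byte[] newContent) throws IOException {
        // Same directory as the target, so the move does not cross file systems.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "signed-", ".tmp");
        try {
            Files.write(temp, newContent);
            try {
                // Where supported, the swap happens in one step.
                Files.move(temp, target,
                        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback: still replaces the target, but without the atomicity guarantee.
                Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo-", ".pdf");
        replace(file, "new content".getBytes());
        System.out.println(new String(Files.readAllBytes(file)));
        Files.delete(file);
    }
}
```

If writing fails halfway, the original file is still untouched and only the temporary file is lost.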

See also the comment by mkl: setFilter() is usually called with parameter PDSignature.FILTER_ADOBE_PPKLITE.

Question:

I'm trying to sign a pdf using this method, but get a document with no size:

public static void sign(PDDocument doc) throws KeyStoreException, NoSuchAlgorithmException, CertificateException,
        IOException, UnrecoverableKeyException {
    System.out.println("Document pages ? " + doc.getNumberOfPages());
    KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
    ks.load(VisibleSignature.class.getResourceAsStream(CERT_FILE), ALIAS_PASS);
    System.out.println("KeyStore is null ? " + (ks == null));
    VisibleSignature vs = new VisibleSignature(ks, ALIAS_PASS.clone());
    InputStream is = Resource.get(IMAGE_FILE);
    int page = 1;
    vs.setVisibleSignDesigner(doc, 0, 0, -50, is, page);
    is.close();
    vs.setVisibleSignatureProperties("Test", "Test", "Test", 0, page, true);
    PDSignature signature = new PDSignature();
    PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
    System.out.println("Acroform is null ? " + (acroForm == null));
    System.out.println("Acroform getNeedAppearances ? " + (acroForm.getNeedAppearances()));
    if (acroForm != null && acroForm.getNeedAppearances())
        if (acroForm.getFields().isEmpty())
            acroForm.getCOSObject().removeItem(COSName.NEED_APPEARANCES);
        else
            System.out.println("/NeedAppearances is set, signature may be ignored by Adobe Reader");
    signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
    signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
    if (vs.visibleSignatureProperties != null) {
        vs.visibleSignatureProperties.buildSignature();
        signature.setName(vs.visibleSignatureProperties.getSignerName());
        signature.setLocation(vs.visibleSignatureProperties.getSignerLocation());
        signature.setReason(vs.visibleSignatureProperties.getSignatureReason());
        System.out.println("SignerName " + vs.visibleSignatureProperties.getSignerName());
    }
    signature.setSignDate(Calendar.getInstance());
    vs.signatureOptions = new SignatureOptions();
    vs.signatureOptions.setVisualSignature(vs.visibleSignatureProperties.getVisibleSignature());
    vs.signatureOptions.setPage(vs.visibleSignatureProperties.getPage() - 1);
    doc.addSignature(signature, vs.signatureOptions);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    doc.saveIncremental(baos);
    doc.close();
    IOUtils.closeQuietly(vs.signatureOptions);
    byte[] content = baos.toByteArray();
    System.out.println("Content length: >>>>>>>>>>>>>>>>>>> " + content.length);
}

And this is what I get in eclipse log:

18:50:25,702 INFO [default task-14] stdout - Document pages ? 1

18:50:25,740 INFO [default task-14] stdout - KeyStore is null ? false

18:50:25,779 INFO [default task-14] stdout - Acroform is null ? false

18:50:25,780 INFO [default task-14] stdout - Acroform getNeedAppearances ? false

18:50:25,782 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - PDF Structure has been created

18:50:25,782 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDFTemplateCreator - pdf building has been started

18:50:25,782 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - ProcSet array has been created

18:50:25,782 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - PDF page has been created

18:50:25,783 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - AcroForm has been created

18:50:25,788 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Signature field has been created

18:50:25,788 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - PDSignature has been created

18:50:25,788 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - AcroForm dictionary has been created

18:50:25,789 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Matrix has been added

18:50:25,792 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Signature rectangle has been created

18:50:25,793 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Formatter rectangle has been created

18:50:25,815 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Visible Signature Image has been created

18:50:25,815 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Holder form stream has been created

18:50:25,816 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Holder form resources have been created

18:50:25,816 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Holder form has been created

18:50:25,816 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - PDF appearance dictionary has been created

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Stream of another form (inner form - it will be inside holder form) has been created

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Resources of another form (inner form - it will be inside holder form)have been created

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Another form (inner form - it will be inside holder form) has been created

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Now inserted inner form inside holder form

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Created image form stream

18:50:25,817 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Created image form resources

18:50:25,818 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Created image form

18:50:25,818 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Created background layer form

18:50:25,818 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Inserted ProcSet to PDF

18:50:25,818 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Injected appearance stream to pdf

18:50:25,818 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - Visible signature has been created

18:50:25,819 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDVisibleSigBuilder - WidgetDictionary has been created

18:50:25,825 DEBUG [default task-14] org.apache.pdfbox.cos.COSStream - Create InputStream called without data being written before to stream.

18:50:25,825 INFO [default task-14] org.apache.pdfbox.pdmodel.interactive.digitalsignature.visible.PDFTemplateCreator - stream returning started, size= 21301

18:50:25,825 INFO [default task-14] stdout - SignerName Test

18:50:25,857 INFO [default task-14] stdout - Content length: >>>>>>>>>>>>>>>>>>> 0

Does anyone know what is going on here?


Answer:

In your code you don't provide the document with a SignatureInterface implementation it can use to sign with. Not providing that in this use case makes PDFBox assume you will create the signature externally (for an example look at the original CreateVisibleSignature method sign). In that case the result is written to the output stream after you set the signature using the ExternalSigningSupport method setSignature. As you don't do so, your ByteArrayOutputStream baos remains empty.

But you do have a SignatureInterface instance which you already initialized with certificate and key material: your VisibleSignature vs. Thus, I assume your not providing a SignatureInterface wasn't done on purpose.

So to provide the SignatureInterface you have to use a PDDocument.addSignature overload with a SignatureInterface parameter. E.g. replacing your call

doc.addSignature(signature, vs.signatureOptions);

by

doc.addSignature(signature, vs, vs.signatureOptions);

makes your code work as desired.

Question:

I am writing a Java application to work as a template reader and writer. I have had success working with text, but I am having some difficulty with the images...

Getting the images was the easy part - using a class extending PDFStreamEngine

package readingPdf;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;

public class ImageStripper extends PDFStreamEngine {

    ArrayList<Object  []> imagesData = null;
    public ImageStripper() throws IOException {
        // preparing PDFStreamEngine
        addOperator(new Concatenate());
        addOperator(new DrawObject());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new SetMatrix());
        imagesData = new ArrayList<Object[]>();
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if ("Do".equals(operation)) {
            COSName objectName = (COSName) operands.get(0);
            // get the PDF object
            PDXObject xobject = getResources().getXObject(objectName);
            // check if the object is an image object
            if (xobject instanceof PDImageXObject) {
                Object[] imageData = new Object[3];
                PDImageXObject image = (PDImageXObject) xobject;

                Matrix ctmNew = getGraphicsState().getCurrentTransformationMatrix();

                // position of image in the pdf in terms of user space units
                System.out.println("position in PDF = " + ctmNew.getTranslateX() + ", " + ctmNew.getTranslateY()
                        + " in user space units");

                imageData[0] = ctmNew.getTranslateX();// xPos
                imageData[1] = ctmNew.getTranslateY();// yPos

                imageData[2] = image;//Image

                imagesData.add(imageData);

            } else if (xobject instanceof PDFormXObject) {
                PDFormXObject form = (PDFormXObject) xobject;
                showForm(form);
            }
        } else {
            super.processOperator(operator, operands);
        }
    }

    public ArrayList<Object[]> getImagesList(){
        return imagesData;
    }
}

next is the implementation thereof

public class PDFManager{

    private PDFParser parser;
    private PDDocument pdDoc;
    private PDDocument retDoc;
    private COSDocument cosDoc;
    private PDPage page;
    private String filePath;
    private File file; 

    public PDDocument transferImage() throws IOException {
        this.pdDoc = null;
        this.cosDoc = null;

        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r"));
        parser.parse();
        cosDoc = parser.getDocument();
        pdDoc = new PDDocument(cosDoc);

        //Get Image Data
        ImageStripper imageStripper = new ImageStripper();
        imageStripper.processPage(pdDoc.getPage(0));
        ArrayList<Object []> imageList = imageStripper.getImagesList();

        //Close Doc
        pdDoc.close();
        cosDoc.close();

        //Create new PDF Doc
        retDoc = new PDDocument();
        page = new PDPage(new PDRectangle(PDRectangle.A4.getHeight(), PDRectangle.A4.getWidth())); 
        retDoc.addPage(page);

        PDPageContentStream cs = new PDPageContentStream(retDoc, page, AppendMode.OVERWRITE, true);

        for(int pos = 0; pos < imageList.size() ; pos++) {
            Object [] imageData = imageList.get(pos);

            float xPos = (float)imageData[0];
            float yPos = (float)imageData[1];
            PDImageXObject image = (PDImageXObject)imageData[2];
            cs.drawImage(image, xPos, yPos);
        }

        cs.close();
        return retDoc;
    }

    public static void main(String[] args) throws IOException {

        PDFManager pdfManager = new PDFManager();

        PDDocument doc =pdfManager.ToText("c:\\test\\test.pdf"); 

        doc.save("c:\\test\\test2.pdf");
        doc.close();
    }
}

The problem appears at the point where I call cs.drawImage. All the code executes without any issue until I try to save the new file, at which point I get the exception "COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?"

I suspect that there is still metadata linking the image to the original document it was extracted from, because calling PDImageXObject.createFromFile("c:\\test\\testImage.png", doc) returns a new instance of PDImageXObject which writes perfectly. As the PDDocument being written to is passed into PDImageXObject, I suspect the two get linked in some way.

I cannot save the image to a temp location as this is just testing for a POC.

Any assistance would be appreciated.


Answer:

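@Tilman Hausherr

Thanks for the solution.

The extracted PDImageXObject instances still read their data from the original document, so that document has to stay open until the new file has been saved. I moved the closing of the original document into a separate method, which I call after writing the file:

public void closeFiles() throws IOException {
    pdDoc.close();
    cosDoc.close();
}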

Question:

I am trying to split a document with a decent 300 pages using Apache PDFBOX API V2.0.2. While trying to split the pdf file to single pages using the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I receive the following exception

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Which indicates that the GC is taking much time to clear the heap that is not justified by the amount reclaimed.

There are numerous JVM tuning methods that can solve the situation, however, all of these are just treating the symptom and not the real issue.

One final note: I am using JDK 6, hence using the new Java 8 Consumer is not an option in my case. Thanks.

Edit:

This is not a duplicate question of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 as:

 1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270 pages 13.8MB PDF file and after slicing
    the size of each slice is an average of 80KB with total size of
    30.7MB.
 2. The Split throws the exception even before it returns the splitted parts.

I found that the split can pass as long as I am not passing the whole document, instead I pass it as "Batches" with 20-30 pages each, which does the job.


Answer:

PDFBox keeps every part produced by the split operation on the heap as a PDDocument object, so the heap fills up quickly. Even if you call close() after every round of the loop, the GC cannot reclaim memory as fast as it is being filled.

One option is to perform the split in batches, each batch being a relatively manageable chunk (10 to 40 pages):

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and inclusive at both ends,
        // so each batch covers start..end without overlapping the next one
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document once all batches are written
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // name each file after its page number in the source document
        String pdfFullPath = document.getDocumentInformation().getTitle() + "-" + (start + index) + ".pdf";
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            splittedDocument.save(pdfFullPath);
        } finally {
            // release each part right after saving so the heap does not fill up
            splittedDocument.close();
        }
    }
}
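
The page ranges for the batches are easy to get wrong by one, so it can help to compute them separately from any PDF handling. The following is a minimal sketch in plain Java, assuming (as Splitter.setStartPage and setEndPage do) that page numbers are 1-based and that both ends of a range are inclusive. The class and method names (BatchRanges, batchRanges) are made up for this illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchRanges {

    /** Returns {start, end} pairs, 1-based and inclusive, covering pages 1..totalPages. */
    static List<int[]> batchRanges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> ranges = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // the last batch may be shorter than batchSize
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 270 pages, as in the question, split into batches of 50
        for (int[] range : batchRanges(270, 50)) {
            System.out.println("start: " + range[0] + " end: " + range[1]);
        }
    }
}
```

Each pair can then be handed to a split(document, start, end) method like the one in the answer; consecutive ranges share no page, so no page is written twice.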

Question:

I am using PDFBOX 1.8.10.

If I load the PDF file into a byte array, it works:

File file = new File(args[0]);
FileInputStream fis = new FileInputStream(file);   //Normal PDF File
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
try {
    for (int readNum; (readNum = fis.read(buf)) != -1;) {
        bos.write(buf, 0, readNum); //no doubt here is 0
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
byte[] bytes = bos.toByteArray();
CheckIsPDF(bytes);
pdf = PDDocument.load(new ByteArrayInputStream(bytes)); //**No exception here**

But if the same file is stored in a database and I try to read it, I get the following exception: "java.io.IOException: Error: End-of-File, expected line".

This is the code to read from the DB and populate the PDF:

List<byte[]> forms; //this gets populated from database. The data stored in DB is HEX.
for(byte[] file : forms){
    try{
        int var=file.length;

        pdDocument = PDDocument.load(new ByteArrayInputStream(file)); //**Exception** 

        fieldLists = PDFFormUtils.printFields( pdDocument );

    }
    catch(Exception e){
        e.printStackTrace();
    }
}

Answer:

As discussed in the comments, the cause of the problem was that the content of the blob wasn't a PDF. The blob content is:

43 3a 5c 4d 42 43 50 4f 53 5c 52 65 6e 74 2e 70 64 66

A PDF file starts with "%PDF", which in hex is

25 50 44 46

The hex sequence you mention translates to

C:\MBCPOS\Rent.pdf

which means that somebody saved the file name instead of the file contents into the blob.
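
A cheap sanity check before handing bytes to PDDocument.load can catch this kind of mix-up early. The sketch below is plain Java and only tests whether the data begins with the four bytes of "%PDF"; it does not validate the rest of the file. The names PdfHeaderCheck, startsWithPdfHeader and fromHex are hypothetical helpers for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class PdfHeaderCheck {

    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True if the data begins with the bytes 25 50 44 46 ("%PDF"). */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Parses a whitespace-separated hex dump such as "25 50 44 46" into bytes. */
    static byte[] fromHex(String hexDump) {
        String[] parts = hexDump.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // the blob content quoted in the answer
        byte[] blob = fromHex("43 3a 5c 4d 42 43 50 4f 53 5c 52 65 6e 74 2e 70 64 66");
        System.out.println("looks like a PDF: " + startsWithPdfHeader(blob));
        System.out.println("as text: " + new String(blob, StandardCharsets.US_ASCII));
    }
}
```

Printing the first bytes as text, as main does here, is often the quickest way to see what was actually stored.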

Question:

I have some questions about parsing PDFs:

  1. What is the purpose of using the PDDocument.loadNonSeq method that takes a scratch/temporary file?

  2. I have a big PDF and I need to parse it and get its text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (the stripper has setStartPage(n) and setEndPage(n), where n increases by 1 on every loop iteration). Is loadNonSeq more memory-efficient than load?

For example

File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, READ_WRITE));
int numpages = doc.getNumberOfPages();
for (int index = 1; index <= numpages; index++) {
    PDFTextStripper stripper = new PDFTextStripper();
    Writer destination = new StringWriter();
    String xml = "";
    stripper.setStartPage(index);
    stripper.setEndPage(index);
    stripper.writeText(doc, destination);
    .... //filtering text and then convert it in xml
}

Is the code above a correct use of loadNonSeq, and is it good practice to read a PDF page by page without wasting memory? I read page by page because I need to write the text to XML using an in-memory DOM (with this stripping technique, I decided to produce one XML document per page).


Answer:

  1. What is the purpose of using the PDDocument.loadNonSeq method that takes a scratch/temporary file?

PDFBox implements two ways to read a PDF file.

  • loadNonSeq is the way documents should be loaded
  • load is the way documents should not be loaded, but one might try to repair files with broken cross references this way

In the 2.0.0 development branch, the algorithm formerly used for loadNonSeq is now used for load and the algorithm formerly used for load is not used anymore.

  2. I have a big PDF and I need to parse it and get its text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (the stripper has setStartPage(n) and setEndPage(n), where n increases by 1 on every loop iteration). Is loadNonSeq more memory-efficient than load?

Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs because it only reads objects still referenced from the cross-reference table, while load can keep more in memory.

I don't know, though, whether using a scratch file makes a big difference.

Is it good practice to read a PDF page by page without wasting memory?

Internally PDFBox parses the given range page after page, too. Thus, if you process the stripper output page by page, it certainly is OK to parse it page by page.

Question:

I am trying to insert a JPG file, 1680 wide and 1080 high, into a PDF document using the Java PDFBox library. I am trying to insert the image at 20,20 of the document, but only 1/4 of the image shows up in the PDF. I need to insert multiple images, each on a separate page. Here is my code; please tell me what I did wrong.

import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class pdfAddingPages {

public static void main(String[] args) {
     //Creating PDF document object 
      PDDocument document = new PDDocument();


         PDPage Page1 = new PDPage();

         //Adding the blank page to the document
         document.addPage( Page1 );


    //Creating PDImageXObject object
    PDImageXObject pdImage = null;
    try {
        pdImage = PDImageXObject.createFromFile("C:/Test/image1680x1080.jpeg",document);
    } catch (IOException e4) {
        // TODO Auto-generated catch block
        e4.printStackTrace();
    }

      //creating the PDPageContentStream object
      PDPageContentStream contents;
    try {
        contents = new PDPageContentStream(document, Page1);
        contents.drawImage(pdImage, 20, 20);  // position at 20,20
        //contents.drawImage(pdImage, 0, 0, 1638, 1080);

        contents.close();
        //int chartWidth = 1638;  //2000, 1900, 1800
            //int chartHeight = 1080 ;

    } catch (IOException e3) {
        // TODO Auto-generated catch block
        e3.printStackTrace();
    }
    //Saving the document
    try {
        document.save("C:/Test/my_doc_image.pdf");
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
      System.out.println("PDF created");

      //Closing the document
      try {
        document.close();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

}

Answer:

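A page created with new PDPage() is US Letter size (612 x 792 user space units), and drawImage(pdImage, 20, 20) draws the image at one unit per pixel, so a 1680 x 1080 image extends far beyond the page and only part of it is visible. To draw the image at a smaller size, use a scale factor: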

contents.drawImage(pdImage, 20, 20, pdImage.getWidth() / 4, pdImage.getHeight() / 4); 
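
Instead of hard-coding the divisor 4, the scale factor can be derived from the page and image sizes. The following plain-Java sketch shows one way to do that, assuming a uniform margin on all sides and that the image should never be enlarged; FitImageToPage and fitScale are hypothetical names, not PDFBox API.

```java
public class FitImageToPage {

    /** Largest scale (at most 1) at which the image fits inside the page minus margins. */
    static float fitScale(float imageWidth, float imageHeight,
                          float pageWidth, float pageHeight, float margin) {
        float availableWidth = pageWidth - 2 * margin;
        float availableHeight = pageHeight - 2 * margin;
        float scale = Math.min(availableWidth / imageWidth, availableHeight / imageHeight);
        return Math.min(1f, scale); // never enlarge a small image
    }

    public static void main(String[] args) {
        // 1680 x 1080 image on a US Letter page (612 x 792) with a margin of 20
        float scale = fitScale(1680, 1080, 612, 792, 20);
        System.out.println("scale = " + scale);
        System.out.println("draw size = " + (1680 * scale) + " x " + (1080 * scale));
    }
}
```

The result would then be passed on as contents.drawImage(pdImage, 20, 20, pdImage.getWidth() * scale, pdImage.getHeight() * scale), with the page dimensions taken from the page's media box.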

Question:

How to convert PDDocument (the pdf document contains words and images, if it is possible) to Base64 String? Is there any suggestion of code. Please.


Answer:

This answer assumes that you are using JDK 8 or higher; if not, please see here.

import java.util.Base64;

...

ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos);
String base64String = Base64.getEncoder().encodeToString(baos.toByteArray());
doc.close(); // don't forget to close your document

Question:

I am trying to merge pdf files but getting error while opening the file. My code is :

    public void merge(){
        byte[] pdf1 = tobyte("hello");
        byte[] pdf2 = tobyte("world");
        PDFMergerUtility merger = new PDFMergerUtility();
        merger.addSource(new ByteArrayInputStream(pdf1));
        merger.addSource(new ByteArrayInputStream(pdf2));
        merger.setDestinationFileName("final.pdf");
        merger.mergeDocuments();
    }

    static byte[] tobyte(String message) {
        PDDocument doc = new PDDocument();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        doc.save(baos);
        return baos.toByteArray();
    }

Answer:

Here is the code that works

//Loading an existing PDF document
File file1 = new File("sample1.pdf");
PDDocument doc1 = null;
try {
    doc1 = PDDocument.load(file1);
} catch (IOException e1) {
    e1.printStackTrace();
}

File file2 = new File("sample2.pdf");
PDDocument doc2 = null;
try {
    doc2 = PDDocument.load(file2);
} catch (IOException e1) {
    e1.printStackTrace();
}

//Instantiating PDFMergerUtility class
PDFMergerUtility PDFmerger = new PDFMergerUtility();

//Setting the destination file
PDFmerger.setDestinationFileName("merged.pdf");

//adding the source files
PDFmerger.addSource(file1);
PDFmerger.addSource(file2);

//Merging the two documents
try {
    PDFmerger.mergeDocuments();
} catch (COSVisitorException | IOException e) {
    e.printStackTrace();
}

System.out.println("Documents merged");
//Closing the documents
try {
    doc1.close();
} catch (IOException e) {
    e.printStackTrace();
}
try {
    doc2.close();
} catch (IOException e) {
    e.printStackTrace();
}

Question:

I'm generating a PDDocument in Java with code like this...

HashMap<Integer, PDPageContentStream> mPageContentStreamMap = new HashMap<>();
PDDocument doc = new PDDocument();
for (int i = 1; i <= mNumPages; i++) {
    PDPage page = new PDPage(PDRectangle.A4);
    page.setRotation(90);
    PDPageContentStream pageContentStream = new PDPageContentStream(doc, page);
    mPageContentStreamMap.put(i, pageContentStream);
    doc.addPage(page);
}

Then later save and close the document like this...

for (int i : mPageContentStreamMap.keySet()) {
    mPageContentStreamMap.get(i).close();
}

doc.save("test-filename");
doc.close();

This works fine on the first run; however when I run my program multiple times I get the following error

java.io.IOException: Scratch file already closed
at org.apache.pdfbox.io.ScratchFile.checkClosed(ScratchFile.java:390)
at org.apache.pdfbox.io.ScratchFileBuffer.<init>(ScratchFileBuffer.java:78)
at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:403)
at org.apache.pdfbox.cos.COSStream.createOutputStream(COSStream.java:208)
at org.apache.pdfbox.pdmodel.common.PDStream.createOutputStream(PDStream.java:224)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:259)
at org.apache.pdfbox.pdmodel.PDPageContentStream.<init>(PDPageContentStream.java:121)

If I re-run my program without the "doc.close();" line, this error goes away, but the output of the PDF is duplicated (i.e. a new PDF is generated, but with the content from the last PDF and the content from the current PDF).

Is there a way to close the stream and create multiple PDFs without running into the scratch file error?


Answer:

I had created a singleton object for my drawing logic, which meant that after the first run the same objects were reused when they shouldn't have been, because the input (what was being drawn) had changed.

Question:

I have a requirement to add watermark text at run time, i.e. at the time of document creation. My initial approach was to get all the pages from the document and add my text to those pages. It did work, but the problem is that wherever my watermark message comes, it is hiding my page content. Please see the code for my initial approach:

  List pages = document.getDocumentCatalog().getAllPages();
    float fontSize = 70.0f;
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = (PDPage) pages.get(i);
        PDRectangle pageSize = page.findMediaBox();
        float stringWidth = pdfFont.getStringWidth(text) * fontSize
                / 1000f;
        // calculate to center of the page
        int rotation = page.findRotation();
        boolean rotate = degree > 0;
        float pageWidth = rotate ? pageSize.getHeight() : pageSize
                .getWidth();
        float pageHeight = rotate ? pageSize.getWidth() : pageSize
                .getHeight();
        double centeredXPosition = rotate ? pageHeight / 2f
                : (pageWidth - stringWidth) / 2f;
        double centeredYPosition = rotate ? (pageWidth - stringWidth) / 2f
                : pageHeight / 2f;
        // append the content to the existing stream
        PDPageContentStream contentStream = new PDPageContentStream(
                document, page, true, true, true);
        contentStream.beginText();
        // set font and font size
        contentStream.setFont(pdfFont, fontSize);
        // set text color to light grey
        contentStream.setNonStrokingColor(240, 240, 240);
        if (rotate) {
            // rotate the text according to the page rotation
            contentStream.setTextRotation(degree, x, y);
        } else {
            contentStream.setTextTranslation(centeredXPosition,
                    centeredYPosition);
        }
        contentStream.drawString(text);
        contentStream.endText();
        contentStream.close();
    }

I have read about Overlay and tried it, so I changed my approach, because I think only Overlay can fulfill my requirement. My present approach is:

    public PDDocument createWatermarkText() {
    PDDocument watermarkDoc = new PDDocument();
    PDPage watermarkPage = new PDPage();
    try {

        watermarkDoc.addPage(watermarkPage);
        PDPageContentStream content = new PDPageContentStream(watermarkDoc,
                watermarkPage);
        content.setFont(pdfFont, fontSize);
        content.beginText();
        content.moveTextPositionByAmount(x, y);
        content.setNonStrokingColor(255, 0, 0);
        PDRectangle pageSize = watermarkPage.findMediaBox();
        float stringWidth = pdfFont.getStringWidth(text) * fontSize / 1000f;
        // int rotation = page.findRotation();
        boolean rotate = degree > 0;
        float pageWidth = rotate ? pageSize.getHeight() : pageSize
                .getWidth();
        float pageHeight = rotate ? pageSize.getWidth() : pageSize
                .getHeight();
        double centeredXPosition = rotate ? pageHeight / 2f
                : (pageWidth - stringWidth) / 2f;
        double centeredYPosition = rotate ? (pageWidth - stringWidth) / 2f
                : pageHeight / 2f;
        content.setTextRotation(degree, x, y);

        content.drawString(text);
        content.endText();
        content.close();
        // ColumnText.showTextAligned(writer.getDirectContentUnder(), align,
        // new Phrase(text, pdfFont), x, y,
        // degree);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return watermarkDoc;
}

and then calling this method

     PDDocument wDoc = createWatermarkText();
            //document.addPage(page);
             Overlay overlay = new Overlay();

             overlay.overlay(wDoc, document);

but this approach is not working either, and I'm getting a blank PDF. Any help is highly appreciated.


Answer:

This answer attempts to make the original approach of the OP work.

The problem of the original approach,

where ever my watermark message comes it is hiding my page content.

is caused by the fact that the PDFBox PDPageContentStream constructors add the new stream as the last content stream, so its operations are also executed last and consequently cover content drawn before.

To push the new content under the existing content, therefore, we must move the new stream to the front position among the page content streams.

To be able to do so, I first change the existing code: I enclose the watermark drawing code in saveGraphicsState and restoreGraphicsState. This is necessary to protect the original content from being influenced by state changes made by the watermark drawing code, e.g. text color changes.

...
PDPageContentStream contentStream = new PDPageContentStream(
        document, page, true, true, true);
contentStream.saveGraphicsState();
contentStream.beginText();
// set font and font size
contentStream.setFont(pdfFont, fontSize);
// set text color to light grey
contentStream.setNonStrokingColor(240, 240, 240);
if (rotate) {
    // rotate the text according to the page rotation
    contentStream.setTextRotation(degree, x, y);
} else {
    contentStream.setTextTranslation(centeredXPosition,
            centeredYPosition);
}
contentStream.drawString(text);
contentStream.endText();
contentStream.restoreGraphicsState();
contentStream.close();
...

With this change in place, we merely need to call the following method to push the watermark under the pre-existing content:

void pushUnder(PDDocument document)
{
    List<?> pages = document.getDocumentCatalog().getAllPages();
    float fontSize = 70.0f;
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = (PDPage) pages.get(i);
        COSBase contents = page.getCOSDictionary().getDictionaryObject(COSName.CONTENTS);
        if (contents instanceof COSStreamArray)
        {
            COSStreamArray contentsArray = (COSStreamArray) contents;
            COSArray newArray = new COSArray();
            newArray.add(contentsArray.get(0));
            newArray.add(contentsArray.get(contentsArray.getStreamCount() - 1));

            for (int j = 1; j < contentsArray.getStreamCount() - 1; j++)
            {
                newArray.add(contentsArray.get(j));
            }

            COSStreamArray newStreamArray = new COSStreamArray(newArray);
            page.getCOSDictionary().setItem(COSName.CONTENTS, newStreamArray);
        }
    }
}

(UnderlayText.java)

(If you look closely at the method, you'll see that we don't move the new stream to the first position but only to the second. We do so because the new PDPageContentStream(document, page, true, true, true) constructor call actually creates two new streams, one at the first position and one at the last, the first one saving the graphics state, the last one restoring it and then containing your operations. Moving the latter before the former would result in a restore graphics state operation at the beginning, which would be an error.)
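
The reordering itself is independent of PDFBox and can be illustrated on a plain list. The sketch below uses strings as stand-ins for the content streams; ReorderStreamsSketch and moveLastToSecond are hypothetical names, and the guard for lists with fewer than three elements is an assumption added for this illustration rather than something the answer's method does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReorderStreamsSketch {

    /** Moves the last element to index 1 and keeps everything else in order. */
    static <T> List<T> moveLastToSecond(List<T> streams) {
        if (streams.size() < 3) {
            // nothing to reorder: the last element is already at index 0 or 1
            return new ArrayList<T>(streams);
        }
        List<T> result = new ArrayList<T>(streams.size());
        result.add(streams.get(0));                              // "save graphics state" stream
        result.add(streams.get(streams.size() - 1));             // restore + watermark stream
        result.addAll(streams.subList(1, streams.size() - 1));   // original page content
        return result;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("save", "content1", "content2", "restore+watermark");
        System.out.println(moveLastToSecond(before));
    }
}
```

With the watermark stream directly after the saving stream, the original content is painted afterwards and therefore ends up on top of the watermark.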

Question:

Using 1.8.9

I want to cut a single PDF page into a multi-page PDF using crop boxes, but when I add more than one page to my PDDocument, the extra pages are not added at all.

Code example (the original PDPage is a parameter of my function) :

private static void splitPage(int nbOfCrops, PDPage myPage) throws IOException{

PDDocument pdfSplit = new PDDocument();
ArrayList<PDPage> pages = new ArrayList<PDPage>();


    float croppingHeight = (myPage.findCropBox().getUpperRightY()/nbOfCrops);

    for(int page = 0; page<nbOfCrops; page++){
        pages.add(myPage); //Creates multiple copies of myPage
    }

    int splits = 0;
    for(PDPage page: pages){
        PDRectangle cropBox = page.findCropBox();
        PDRectangle rectangle = new PDRectangle();

        rectangle.setUpperRightY((float) (cropBox.getUpperRightY() - (croppingHeight* (splits))));
        rectangle.setLowerLeftY((float) (cropBox.getUpperRightY() - (croppingHeight*(splits+ 1))));
        rectangle.setUpperRightX(cropBox.getUpperRightX());
        rectangle.setLowerLeftX(cropBox.getLowerLeftX());
        page.setCropBox(rectangle);

        pdfSplit.addPage(page);

        splits++;
    }
    try {
        pdfSplit.save("test.pdf");   
        System.out.println(pdfSplit.getNumberOfPages()); //Always returns 1
        pdfSplit.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

So, what should I do or modify to correctly add each cropped page?

My document (if you want to see what I want to do) :

http://www.cinemas-utopia.org/admin/grilles/toulouse/2015-06-02.pdf


Answer:

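I did this, with some improvements to upscale the PDF (a particular case of mine). The key change is that each cropped page is now a separate copy created with importPage; the original code added the same PDPage object several times, so the document ended up with a single page.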

    private static void cutAPDPage(int nbOfCrops, PDPage myPage, int resize) throws IOException{
    PDDocument pdfSplit = new PDDocument();
    ArrayList<PDPage> pages = new ArrayList<PDPage>();

    PDRectangle cropBox = myPage.findCropBox();
    PDRectangle newCropBox = new PDRectangle();
    newCropBox.setLowerLeftX(cropBox.getLowerLeftX());
    newCropBox.setLowerLeftY(cropBox.getLowerLeftY() - resize);
    newCropBox.setUpperRightX(cropBox.getUpperRightX());
    newCropBox.setUpperRightY(cropBox.getUpperRightY() + resize);
    myPage.setCropBox(newCropBox);


    PDRectangle mediaBox = myPage.findMediaBox();
    PDRectangle newMediaBox = new PDRectangle();
    newMediaBox.setLowerLeftX(mediaBox.getLowerLeftX());
    newMediaBox.setLowerLeftY(mediaBox.getLowerLeftY() - resize);
    newMediaBox.setUpperRightX(mediaBox.getUpperRightX());
    newMediaBox.setUpperRightY(mediaBox.getUpperRightY() + resize);
    myPage.setMediaBox(newMediaBox);

    float croppingHeight = (myPage.findCropBox().getUpperRightY()/nbOfCrops);

    for(int page = 0; page<nbOfCrops; page++){
        pages.add(new PDPage());
    }
    int splits = 0;
    for(PDPage page: pages){
        page = (PDPage) pdf.importPage(myPage);
        PDRectangle cropBox1 = page.findCropBox();
        PDRectangle rectangle = new PDRectangle();

        rectangle.setUpperRightY((float) (cropBox1.getUpperRightY() - (croppingHeight * (splits))));
        rectangle.setLowerLeftY((float) (cropBox1.getUpperRightY() - (croppingHeight*(splits + 1))));
        rectangle.setUpperRightX(cropBox1.getUpperRightX());
        rectangle.setLowerLeftX(cropBox1.getLowerLeftX());
        page.setCropBox(rectangle);

        pdfSplit.addPage(page);

        splits++;
    }
    try {
        pdfSplit.save("split.pdf");   
        pdfSplit.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Works like a charm !
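
The crop rectangles are simple arithmetic: the page is cut into horizontal bands of equal height, counted from the top. A minimal plain-Java sketch of that calculation follows. Unlike the code above, which divides getUpperRightY() by the number of crops and so assumes the box starts at y = 0, this version takes the lower edge into account as well. CropBands and bands are hypothetical names for this illustration.

```java
public class CropBands {

    /**
     * Splits the vertical range [lowerY, upperY] into n equal bands, top band first.
     * Each entry is {bandLowerY, bandUpperY}.
     */
    static float[][] bands(float lowerY, float upperY, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        float bandHeight = (upperY - lowerY) / n;
        float[][] result = new float[n][2];
        for (int i = 0; i < n; i++) {
            result[i][1] = upperY - bandHeight * i;        // upper edge of band i
            result[i][0] = upperY - bandHeight * (i + 1);  // lower edge of band i
        }
        return result;
    }

    public static void main(String[] args) {
        // a page whose crop box spans y = 0..792, cut into 4 bands
        for (float[] band : bands(0f, 792f, 4)) {
            System.out.println("lowerLeftY = " + band[0] + ", upperRightY = " + band[1]);
        }
    }
}
```

Each band's two values would be applied to a new rectangle via setLowerLeftY and setUpperRightY, with the X values copied from the original crop box.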

Question:

I get bytearrays of several pdfs from a backend source. I load all these bytearrays into PDDocuments and add them to a list, like this:

List<PDDocument> pdfs = new ArrayList<>();
for (...the amount of bytearrays...) {
    PDDocument pdf = PDDocument.load(bytearray);
    pdfs.add(pdf);
}

I then merge these pdfs into one PDDocument:

PDDocument mergedPdf = new PDDocument();
PDFMergerUtility PDFmerger = new PDFMergerUtility();
for(...all pdfs in list...) {
    PDFmerger.appendDocument(mergedPdf, pdf);
}

And then I save the mergedPdf to a file:

mergedPdf.save("c:\\temp\\mergeddoc.pdf");

My question is now: where do I call the close() method on these PDDocuments? Is this after loading them? But then that means I can't work any further with them, because I have closed the PDFs. Or is this only needed at the end, after I do the save?


Answer:

You're on the safest side if you call close() on the source documents after saving the destination document. There have been bugs in older PDFBox 2.0.* versions where the destination PDF still kept references to the source PDFs; usually these were tagged PDFs. The soon to be released (likely in March) version 2.0.14 has all of these bugs fixed, hopefully, and you can close the source PDF after calling appendDocument(). Obviously you can't call close() directly after loading because the document is needed for appendDocument().
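
Closing a whole list of source documents after the save is easiest with a small helper that keeps going when one close() fails. The sketch below is plain Java and works on any java.io.Closeable (PDDocument implements that interface); CloseAfterSave and closeAll are hypothetical names, and the main method uses stand-in resources instead of real documents.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CloseAfterSave {

    /** Closes every resource; rethrows the first failure only after all were attempted. */
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e; // remember it, but keep closing the rest
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) throws IOException {
        // stand-ins for the loaded source documents
        List<Closeable> sources = new ArrayList<Closeable>();
        for (int i = 1; i <= 3; i++) {
            final int number = i;
            sources.add(new Closeable() {
                public void close() {
                    System.out.println("closed source " + number);
                }
            });
        }
        // ... merge and save the destination document here, then:
        closeAll(sources);
    }
}
```

In the question's code this would amount to calling mergedPdf.save(...), then mergedPdf.close(), and finally closeAll(pdfs), ideally from a finally block so the sources are released even if the save fails.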