Hot questions on merging PDFs with PDFBox

Question:

Input: a list of (e.g. 14) PDF/A-1b files with embedded fonts. Processing: a simple merge with Apache PDFBox. Result: one PDF/A-1b file with a file size that is too large (almost the sum of the sizes of all the source files).

Question: Is there a way to reduce the file size of the resulting PDF? Idea: remove redundant embedded fonts. But how? And is that the right approach?

Unfortunately the following code does not do the job, but it highlights the obvious problem.

try (PDDocument document = PDDocument.load(new File("E:/tmp/16189_ZU_20181121195111_5544_2008-12-31_Standardauswertung.pdf"))) {
    List<COSName> collectedFonts = new ArrayList<>();
    PDPageTree pages = document.getDocumentCatalog().getPages();
    int pageNr = 0;
    for (PDPage page : pages) {
        pageNr++;
        Iterable<COSName> names = page.getResources().getFontNames();
        System.out.println("Page " + pageNr);
        for (COSName name : names) {
            collectedFonts.add(name);
            System.out.print("\t" + name + " - ");
            PDFont font = page.getResources().getFont(name);
            System.out.println(font + ", embedded: " + font.isEmbedded());
            page.getCOSObject().removeItem(COSName.F);
            page.getResources().getCOSObject().removeItem(name);
        }
    }
    document.save("E:/tmp/output.pdf");
}

The code produces output like this:

Page 1
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 2
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 3
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 4
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 5
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 6
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 7
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 8
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 9
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 10
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 11
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 12
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 13
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 14
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true

Any help appreciated ...


Answer:

The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely similar objects, e.g. it will not combine multiple subsets of the same font into a single union subset.

In the course of the comments to the question it became clear that the duplicate fonts in the OP's PDF indeed were identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge the duplicates.

As an actual pairwise comparison of all complex objects of a document can take too much time for large documents, the following code calculates a hash of these objects and only compares objects with identical hashes.

To merge duplicates, the code selects one of them and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this efficiently, the code initially collects not only all complex objects but also all references to each of them.

The optimization code

This is the method to call to optimize a PDDocument:

public void optimize(PDDocument pdDocument) throws IOException {
    Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
    for (int pass = 0; ; pass++) {
        int merges = mergeDuplicates(complexObjects);
        if (merges <= 0) {
            System.out.printf("Pass %d - No merged objects\n\n", pass);
            break;
        }
        System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
    }
}

(OptimizeAfterMerge method under test)

The optimization takes multiple passes as the equality of some objects can only be recognized after duplicates they reference have been merged.

The following helper methods and classes collect the complex objects of a PDF and the references to each of them:

Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
    COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
    Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
    incomingReferences.put(catalogDictionary, new ArrayList<>());

    Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
    Set<COSBase> thisPass = new HashSet<>();
    while(!lastPass.isEmpty()) {
        for (COSBase object : lastPass) {
            if (object instanceof COSArray) {
                COSArray array = (COSArray) object;
                for (int i = 0; i < array.size(); i++) {
                    addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
                }
            } else if (object instanceof COSDictionary) {
                COSDictionary dictionary = (COSDictionary) object;
                for (COSName key : dictionary.keySet()) {
                    addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
                }
            }
        }
        lastPass = thisPass;
        thisPass = new HashSet<>();
    }
    return incomingReferences;
}

void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
    COSBase object = reference.getTo();
    if (object instanceof COSArray || object instanceof COSDictionary) {
        Collection<Reference> incoming = incomingReferences.get(object);
        if (incoming == null) {
            incoming = new ArrayList<>();
            incomingReferences.put(object, incoming);
            thisPass.add(object);
        }
        incoming.add(reference);
    }
}

(OptimizeAfterMerge helper methods findComplexObjects and addTarget)

interface Reference {
    public COSBase getFrom();

    public COSBase getTo();
    public void setTo(COSBase to);
}

static class ArrayReference implements Reference {
    public ArrayReference(COSArray array, int index) {
        this.from = array;
        this.index = index;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.get(index));
    }

    @Override
    public void setTo(COSBase to) {
        from.set(index, to);
    }

    final COSArray from;
    final int index;
}

static class DictionaryReference implements Reference {
    public DictionaryReference(COSDictionary dictionary, COSName key) {
        this.from = dictionary;
        this.key = key;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.getDictionaryObject(key));
    }

    @Override
    public void setTo(COSBase to) {
        from.setItem(key, to);
    }

    final COSDictionary from;
    final COSName key;
}

(OptimizeAfterMerge helper interface Reference with implementations ArrayReference and DictionaryReference)

And the following helper methods and classes finally identify and merge duplicates:

int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
    List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
    for (COSBase object : complexObjects.keySet()) {
        hashes.add(new HashOfCOSBase(object));
    }
    Collections.sort(hashes);

    int removedDuplicates = 0;
    if (!hashes.isEmpty()) {
        int runStart = 0;
        int runHash = hashes.get(0).hash;
        for (int i = 1; i < hashes.size(); i++) {
            int hash = hashes.get(i).hash;
            if (hash != runHash) {
                int runSize = i - runStart;
                if (runSize != 1) {
                    System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                    removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
                }
                runHash = hash;
                runStart = i;
            }
        }
        int runSize = hashes.size() - runStart;
        if (runSize != 1) {
            System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
            removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
        }
    }
    return removedDuplicates;
}

int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
    int removedDuplicates = 0;

    List<List<COSBase>> duplicateSets = new ArrayList<>();
    for (HashOfCOSBase entry : run) {
        COSBase element = entry.object;
        for (List<COSBase> duplicateSet : duplicateSets) {
            if (equals(element, duplicateSet.get(0))) {
                duplicateSet.add(element);
                element = null;
                break;
            }
        }
        if (element != null) {
            List<COSBase> duplicateSet = new ArrayList<>();
            duplicateSet.add(element);
            duplicateSets.add(duplicateSet);
        }
    }

    System.out.printf("Identified %d set(s) of identical objects in run.\n", duplicateSets.size());

    for (List<COSBase> duplicateSet : duplicateSets) {
        if (duplicateSet.size() > 1) {
            COSBase surviver = duplicateSet.remove(0);
            Collection<Reference> surviverReferences = complexObjects.get(surviver);
            for (COSBase object : duplicateSet) {
                Collection<Reference> references = complexObjects.get(object);
                for (Reference reference : references) {
                    reference.setTo(surviver);
                    surviverReferences.add(reference);
                }
                complexObjects.remove(object);
                removedDuplicates++;
            }
            surviver.setDirect(false);
        }
    }

    return removedDuplicates;
}

boolean equals(COSBase a, COSBase b) {
    if (a instanceof COSArray) {
        if (b instanceof COSArray) {
            COSArray aArray = (COSArray) a;
            COSArray bArray = (COSArray) b;
            if (aArray.size() == bArray.size()) {
                for (int i=0; i < aArray.size(); i++) {
                    if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
                        return false;
                }
                return true;
            }
        }
    } else if (a instanceof COSDictionary) {
        if (b instanceof COSDictionary) {
            COSDictionary aDict = (COSDictionary) a;
            COSDictionary bDict = (COSDictionary) b;
            Set<COSName> keys = aDict.keySet();
            if (keys.equals(bDict.keySet())) {
                for (COSName key : keys) {
                    if (!resolve(aDict.getItem(key)).equals(bDict.getItem(key)))
                        return false;
                }
                // In case of COSStreams we strictly speaking should
                // also compare the stream contents here. But apparently
                // their hashes coincide well enough for the original
                // hashing equality, so let's just assume...
                return true;
            }
        }
    }
    return false;
}

static COSBase resolve(COSBase object) {
    while (object instanceof COSObject)
        object = ((COSObject)object).getObject();
    return object;
}

(OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)

static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
    public HashOfCOSBase(COSBase object) throws IOException {
        this.object = object;
        this.hash = calculateHash(object);
    }

    int calculateHash(COSBase object) throws IOException {
        if (object instanceof COSArray) {
            int result = 1;
            for (COSBase member : (COSArray)object)
                result = 31 * result + member.hashCode();
            return result;
        } else if (object instanceof COSDictionary) {
            int result = 3;
            for (Map.Entry<COSName, COSBase> entry : ((COSDictionary)object).entrySet())
                result += entry.hashCode();
            if (object instanceof COSStream) {
            try (InputStream data = ((COSStream) object).createRawInputStream()) {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    byte[] buffer = new byte[8192];
                    int bytesRead = 0;
                    while((bytesRead = data.read(buffer)) >= 0)
                        md.update(buffer, 0, bytesRead);
                    result = 31 * result + Arrays.hashCode(md.digest());
                } catch (NoSuchAlgorithmException e) {
                    throw new IOException(e);
                }
            }
            return result;
        } else {
            throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
        }
    }

    final COSBase object;
    final int hash;

    @Override
    public int compareTo(HashOfCOSBase o) {
        int result = Integer.compare(hash,  o.hash);
        if (result == 0)
            result = Integer.compare(hashCode(), o.hashCode());
        return result;
    }
}

(OptimizeAfterMerge helper class HashOfCOSBase)

Applying the code to the OP's example document

The OP's example document is about 6.5 MB in size. Applying the above code like this

try (PDDocument pdDocument = PDDocument.load(SOURCE)) {
    optimize(pdDocument);
    pdDocument.save(RESULT);
}

results in a PDF less than 700 KB in size, and it appears to be complete.

(If something's missing, please tell, I'll try and fix that.)

Words of warning

On the one hand, this optimizer will not recognize all identical duplicates. In particular, in the case of circular references, duplicate cycles of objects won't be recognized, because the code only recognizes duplicates whose contents are already identical, which usually is not the case within duplicate object cycles.

On the other hand, this optimizer might already be overly eager in some cases, because some duplicates might be needed as separate objects for PDF viewers to accept each instance as an individual entity.

Furthermore, this program touches all kinds of objects in the file, even those defining the inner structures of the PDF, but it does not attempt to update any of the PDFBox classes managing those structures (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To prevent pending changes from corrupting the whole document, please apply this program only to freshly loaded, unmodified PDDocument instances, and save the result soon afterwards without further modifications.

Question:

I'm using PdfBox for Android in order to append data to a PDF file.

The data to append

public byte [] prerparePdfToAppend() {

    final PDDocument document = new PDDocument();
    final PDPage sourcePage = new PDPage();
    document.addPage(sourcePage);

    PDPageContentStream contentStream = new PDPageContentStream(document, sourcePage);
    contentStream.beginText();
    contentStream.setFont(PDType1Font.COURIER, 12);
    contentStream.showText("Name: " + firstName + " " + lastName);
    contentStream.newLine();
    ...
    contentStream.endText();
    contentStream.close();

    output = new ByteArrayOutputStream();
    document.save(output);
    document.close();
    byte [] bytesToAppend = new byte[output.size()];
    output.write(bytes);
    output.close();

    return bytesToAppend;
}

Merge Code (simplified)

public void merge (String assetFileName) {
    byte [] toAppendPdf = prerparePdfToAppend();
    PDFMergerUtility mergerUtility = new PDFMergerUtility();
    mergerUtility.addSource(PDFBoxResourceLoader.getStream(assetFileName));
    mergerUtility.addSource(new ByteArrayInputStream(toAppendPdf));
    mergerUtility.setDestinationStream(destStream);
    mergerUtility.mergeDocuments(); //IOException
}

The Exception

java.io.IOException: Error: End-of-File, expected line
   at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1419)
   at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:1648)
   at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1627)
   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:348)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:888)
   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:802)
   at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:172)

Answer:

The last lines of the prerparePdfToAppend method look weird to me. But why make your life complicated? Return a PDDocument instead:

public PDDocument prerparePdfToAppend()
{
    final PDDocument document = new PDDocument();
    final PDPage sourcePage = new PDPage();
    document.addPage(sourcePage);

    PDPageContentStream contentStream = new PDPageContentStream(document, sourcePage);
    contentStream.beginText();
    contentStream.setFont(PDType1Font.COURIER, 12);
    contentStream.showText("Name: " + firstName + " " + lastName);
    contentStream.newLine();
    ...
    contentStream.endText();
    contentStream.close();

    return document;
}

Your merge code would then look like this:

public void merge (String assetFileName)
{
    PDFMergerUtility mergerUtility = new PDFMergerUtility();
    PDDocument srcDoc = PDDocument.load(PDFBoxResourceLoader.getStream(assetFileName));
    PDDocument dstDoc = prerparePdfToAppend();
    mergerUtility.appendDocument(dstDoc, srcDoc);
    dstDoc.save(destStream);
    srcDoc.close();
    dstDoc.close();
}

If this doesn't work, make sure that

PDFBoxResourceLoader.getStream(assetFileName)

is really the stream of a real PDF. If it still doesn't work, mention which line of this new code produces the exception. And of course, make sure you're using the latest version of PDFBox.
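Independent of PDFBox, a quick way to rule out garbage input is to peek at the stream's first bytes: every PDF starts with the %PDF- marker, which is exactly what the parser failed to find in the "Error: End-of-File, expected line" stack trace above. A minimal sketch (the helper name is made up for illustration; note that it consumes the start of the stream, so re-open the stream, or wrap it in a BufferedInputStream and use mark/reset, before handing it to PDFBox):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): check whether a stream can
// possibly be a PDF by looking for the mandatory "%PDF-" header marker.
public class PdfHeaderCheck {

    public static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] header = new byte[5];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until
        // we have 5 bytes or the stream ends
        while (total < header.length) {
            int read = in.read(header, total, header.length - total);
            if (read < 0) {
                return false; // stream ended before a full header
            }
            total += read;
        }
        return "%PDF-".equals(new String(header, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(looksLikePdf(new ByteArrayInputStream(
                "%PDF-1.7\n...".getBytes(StandardCharsets.US_ASCII))));
        System.out.println(looksLikePdf(new ByteArrayInputStream(
                "<html>".getBytes(StandardCharsets.US_ASCII))));
    }
}
```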

Question:

I have a problem with extracting text from a PDF using PDFTextStripper from PDFBox 2.0.13. To be more specific: lines that are too close to each other are merged together. For example:

On the first line there is the text "signfieldbig"; the second line contains underscores, but PDFTextStripper parsed both as "s_i_g_n_fi_e_ld_b_ig_ _______" (it merged the two lines into one). I tried multiple settings (different lineSeparator values, thresholds, etc.) but nothing helped. The two lines were merged every time, and I cannot simply remove all unnecessary characters from the text, because I am looking for the position of this placeholder to create a signature field.

UPDATE: I just realized what causes this problem: in the original file there are not two normal lines separated by a line separator, but one line with underscores and a manually placed text area with the text "placeholder" above it. Still, a PDF viewer (viewing it as text) or another PDF library (iText 2.x) parses it as two separate lines...


Answer:

There are different strategies for text extraction: one can either take the text chunks as they come and only add a new line (or something similar) when the next chunk's coordinates are not right after the previous one, or one can collect all chunks, sort them by coordinates, and extract the text from these sorted chunks.

(Obviously both strategy types can be combined with a certain degree of analysis of text layout.)

In your case sorting is active, causing the underscores and the text above them to be joined as "s_i_g_n_fi_e_ld_b_ig_ _______".

You can disable sorting in the PDFBox text stripper using setSortByPosition(false).


There is no universal best approach; depending on the document in question, one or the other might be better.
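The effect of sorting can be illustrated without PDFBox at all. The following toy model (not PDFBox's actual algorithm; the coordinates and the 3-unit tolerance are invented for the example) shows how sorting chunks by position interleaves two lines whose baselines are only 1.5 units apart, while content-stream order keeps them separate:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of the two extraction strategies: stream order vs. sorting
// text chunks by (line bucket, x). With sorting, two lines whose
// baselines nearly coincide end up interleaved on one output line.
public class SortedExtractionDemo {

    static final class Chunk {
        final float x, y;
        final String text;
        Chunk(float x, float y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    // Content-stream order: a line of text at y=100, then a row of
    // underscores drawn at y=98.5, only 1.5 units below it.
    static final List<Chunk> CHUNKS = List.of(
            new Chunk(10, 100f, "sign"), new Chunk(30, 100f, "field"), new Chunk(50, 100f, "big"),
            new Chunk(12, 98.5f, "__"), new Chunk(32, 98.5f, "__"), new Chunk(52, 98.5f, "__"));

    static String extract(List<Chunk> chunks, boolean sortByPosition) {
        List<Chunk> work = new ArrayList<>(chunks);
        if (sortByPosition) {
            // bucket y by a 3-unit tolerance so near-equal baselines count
            // as one line, then order left to right: this is what
            // interleaves the text with the underscores
            work.sort(Comparator.comparingInt((Chunk c) -> Math.round(c.y / 3f))
                                .thenComparingDouble(c -> c.x));
        }
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : work) {
            if (prev != null) {
                boolean sameLine = sortByPosition
                        ? Math.round(c.y / 3f) == Math.round(prev.y / 3f)
                        : c.y == prev.y;
                if (!sameLine) out.append('\n');
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(CHUNKS, true));   // one interleaved line
        System.out.println(extract(CHUNKS, false));  // two separate lines
    }
}
```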

Question:

I have been attempting to merge multiple smaller PDFs (6 MB is the largest I have used thus far) into a single PDF. Any time I attempt to use more than 14 MB of input, I get an Out of Memory error.

When merging, the memory usage of the process jumps to over 550 MB. That seems excessive for 14 MB of input.

I am running this application locally on an IBM WebSphere Application Server, using PDFBox version 1.8.5.

I have increased the heap size to 1024MB, and while this allowed me to use a few more files for input, I quickly run up against the same issue.

At the suggestion of a commenter, I have changed the methodology to merge pairs of documents together and then further merge the previously merged pairs. This allowed me to get further than before. I still get an Out of Memory error with around 30 MB of files, but it is much more workable.
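The pairwise methodology described above can be sketched schematically; the generic combine function below stands in for the actual PDFMergerUtility call (with real files it would merge two documents into a temporary file and return its path). Each round halves the list, so a single merge step only ever holds two inputs at a time:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Schematic sketch of the pairwise merge schedule: repeatedly combine
// neighbouring pairs until one result remains.
public class PairwiseMerge {

    public static <T> T mergePairwise(List<T> items, BinaryOperator<T> combine) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        List<T> round = new ArrayList<>(items);
        while (round.size() > 1) {
            List<T> next = new ArrayList<>();
            for (int i = 0; i < round.size(); i += 2) {
                if (i + 1 < round.size()) {
                    next.add(combine.apply(round.get(i), round.get(i + 1)));
                } else {
                    next.add(round.get(i)); // odd element carries into the next round
                }
            }
            round = next;
        }
        return round.get(0);
    }

    public static void main(String[] args) {
        // stand-in for merging 5 PDFs; rounds: [ab, cd, e] -> [abcd, e] -> [abcde]
        System.out.println(mergePairwise(List.of("a", "b", "c", "d", "e"), String::concat));
    }
}
```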

File sourceLoc = new File(System.getProperty("java.io.tmpdir") + "source_files");
File scratch = new File(System.getProperty("java.io.tmpdir") + "scratch.txt");
PDFMergerUtility merger = new PDFMergerUtility();

merger.setDestinationFileName(System.getProperty("java.io.tmpdir") + "merged.pdf");
for(File file : sourceLoc.listFiles())
    merger.addSource(file);
merger.mergeDocumentsNonSeq(new org.apache.pdfbox.io.RandomAccessFile(scratch, "rw"));

This is the log generated:

JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at      2014/08/01 13:01:50 - please wait.
JVMDUMP032I JVM requested System dump using 'C:\Working\IntranetApps\I-Document\Services\core.20140801.130150.2408.0001.dmp' in response to an event
JVMDUMP010I System dump written to C:\Working\IntranetApps\I-Document\Services\core.20140801.130150.2408.0001.dmp
JVMDUMP032I JVM requested Heap dump using 'C:\Working\IntranetApps\I-Document\Services\heapdump.20140801.130150.2408.0002.phd' in response to an event
JVMDUMP010I Heap dump written to C:\Working\IntranetApps\I-Document\Services\heapdump.20140801.130150.2408.0002.phd
JVMDUMP032I JVM requested Java dump using 'C:\Working\IntranetApps\I-Document\Services\javacore.20140801.130150.2408.0003.txt' in response to an event
JVMDUMP010I Java dump written to C:\Working\IntranetApps\I-Document\Services\javacore.20140801.130150.2408.0003.txt
JVMDUMP032I JVM requested Snap dump using 'C:\Working\IntranetApps\I-Document\Services\Snap.20140801.130150.2408.0004.trc' in response to an event
JVMDUMP010I Snap dump written to C:\Working\IntranetApps\I-Document\Services\Snap.20140801.130150.2408.0004.trc
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.pdfbox.io.RandomAccessBuffer.clone(RandomAccessBuffer.java:69)
    at org.apache.pdfbox.cos.COSStream.clone(COSStream.java:72)
    at org.apache.pdfbox.cos.COSStream.<init>(COSStream.java:96)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1513)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1266)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1192)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1166)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:479)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:740)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1306)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1289)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:232)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocumentsNonSeq(PDFMergerUtility.java:201)
    at com.my.pkg.MyMergeClass.main(MyMergeClass.java:90)

Answer:

PDF content streams are written in a PostScript-like language of their own... so 14 MB of input can expand to anything from almost nothing to an enormous amount of data in memory. Your best bet is to figure out how to handle running out of memory gracefully.

Question:

I'm merging two PDFs with digital signatures using PDFBox. However, the result is a pdf with the signatures in the document, but when opened in Acrobat Reader it shows that there aren't valid signatures as shown:

How can I keep the signatures?


Answer:

Signatures are used to certify that a PDF file is an original, unmodified file.

When you merge files, you modify them, so they are no longer originals, and hence the signatures are invalid, as they should be.

You should remove the signatures to prevent the warning and leave the merged PDF unsigned.

For an overview of changes allowed to a signed document see this answer.

Question:

My JavaFX application downloads PDFs from the server, rotates each PDF to portrait if it is landscape, and then merges all the PDF files into one single PDF file to print.

Everything goes fine except that the program randomly gets stuck while outputting the merged PDF or while adding one of the PDF files to the PDFMergerUtility (I am using PDFBox 2.0.11 and have also tried 2.0.9). Because my application requires a ProgressBar and a TextArea to show the current action and status, I use a Task in my controller page. When the program hangs, it doesn't throw any exception or print any message, but completely stops the background action. I have tried both small batches (<50 files) and large ones (>1000 files), but the result is the same: either everything runs normally or it randomly hangs.

Below is the code of my controller program:

public class ReadDataPageController implements Initializable {
    public long startTime;
    public long stopTime;
    @FXML
    private Button btnNext, btnCancel, btnPrevious;
    @FXML
    private Label infoLabel, time, total;
    @FXML
    private ProgressBar progBar;
    @FXML
    private TextArea textArea;

    public Task<String> dlTask() {
        return new Task<String>() {

            @Override
            protected String call() throws Exception {
                DownloadUtil dlutil = new DownloadUtil();
                StringBuilder textStr = new StringBuilder();
                List<String> dlList = mainApp.DL_LIST;

                // Download PDF files from FTP
                super.updateValue(textStr.append("Preparing files for download...\n").toString());
                for (int count = 0; count < dlList.size(); count++) {
                    String PDFLink = dlList.get(count).getPDFLink();
                    super.updateTitle("Downloading file" + PDFLink + " ...");
                    super.updateValue(textStr.append("Got " + PDFLink + "\n").toString());

                    try {
                        dlutil.exec(PDFLink);
                        // downloaded location will be stored inside List DownloadUtil.pdfList
                    } catch (IndexOutOfBoundsException ex) {
                        super.updateValue(textStr.append("Link not found for " + PDFLink + "\n").toString());
                    } catch (Exception ex) {
                        super.updateValue(textStr.append("Error while downloading " + PDFLink + " :" + ex.getMessage() + "\n").toString());
                    }
                    super.updateProgress(count + 1, dlList.size() * 3);
                }
                super.updateProgress(dlList.size(), dlList.size() * 3);
                super.updateTitle("Download action has finished.");
                super.updateValue(textStr.append("Download action has finished.\n").toString());

                // Rotate downloaded PDFs
                super.updateTitle("Preparing files for PDF rotation...");
                super.updateValue(textStr.append("Preparing files for PDF rotation...\n").toString());
                for (int i = 0; i < dlutil.pdfList.size(); i++) {
                    try {
                        String fileName = dlutil.pdfList.get(i);
                        rotatePDF(new File(fileName));
                        super.updateValue(textStr.append("Rotating PDF ("+(i+1)+" of "+dlutil.pdfList.size()+")...\n").toString());
                    } catch (Exception ex) {
                        super.updateValue(textStr.append("Error:" + ex.getMessage() + "...\n").toString());
                        ex.printStackTrace();
                    }
                    super.updateProgress(dlutil.pdfList.size() + i + 1, dlutil.pdfList.size() * 3);
                }

                if (PRINT_OPTION == PrintType.PRINT) {
                    // Merge downloaded PDFs
                    super.updateValue(textStr.append("Preparing files for PDF merging action...\n").toString());
                    PDFMergerUtility pdfutil = new PDFMergerUtility();
                    for (int i = 0; i < dlutil.pdfList.size(); i++) {
                        try {
                            String fileName = dlutil.pdfList.get(i);
                            pdfutil.addSource(fileName);
                            super.updateTitle("Adding files (" + (i + 1) + "/" + dlutil.pdfList.size() + ")");
                        } catch (Exception ex) {
                            super.updateValue(textStr.append("Error:" + ex.getMessage() + "...\n").toString());
                            ex.printStackTrace();
                        }
                        super.updateProgress(dlutil.pdfList.size()*2 + i + 1, dlutil.pdfList.size() * 3);
                    }
                    // Output merged pdf
                    try {
                        pdfutil.setDestinationFileName("../odt/merge.pdf");
                        pdfutil.mergeDocuments();
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    }
                    super.updateTitle("Merged all PDFs.");
                }

                super.updateProgress(100, 100);
                super.updateTitle("All action has been finished.");
                super.updateValue(textStr.append("All action has been finished, press Next to choose your printing option.\n").toString());
                return textStr.toString();
            }
        };
    }

    /**
     * Rotates the PDF page content 90 degrees if the page is landscape
     * @param resource the PDF file path
     * @throws InvalidPasswordException
     * @throws IOException
     */
    public void rotatePDF(File resource) throws InvalidPasswordException, IOException {
        try {
            PDDocument document = PDDocument.load(resource);
            int pageCount = document.getNumberOfPages();
            System.out.println("Reading file: "+resource+", total page="+pageCount);
            for (int i = 0; i < pageCount; i++) {
                PDPage page = document.getDocumentCatalog().getPages().get(i);
                PDPageContentStream cs = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.PREPEND,
                        false, false);
                Matrix matrix = Matrix.getRotateInstance(Math.toRadians(90), 0, 0);
                cs.transform(matrix);
                cs.close();

                PDRectangle cropBox = page.getCropBox();
                if (cropBox.getWidth() > cropBox.getHeight()) {
                    System.out.println("ROTATE "+i+"th");
                    Rectangle rectangle = cropBox.transform(matrix).getBounds();
                    PDRectangle newBox = new PDRectangle((float) rectangle.getX(), (float) rectangle.getY(),
                            (float) rectangle.getWidth(), (float) rectangle.getHeight());
                    page.setCropBox(newBox);
                    page.setMediaBox(newBox);
                    document.save(resource);
                }
            }
            document.close();
        } catch (Exception ex) {
            ex.printStackTrace();
            throw ex;
        }
    }
}

Is there anything that could make PDFMergerUtility unstable, perhaps because I run it inside a JavaFX Task, or because I missed something crucial?


Answer:

Bingo! The exception was an OutOfMemoryError, and the JavaFX Task silently swallowed it.

I added the following code when initializing the task, so that failures are now reported:

task.setOnFailed(new EventHandler<WorkerStateEvent>(){
    @Override
    public void handle(WorkerStateEvent event) {
        Throwable th = task.getException();
        System.out.println("Error on Task:"+th.getMessage());
        th.printStackTrace();
    }
});

To avoid the OutOfMemoryError, I split the work into chunks of 100 pages per merge job and saved the result as multiple merged PDF files.
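The chunking step can be sketched as a small helper that partitions the list of file names; each chunk would then be fed to its own PDFMergerUtility instance. The class and method names here are my own invention. (As an alternative in PDFBox 2.x, PDFMergerUtility.mergeDocuments(MemoryUsageSetting.setupTempFileOnly()) keeps the merge state on disk instead of in memory.)

```java
import java.util.ArrayList;
import java.util.List;

public class MergeChunker {
    /**
     * Splits a list of PDF file names into chunks of at most chunkSize entries.
     * Each chunk can be merged by its own PDFMergerUtility instance,
     * keeping the memory footprint of any single merge bounded.
     */
    public static List<List<String>> chunk(List<String> files, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < files.size(); i += chunkSize) {
            chunks.add(files.subList(i, Math.min(i + chunkSize, files.size())));
        }
        return chunks;
    }
}
```

For each returned chunk you would call addSource for every file name and then mergeDocuments once, writing to a per-chunk destination file.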

Question:

I compare two PDF files and mark highlights on them. When I use PDFBox to merge them for comparison, the highlights are missing.

I am using the following function, which merges all pages of the two PDF files side by side:

public void generateSideBySidePDF() {
    File pdf1File = new File(FILE1_PATH);
    File pdf2File = new File(FILE2_PATH);
    File outPdfFile = new File(OUTFILE_PATH);
    PDDocument pdf1 = null;
    PDDocument pdf2 = null;
    PDDocument outPdf = null;
    try {

        pdf1 = PDDocument.load(pdf1File);
        pdf2 = PDDocument.load(pdf2File);

        outPdf = new PDDocument();
        for(int pageNum = 0; pageNum < pdf1.getNumberOfPages(); pageNum++) {
            // Create output PDF frame
            PDRectangle pdf1Frame = pdf1.getPage(pageNum).getCropBox();
            PDRectangle pdf2Frame = pdf2.getPage(pageNum).getCropBox();
            PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth()+pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));

            // Create output page with calculated frame and add it to the document
            COSDictionary dict = new COSDictionary();
            dict.setItem(COSName.TYPE, COSName.PAGE);
            dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
            dict.setItem(COSName.CROP_BOX, outPdfFrame);
            dict.setItem(COSName.ART_BOX, outPdfFrame);
            PDPage outPdfPage = new PDPage(dict);
            outPdf.addPage(outPdfPage);

            // Source PDF pages has to be imported as form XObjects to be able to insert them at a specific point in the output page
            LayerUtility layerUtility = new LayerUtility(outPdf);
            PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdf1, pageNum);
            PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdf2, pageNum);

            // Add form objects to output page
            AffineTransform afLeft = new AffineTransform();
            layerUtility.appendFormAsLayer(outPdfPage, formPdf1, afLeft, "left" + pageNum);
            AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
            layerUtility.appendFormAsLayer(outPdfPage, formPdf2, afRight, "right" + pageNum);
        }
        outPdf.save(outPdfFile);
        outPdf.close();

    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (pdf1 != null) pdf1.close();
            if (pdf2 != null) pdf2.close();
            if (outPdf != null) outPdf.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Answer:

Insert the following into your code after the "Source PDF pages has to be imported" comment to copy the annotations. The annotations of the right PDF must have their rectangles moved.

// copy annotations
PDPage src1Page = pdf1.getPage(pageNum);
PDPage src2Page = pdf2.getPage(pageNum);
for (PDAnnotation ann : src1Page.getAnnotations())
{
    outPdfPage.getAnnotations().add(ann);                
}
for (PDAnnotation ann : src2Page.getAnnotations())
{
    PDRectangle rect = ann.getRectangle();
    ann.setRectangle(new PDRectangle(rect.getLowerLeftX() + pdf1Frame.getWidth(), rect.getLowerLeftY(), rect.getWidth(), rect.getHeight()));
    outPdfPage.getAnnotations().add(ann);                
}

Note that this code has a flaw: it only works for annotations WITH an appearance stream (most have one). For those that don't, it will have weird effects; in that case one would have to adjust additional coordinates depending on the annotation type. For highlights that would be the quad points, for lines the line coordinates, and so on.
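For highlight annotations without an appearance stream, the quad points would need the same horizontal shift as the rectangle. A minimal sketch of such a helper (the class and method names are my own); the resulting array would be applied via PDAnnotationTextMarkup.setQuadPoints:

```java
public class QuadPointShifter {
    /**
     * Quad points are stored as a flat array of (x, y) coordinate pairs.
     * Shifts every x coordinate by dx and leaves the y coordinates unchanged,
     * mirroring the horizontal offset applied to the annotation rectangle.
     */
    public static float[] shiftQuadPoints(float[] quadPoints, float dx) {
        float[] shifted = new float[quadPoints.length];
        for (int i = 0; i < quadPoints.length; i += 2) {
            shifted[i] = quadPoints[i] + dx;    // x coordinate
            shifted[i + 1] = quadPoints[i + 1]; // y coordinate
        }
        return shifted;
    }
}
```

In the loop over the right page's annotations, a markup annotation would get `markup.setQuadPoints(QuadPointShifter.shiftQuadPoints(markup.getQuadPoints(), pdf1Frame.getWidth()))` in addition to the rectangle move.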

Question:

I'm attempting to use the Apache PDFMergerUtility, following one of the many online examples. My problem is that I get a NoClassDefFoundError while creating an instance: PDFMergerUtility pdfMerger = new PDFMergerUtility();

Here's my import: import org.apache.pdfbox.multipdf.PDFMergerUtility;

I downloaded the jar file pdfbox-2.0.15.jar and put it in my NetBeans project libraries.

What might be the cause?

I've read the Apache documentation, and looked at numerous examples that show the exact code.


Answer:

You also need fontbox and commons-logging, as described on the PDFBox dependencies page. Alternatively, you can use pdfbox-app, which bundles everything needed for merging.
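With Maven, the required jars are pulled in transitively, so a single dependency block along these lines would suffice (the version should match the jar you downloaded):

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.15</version>
</dependency>
```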

Question:

Can I merge elements from two PDF to a final PDF with PDFBox (or other library)?

I'm not looking for page concatenation but merging page elements:


Answer:

The task of the OP is to merge two pages into one, keeping each object at its present location on the page.

For doing this PDFBox provides the Overlay class. Given two PDDocument instances document1 and document2, you can simply do

Overlay overlay = new Overlay();
overlay.setOverlayPosition(Overlay.Position.FOREGROUND);
overlay.setInputPDF(document1);
overlay.setAllPagesOverlayPDF(document2);

Map<Integer, String> ovmap = new HashMap<Integer, String>();            
overlay.overlay(ovmap);

document1.save("overlaid.pdf"); // save the overlaid result to your output path

overlay.close();

to overlay the second over the first document.

But the Overlay class can be used for more complex overlaying tasks. In particular, it allows you to set specific PDFs with which to overlay only odd pages, only even pages, or only explicitly specified pages.

As an example have a look at the source of the PDFBox tool OverlayPDF.
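For page-specific overlays, the map passed to overlay(...) maps 1-based page numbers to the file name of the PDF to overlay onto that page; pages absent from the map are left alone. A sketch with made-up file names:

```java
import java.util.HashMap;
import java.util.Map;

public class OverlayMapExample {
    /**
     * Builds the page-specific overlay map expected by Overlay.overlay(...):
     * 1-based page numbers mapped to the PDF file overlaid onto that page.
     */
    public static Map<Integer, String> buildOverlayMap() {
        Map<Integer, String> ovmap = new HashMap<>();
        ovmap.put(1, "stamp-first-page.pdf"); // hypothetical file names
        ovmap.put(5, "stamp-page-five.pdf");
        return ovmap;
    }
}
```

Passing this map instead of the empty one in the snippet above would stamp only pages 1 and 5.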


A word of warning, though: Only the page content of the extra documents is used for overlaying, all kinds of annotations are ignored. Also don't expect tags to be copied.
