Hot questions for Using PDFBox in adobe

Question:

I am trying to learn to use Apache PDFBox to deal with digitally signed documents for work. During testing, I created a completely empty PDF document.

I then signed the document in Adobe Reader using the "sign with certificate" function.

I tried to open, save and close the signed file with PDFBox without making any modifications. However, when I open the resulting file in Adobe Reader, the signature is no longer valid.

Adobe tells me: "There are errors in the formatting or information contained in this signature (support information: SigDict/Contents illegal data)"

Since I have not modified the content of the file, I expected the signature to remain valid. It does not, and I don't know how to fix this (googling yielded no results).

How I create the document:

@Test
public void createEmptyPDF() throws IOException {
    String path = "path to file";
    PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage(page);
    document.save(path);
    document.close();
}

I then sign it with Adobe Reader and pass it through this:

@Test
public void copySignedDocument() throws IOException {
    String path = "path to file";
    File file = new File(path);
    PDDocument document = PDDocument.load(file);
    document.save(file);
    document.close();

    //just opening and saving the file invalidates the signatures
}

I am truly at a loss as to why this does not work. Any help would be great!

EDIT:

So I did some digging around, and it seems that updating an existing signed document (either adding annotations or filling forms) is not yet implemented in PDFBox 2.0.1 and is scheduled for version 2.1 (no release date has been specified yet). More information here and here.

However, according to this question, it seems possible to add annotations to signed documents with iText's PdfStamper without invalidating the signature.

EDIT 2: Code to add a stamp to a document and save it incrementally:

@Test
public void stampSignedDocument() throws IOException {
    File file = new File("path to file");
    PDDocument document = PDDocument.load(file);
    File image = new File("path to image to be added to annotation");
    PDPage page = document.getPage(0);
    List<PDAnnotation> annotations = page.getAnnotations();
    PDImageXObject ximage = PDImageXObject.createFromFileByContent(image, document);

    //stamp
    PDAnnotationRubberStamp stamp = new PDAnnotationRubberStamp();
    stamp.setName("testing rubber stamp");
    stamp.setContents("this is a test");
    stamp.setLocked(true);
    stamp.setReadOnly(true);
    stamp.setPrinted(true);

    PDRectangle rectangle = createRectangle(100, 100, 100, 100, 100, 100);
    PDFormXObject form = new PDFormXObject(document);
    form.setResources(new PDResources());
    form.setBBox(rectangle);
    form.setFormType(1);

    form.getResources().add(ximage);
    PDAppearanceStream appearanceStream = new PDAppearanceStream(form.getCOSObject());
    PDAppearanceDictionary appearance = new PDAppearanceDictionary(new COSDictionary());
    appearance.setNormalAppearance(appearanceStream);
    stamp.setAppearance(appearance);
    stamp.setRectangle(rectangle);
    PDPageContentStream stream = new PDPageContentStream(document, appearanceStream);
    Matrix matrix = new Matrix(100, 0, 0, 100, 100, 100);
    stream.drawImage(ximage, matrix);
    stream.close();
    //close and save   
    annotations.add(stamp);
    page.getCOSObject().setNeedToBeUpdated(true);
    OutputStream os = new FileOutputStream(file);
    document.saveIncremental(os);
    document.close();
    os.close();
}

The above code doesn't invalidate my signature, but it also doesn't save the annotation that I have added.

As suggested I've set the NeedToBeUpdated flag to true for the added annotation, page and annotations list (I hope I did the last one correctly):

    stamp.getCOSObject().setNeedToBeUpdated(true);
    COSArrayList<PDAnnotation> list = (COSArrayList<PDAnnotation>) annotations;
    COSArrayList.converterToCOSArray(list).setNeedToBeUpdated(true);
    page.getCOSObject().setNeedToBeUpdated(true);
    document.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);

The annotation is still not saved so I'm obviously missing something.

EDIT 3:

This is my current method to add an annotation:

@Test
public void stampSignedDocument() throws IOException {
    File file = new File(
            "E:/projects/eSign/g2digitalsignature/G2DigitalSignatureParent/G2DigitalSignatureTest/src/test/resources/pdfBoxTest/empty.pdf");
    PDDocument document = PDDocument.load(file);
    File image = new File(
            "E:/projects/eSign/g2digitalsignature/G2DigitalSignatureParent/G2DigitalSignatureTest/src/test/resources/pdfBoxTest/digitalSign.png");
    PDPage page = document.getPage(0);
    List<PDAnnotation> annotations = page.getAnnotations();
    PDImageXObject ximage = PDImageXObject.createFromFileByContent(image, document);

    //stamp
    PDAnnotationRubberStamp stamp = new PDAnnotationRubberStamp();
    stamp.setName("testing rubber stamp");
    stamp.setContents("this is a test");
    stamp.setLocked(true);
    stamp.setReadOnly(true);
    stamp.setPrinted(true);

    PDRectangle rectangle = createRectangle(100, 100, 100, 100, 100, 100);
    PDFormXObject form = new PDFormXObject(document);
    form.setResources(new PDResources());
    form.setBBox(rectangle);
    form.setFormType(1);

    form.getResources().getCOSObject().setNeedToBeUpdated(true);
    form.getResources().add(ximage);
    PDAppearanceStream appearanceStream = new PDAppearanceStream(form.getCOSObject());
    PDAppearanceDictionary appearance = new PDAppearanceDictionary(new COSDictionary());
    appearance.setNormalAppearance(appearanceStream);
    stamp.setAppearance(appearance);
    stamp.setRectangle(rectangle);
    PDPageContentStream stream = new PDPageContentStream(document, appearanceStream);
    Matrix matrix = new Matrix(100, 0, 0, 100, 100, 100);
    stream.drawImage(ximage, matrix);
    stream.close();
    //close and save   
    annotations.add(stamp);

    appearanceStream.getCOSObject().setNeedToBeUpdated(true);
    appearance.getCOSObject().setNeedToBeUpdated(true);
    rectangle.getCOSArray().setNeedToBeUpdated(true);
    stamp.getCOSObject().setNeedToBeUpdated(true);
    form.getCOSObject().setNeedToBeUpdated(true);
    COSArrayList<PDAnnotation> list = (COSArrayList<PDAnnotation>) annotations;
    COSArrayList.converterToCOSArray(list).setNeedToBeUpdated(true);
    document.getPages().getCOSObject().setNeedToBeUpdated(true);
    page.getCOSObject().setNeedToBeUpdated(true);
    document.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);

    OutputStream os = new FileOutputStream(file);
    document.saveIncremental(os);
    document.close();
    os.close();

}

When I use it on an unsigned document, the annotation is added and is visible. However, when I use it on a signed document, the annotation does not appear.

I have opened the PDF file in Notepad++, and the annotation does seem to have been added: I found the following dictionary, as well as the rest of the objects pertaining to the annotation:

<<
/Type /Annot
/Subtype /Stamp
/Name /testing#20rubber#20stamp
/Contents (this is a test)
/F 196
/AP 29 0 R
/Rect [100.0 100.0 200.0 200.0]
>>

However, it does not appear when I open the document in Adobe Reader. Perhaps this has more to do with the appearance streams than with the annotation itself?
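
As a side note on the dump above: the /F 196 entry can be decoded to check that the flags themselves are not hiding the annotation. The sketch below assumes the bit assignments from the annotation flags table of PDF 32000-1:2008; the class and method names are hypothetical and it does not use PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationFlags {
    // Flag names by bit position (index 0 = bit 1 = value 1),
    // assumed from the annotation flags table in PDF 32000-1:2008.
    private static final String[] NAMES = {
        "Invisible", "Hidden", "Print", "NoZoom", "NoRotate",
        "NoView", "ReadOnly", "Locked", "ToggleNoView", "LockedContents"
    };

    // Returns the names of all flags set in the given /F value, lowest bit first.
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                set.add(NAMES[bit]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(196)); // prints "[Print, ReadOnly, Locked]"
    }
}
```

196 is 4 + 64 + 128, which matches the setPrinted, setReadOnly and setLocked calls in the code; neither Hidden nor NoView is set.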


Answer:

The problem is that using PDDocument.save() creates a new document and thus invalidates the signature.

Using PDDocument.saveIncremental(...) does not invalidate the signature. However, it does not write other changes to the document (such as annotations or form filling); it is only used to save a signature.

Updating a signed PDF document with annotations or form filling is not yet possible with PDFBox 2.0 but should be possible once PDFBox 2.1 rolls out.

Information on the problem: here and here

Using iText's PdfStamper, however, solves the problem of adding annotations to a signed document without invalidating the signature, as answered here.
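
The difference between the two save modes can be checked without any PDF library: an incremental save appends to the file and leaves the bytes covered by the signature untouched, whereas a full save rewrites them. Below is a minimal sketch of such a check, assuming a copy of the signed file was kept before processing; IncrementalCheck and isAppendOnly are hypothetical names, not PDFBox API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class IncrementalCheck {
    // True if 'updated' is strictly longer than 'original' and begins with
    // exactly the bytes of 'original', i.e. data was only appended.
    static boolean isAppendOnly(byte[] original, byte[] updated) {
        if (updated.length <= original.length) {
            return false;
        }
        return Arrays.equals(original, 0, original.length,
                             updated, 0, original.length);
    }

    // Convenience overload that reads both files fully into memory.
    static boolean isAppendOnly(Path original, Path updated) throws IOException {
        return isAppendOnly(Files.readAllBytes(original), Files.readAllBytes(updated));
    }
}
```

If this returns false for the output of the processing step, the signed bytes were changed and the signature cannot be expected to validate. A true result only shows that the original bytes are intact, not that a viewer will accept the appended changes.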

Question:

I have a question regarding pdfbox 1.8.13. I am trying to read the entire text of a one-page PDF document. Adobe Reader can do the job; pdfbox reads almost the entire page but scrambles the first two and the last two lines of the document so that letters are interchanged.

Does anybody know how to solve such an issue? First, where should I ask? Second, how can I share the PDF with you? Third, can someone check whether the problem also exists in version 2.0.7 of pdfbox, which I understood is completely different and thus not straightforward to implement?

Thank you in advance for your help, Stephan

Adobe Reader:

ScalableCapitalHRB217778,AmtsgerichtMünchenSeite1von1 
VermögensverwaltungGmbHUSt-IdNr.DE300434774 
Prinzregentenstr. 
48Geschäftsführung:80538München 
ErikPodzuweit,FlorianPrucker 

pdfbox:

SVecramlaöbgleenCsavpeitrawlaltung GmbH UHSRtB-I2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
8P0ri5n3zr8egMeünntcehnesntr. 48 GEreikscPhoädftzsufwüheritu,nFglo: rian Prucker

Link to the PDF (I have verified that the problem is the same with the unmodified and the modified PDF that I have uploaded):

https://wetransfer.com/downloads/5930649bce9a1d1a686a0da63f1b9bce20170808071518/9b9140

P.S.: In the meantime, I have also tried PDDocument.loadNonSeq in pdfbox 1.8.13, but this resulted in the same problem.


Answer:

Thank you @tilman-hausherr for your helpful hints. With them, I managed to debug my problem.

You were right that leaving out the sorting option (I don't know why it was used before in the project that I now work on) resolved the scrambling issue even in pdfbox-1.8.13. And you were right that the text extraction result using pdfbox-2.0.7 gave even better results.

The relevant Java code snippets that I was using with pdfbox-1.8.13 were:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true);
String text = textStripper.getText(doc);

If I understand correctly, the API for simple text extraction is not identical in pdfbox-1.8.13 and pdfbox-2.0.7, but it is very similar: PDFTextStripper has just been moved from the util package to the text package:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
// textStripper.setSortByPosition(true);
String text = textStripper.getText(doc);

As you said, the command line tool was very helpful for finding out about all of this. Here are the results of the text extraction with the different options (https://pdfbox.apache.org/1.8/commandline.html and https://pdfbox.apache.org/2.0/commandline.html):

java -jar pdfbox-app-1.8.13.jar ExtractText -sort "20170801 Rechnung.pdf":

SVecramlaöbgleenCsavpeitrawl HRBPrinzregentenstra.l4tu8ng GmbH GUSest-I
2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
80538 München ErikcPhoädftzsufwüheritu,nFglo: rian Prucker

java -jar pdfbox-app-1.8.13.jar ExtractText "20170801 Rechnung.pdf":

Scalable CapitalVermögensverwaltung GmbHPrinzregentenstr. 4880538 München
HRB 217778, Amtsgericht MünchenUSt-IdNr. DE300434774Geschäftsführung:Erik 
Podzuweit, Florian Prucker
Seite 1 von 1

java -jar pdfbox-app-2.0.7.jar ExtractText -sort "20170801 Rechnung.pdf":

Scalable Capital HRB 217778, Amtsgericht München Seite 1 von 1
Vermögensverwaltung GmbH USt-IdNr. DE300434774
Prinzregentenstr. 48 Geschäftsführung:
80538 München Erik Podzuweit, Florian Prucker

java -jar pdfbox-app-2.0.7.jar ExtractText "20170801 Rechnung.pdf":

Scalable Capital
Vermögensverwaltung GmbH
Prinzregentenstr. 48
80538 München
HRB 217778, Amtsgericht München
USt-IdNr. DE300434774
Geschäftsführung:
Erik Podzuweit, Florian Prucker
Seite 1 von 1

So I think pdfbox-2.0.7 gives the nicest results in this case, especially without the -sort option. I don't know why the algorithms behave differently, though; pdfbox-1.8.13 gave the same result with or without the -nonSeq option.
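
Looking at the 1.8.13 -sort output, "SVecramlaöbgleen" reads like "Scalable" and "Vermögen" with their letters alternating, as if two neighbouring text runs had been treated as one line and ordered by horizontal position. The following toy model shows how that kind of interleaving can arise. It is only an illustration under that assumption, not PDFBox's actual sorting code; all names and coordinates are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortToy {
    // A single character with a made-up position on the page.
    static final class Glyph {
        final char c;
        final double x;
        final double y;
        Glyph(char c, double x, double y) { this.c = c; this.x = x; this.y = y; }
    }

    // Places the characters of s on one baseline with a fixed advance.
    static List<Glyph> layout(String s, double startX, double y, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), startX + i * advance, y));
        }
        return glyphs;
    }

    // Groups glyphs into lines by rounding y to multiples of yTolerance,
    // then orders each line by x. A tolerance that is too coarse merges lines.
    static String sortByPosition(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator
                .comparingLong((Glyph g) -> Math.round(g.y / yTolerance))
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        page.addAll(layout("ACE", 0, 100.0, 10)); // first run
        page.addAll(layout("bdf", 5, 100.4, 10)); // second run, almost the same y
        System.out.println(sortByPosition(page, 1.0));  // prints "AbCdEf"
        System.out.println(sortByPosition(page, 0.25)); // prints "ACEbdf"
    }
}
```

With the coarse tolerance both runs land in the same line bucket and their characters alternate; with the finer tolerance each run stays intact.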

Question:

My objective is to create a custom PDF manager and viewer desktop application, and I've decided to use JavaFX with the PDFBox library.

The problem I'm encountering right now is that the PDF files the application is supposed to open are protected with Adobe LiveCycle Rights Management, which requires a username and password to open them.

I've done some research but there doesn't seem to be a detailed solution for the same problem. Any help or a general direction from this roadblock is much appreciated!


Answer:

You'll need two things. 1) The Adobe LiveCycle Portable Protection Library which is a C++ SDK that allows programmatic access to the Rights Management component of your Adobe LiveCycle instance. Because PDFBox is Java, you'll need some way of wrapping that library. 2) An account on the Adobe LiveCycle Rights Management server that secured the file.

If you just have the files but can't access the server, you won't be able to decrypt the file.

Question:

I am using PDFBox to attach files to a PDF. The attachment itself works correctly, but I don't know what checksum to put.

When I manually attach a file and parse it through pdfBox, I see there is a checksum value like

I don't know which checksum algorithm is being used internally; it is certainly not MD5 or SHA.

Thanks.


Answer:

It IS MD5.

A quote from table 46, PDF 32000-1:2008:

CheckSum, string, (Optional) A 16-byte string that is the checksum of the bytes of the uncompressed embedded file. The checksum shall be calculated by applying the standard MD5 message-digest algorithm (described in Internet RFC 1321, The MD5 Message-Digest Algorithm; see the Bibliography) to the bytes of the embedded file stream.
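
A minimal sketch of computing such a checksum with the JDK's MessageDigest, assuming the uncompressed bytes of the file to embed are available as a byte array. The class and method names are hypothetical, and how the resulting 16 bytes are handed to PDFBox is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class EmbeddedFileChecksum {
    // Returns the 16-byte MD5 digest of the given (uncompressed) file bytes.
    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    // Formats bytes as lowercase hex, for comparison with other tools.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] digest = md5("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(digest.length + " bytes: " + toHex(digest));
        // prints "16 bytes: 900150983cd24fb0d6963f7d28e17f72"
    }
}
```

Note that the checksum in the PDF is the raw 16-byte string, not its 32-character hex representation, which may be why a value read back from a file does not look like a typical MD5 hash at first glance.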