Hot questions for Using PDFBox in encryption

Question:

Following this answer I'm trying to decrypt a pdf-document with pdfbox:

PDDocument pd = PDDocument.load(path);
if (pd.isEncrypted()) {
    try {
        pd.decrypt("");
        pd.setAllSecurityToBeRemoved(true);
    } catch (Exception e) {
        throw new Exception("The document is encrypted, and we can't decrypt it.");
    }
}

This leads to

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1601)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:948)
...
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
...

The path is correct, so I don't know what's going on. Furthermore, the Javadoc of the PDDocument.decrypt(String pw) method says:

"This will decrypt a document. This method is provided for compatibility reasons only. User should use the new security layer instead and the openProtection method especially."

What does that mean? Could someone give an example of how to decrypt a PDF document correctly with PDFBox?


Answer:

See the dependency list: https://pdfbox.apache.org/1.8/dependencies.html

You need the Bouncy Castle libraries on your classpath; the NoClassDefFoundError in your stack trace means they are missing.

<dependency>
  <groupId>org.bouncycastle</groupId>
  <artifactId>bcprov-jdk15</artifactId>
  <version>1.44</version>
</dependency>
<dependency>
  <groupId>org.bouncycastle</groupId>
  <artifactId>bcmail-jdk15</artifactId>
  <version>1.44</version>
</dependency>
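To confirm that the dependency actually ended up on the runtime classpath, a quick probe like the following can help. This is only a sketch: the class name ClasspathProbe and the method isOnClasspath are made up for illustration, and the provider class name is the one taken from the stack trace in the question.

```java
// Minimal classpath probe (hypothetical helper, not part of PDFBox).
final class ClasspathProbe {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the NoClassDefFoundError in the question
        String provider = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println(provider + " available: " + isOnClasspath(provider));
    }
}
```

If this reports false in the same environment where decryption fails, the jar is missing at runtime even if the project compiles.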

The decrypt() call is indeed deprecated in the current version (1.8.9). Use

pd.openProtection(new StandardDecryptionMaterial(""));

instead.

Additional advice: download the source code package. You'll find many examples that will help you further.

Question:

I'm using PDFBox to extract text from forms, and I have a PDF that is not protected by a password but that PDFBox reports as encrypted. I suspect some sort of Adobe "feature", since the title bar says (SECURED) when I open it, while other PDFs that I don't have issues with do not. isEncrypted() returns true, so despite not having a password it appears to be secured somehow.

I suspect that it is not decrypting properly, as it is able to pull the form's text prompts but not the responses themselves. In the code below it pulls Address (Street Name and Number) and City from the sample PDF, but not the response in between them.

I am using PDFBox 2.0, but I have also tried 1.8.

I've tried every method of decrypting that I can find for PDFBox, including the deprecated ones (why not). I get the same result as not trying to decrypt at all, just the Address and City prompts.

With PDFs being the absolute nightmare that they are, this PDF was likely created in some non-standard way. Any help in identifying this and getting moving again is appreciated.

Sample PDF

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDPage;
import java.awt.Rectangle;
import java.util.List;


class Scratch {

    private static float pwidth;
    private static float pheight;

    private static int widthByPercent(double percent) {
        return (int)Math.round(percent * pwidth);
    }

    private static int heightByPercent(double percent) {
        return (int)Math.round(percent * pheight);
    }

    public static void main(String[] args) {
        try {
            //Create objects
            File inputStream = new File("ocr/TestDataFiles/i-9_08-07-09.pdf");

            PDDocument document = PDDocument.load(inputStream);

            // Try every decryption method I've found
            if(document.isEncrypted()) {

                // Method 1
                document.decrypt("");

                // Method 2
                document.openProtection(new StandardDecryptionMaterial(""));

                // Method 3
                document.setAllSecurityToBeRemoved(true);

                System.out.println("Removed encryption");
            }

            PDFTextStripperByArea stripper = new PDFTextStripperByArea();

            //Get the page with data on it
            PDPageTree allPages = document.getDocumentCatalog().getPages();
            PDPage page = allPages.get(3);

            pheight = page.getMediaBox().getHeight();
            pwidth = page.getMediaBox().getWidth();

            Rectangle LastName = new Rectangle(widthByPercent(0.02), heightByPercent(0.195), widthByPercent(0.27), heightByPercent(0.1));
            stripper.addRegion("LastName", LastName);
            stripper.setSortByPosition(true);
            stripper.extractRegions(page);
            List<String> regions = stripper.getRegions();

            System.out.println(stripper.getTextForRegion("LastName"));

        } catch (Exception e){
            System.out.println(e.getMessage());
        }
    }
}

Answer:

Bruno's comment explains why the PDF is encrypted even though you do not need to enter a password:

A PDF can be encrypted with two passwords: a user password and an owner password. When a PDF is encrypted with a user password, you can't open the document in a PDF viewer without entering that password. When a PDF is encrypted with an owner password only, everyone can open a PDF without that password, but some restrictions may be in place. You can recognize PDFs encrypted with an owner password because they mention "SECURED" in Adobe Reader.
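The "restrictions" mentioned there are stored as a bit field (the P entry of the encryption dictionary). As a rough illustration, the sketch below decodes four of those bits, using the 1-based bit positions from the PDF specification's user access permissions table (3 = print, 4 = modify, 5 = copy/extract, 6 = annotate or fill in form fields). The class PermissionBits and the value -44 are made up for the example and are not taken from the sample PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that decodes a few PDF permission bits; not part of PDFBox.
final class PermissionBits {

    /** Tests a permission bit; bit positions are 1-based, as in the PDF specification. */
    static boolean isSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** Decodes the four classic permission flags from the P value. */
    static Map<String, Boolean> decode(int p) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        flags.put("print", isSet(p, 3));
        flags.put("modify", isSet(p, 4));
        flags.put("extract", isSet(p, 5));
        flags.put("annotate", isSet(p, 6));
        return flags;
    }

    public static void main(String[] args) {
        // -44 is an arbitrary illustrative P value
        System.out.println(decode(-44));
    }
}
```

A viewer that honors these flags opens the document without asking for a password (the user password is empty) but disables the operations whose bits are cleared.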

Your PDF is encrypted using only an owner password, i.e. the user password is empty. Thus, you can decrypt it using the empty password "" like this in your PDFBox version:

document.decrypt("");

(This "method 1", by the way, is exactly the same as your "method 2"

document.openProtection(new StandardDecryptionMaterial(""));

plus some exception wrapping.)


Tilman's comment implies the reason why you don't retrieve the form values: Your code uses the PDFTextStripperByArea to do text extraction, but this text extraction only extracts the fixed page content, not the content of the annotations floating on that page.

The content you want to extract is the content of form fields whose widgets are annotations.

Tilman's proposal

doc.getDocumentCatalog().getAcroForm().getField("form1[0].#subform[3].address[0]").getValueAsString()

shows how to extract the value of a form field you know the name of, "form1[0].#subform[3].address[0]" in this case. If you don't know the name of the field you want to extract content from, the PDAcroForm object returned by doc.getDocumentCatalog().getAcroForm() has a number of other methods to access field contents.
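If you don't know the field names up front, one option is to walk all fields and collect their values. The following is a minimal sketch assuming PDFBox 2.0 or later; the class FormFieldDump and the method fieldValues are made-up names, while getAcroForm(), getFieldTree(), getFullyQualifiedName() and getValueAsString() are PDFBox calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Hypothetical helper: collects fully qualified field name -> value as string.
final class FormFieldDump {

    static Map<String, String> fieldValues(PDDocument document) {
        Map<String, String> values = new LinkedHashMap<>();
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return values; // the document has no AcroForm definition at all
        }
        // The field tree iterates over all fields, including nested ones
        for (PDField field : acroForm.getFieldTree()) {
            values.put(field.getFullyQualifiedName(), field.getValueAsString());
        }
        return values;
    }
}
```

Called on the document loaded in the question's main method, this should yield one entry per form field, so the names you need (such as the address field above) can be read off the printed map.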


By the way, a field name like "form1[0].#subform[3].address[0]" in the AcroForm definition indicates yet another specialty of your PDF: It actually contains two form definitions, the core PDF AcroForm definition and the more independent XFA definition. Both describe the same visual form. Such a PDF form is called a hybrid PDF form.

The advantage of hybrid forms is that they can be viewed and filled in using PDF tools which only know AcroForm forms (which is essentially all software except Adobe's) while PDF tools with XFA support (essentially only Adobe's software) can make use of additional XFA features.

The drawback of hybrid forms is that if they are filled in using a tool without XFA support, only the AcroForm information is updated while the XFA information remains as before. Thus, the hybrid document can contain different data for the same field...