Hot questions for Using PDFBox in pdfa

Question:

I am trying to create a PDF/A file using PDFBox. File generation succeeds, but the generated file is very large, sometimes 500 MB or even more. Is there any way to reduce the file size during generation?


Answer:

As discussed in the comments: a PDFont object for a specific font should be constructed only once; it can then be reused on different pages of one PDF.

Fonts should be subsetted (i.e. only the glyphs actually used are embedded); for that, use PDType0Font.load().

The same applies to PDXObjectImage objects, e.g. for a company logo: the PDXObjectImage should be created once and reused on different pages of one PDF.

PD objects (PDFont, PDXObjectImage, etc.) shouldn't be shared between different PDFs.

TrueTypeFont font objects can be reused in several documents:

TrueTypeFont ttf = new TTFParser().parse(file);
PDFont font1 = PDType0Font.load(document1, ttf, true); // last parameter should be false if used for acroForm fields
PDFont font2 = PDType0Font.load(document2, ttf, true);
PDFont font3 = PDType0Font.load(document3, ttf, true);

Question:

I have to check whether a PDF file is in PDF/A-1a format, using PDFBox or any other free Java library. I have searched a lot on Google but still couldn't find any code or technique for doing this.

How can I check this in Java?


Answer:

The PDFBox documentation shows how to do PDF/A-1b validation:

https://pdfbox.apache.org/cookbook/pdfavalidation.html

To do PDF/A-1a validation, simply change:

  parser.parse();

to:

 parser.parse(Format.PDF_A1A);

I was able to ascertain this from reading the parser source code located here:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/preflight/1.8.2/org/apache/pdfbox/preflight/parser/PreflightParser.java

Question:

I am using PDFBox Preflight to check whether a PDF document is in PDF/A-1b format. It works perfectly on Java 1.7, but when I run the code on Java 1.8 I get the following errors:

2.4.3 : Invalid Color space, DestOutputProfile is missing

2.4.3 : Invalid Color space, DestOutputProfile is missing

2.4.3 : Invalid Color space, DestOutputProfile is missing

7.11 : Error on MetaData

I am using PDFBox 1.8.8 and Preflight 1.8.3.

This is the code I am using to validate the PDFs:

ValidationResult result = null;
FileDataSource fd = new FileDataSource(InputFolder + listOfFiles[i].getName());
PreflightParser parser = new PreflightParser(fd);
try {
    parser.parse(Format.PDF_A1A);
    PreflightDocument documentt = parser.getPreflightDocument();
    documentt.validate();
    result = documentt.getResult();
    documentt.close();
} catch (SyntaxValidationException e) {
    result = e.getResult();
}
if (result.isValid()) {
    System.out.println("The file  is a valid PDF/A-1a file");
} else {
    System.out.println("The file is not valid, error(s) :");
    for (ValidationError error : result.getErrorsList()) {
        message = error.getErrorCode() + " : " + error.getDetails();
        fos.write(message.getBytes());
        fos.write(System.getProperty("line.separator").getBytes());
        // System.out.println(error.getErrorCode() + " : " + error.getDetails());
    }
}

Is PDFBox not compatible with Java 1.8, or am I doing something wrong?


Answer:

As solved in the comments: always use the same version of the PDFBox and the Preflight jar files, which is 1.8.8 at the time this response is written.
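
One way to catch such a mismatch early is to compare the versions the two jars report at startup. The following is only a sketch: the class VersionCheck and its methods are hypothetical names, and it assumes both jars carry an Implementation-Version entry in their manifest, which is not guaranteed for every build.

```java
public class VersionCheck {

    /** Returns the Implementation-Version of the jar that provides the class, or null if absent. */
    static String versionOf(Class<?> type) {
        Package pkg = type.getPackage();
        return pkg == null ? null : pkg.getImplementationVersion();
    }

    /** True only if both versions are known and identical. */
    static boolean versionsMatch(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        // With both libraries on the classpath one would pass, for example,
        // versionOf(PDDocument.class) and versionOf(PreflightParser.class).
        System.out.println(versionsMatch("1.8.8", "1.8.3")); // prints false
        System.out.println(versionsMatch("1.8.8", "1.8.8")); // prints true
    }
}
```

An unknown version (null) is deliberately treated as a mismatch, so a missing manifest entry is reported rather than silently accepted.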

Additional bonus advice: when getting results that you don't believe, get a "2nd opinion" with the free PDF-Tools PDF/A-1b validator.

If the results differ, open an issue in JIRA or try the 2.0 snapshots of PDFBox and Preflight.

Question:

I'm trying to create a PDF/A file using PDFBox 2. My code is based on the example code here. The code runs without errors, but when I validate the file with callas pdfPilot and veraPDF, there is no XMP metadata and no PDF/A version info. Also, the PDF file is version 1.4, not 1.7 as set in the code.

// TTF font needed for Unicode support in OCR texts
PDFont font = PDType0Font.load(document,
    PDDocument.class.getResourceAsStream("/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf"), true);

// Add metadata (needed by PDF/A)
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
try {
    DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
    dc.setTitle("THE DOCUMENT TITLE");
    dc.addCreator("THE AUTHOR");

    PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
    id.setPart(2);
    id.setConformance("B");

    XmpSerializer serializer = new XmpSerializer();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    serializer.serialize(xmp, baos, true);

    PDMetadata metadata = new PDMetadata(document);
    metadata.importXMPMetadata(baos.toByteArray());
    document.getDocumentCatalog().setMetadata(metadata);
} catch (BadFieldValueException e) {
    throw new IllegalArgumentException("", e);
}

// Set color profile (needed by PDF/A)
InputStream colorProfile = PDDocument.class.getResourceAsStream("/sRGB.icc");
PDOutputIntent intent = new PDOutputIntent(document, colorProfile);
intent.setInfo("sRGB IEC61966-2.1");
intent.setOutputCondition("sRGB IEC61966-2.1");
intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
intent.setRegistryName("http://www.color.org");
document.getDocumentCatalog().addOutputIntent(intent);

// Render all pages
for (IPage page : pages) {
    ((PdfboxPage)page).setFont(font);
    page.renderPage(this);
    document.addPage((PDPage) page.getPage());
}

document.setVersion(1.7f);
document.save(path);
document.close();

What am I doing wrong?

EDIT 1:

I can see there is the xpacket in the PDF file. It includes the metadata. But it looks like PDFBox doesn't write this data in a valid way (for veraPDF and pdfPilot).

EDIT 2:

Looks like PDFBox 2.0.12 builds invalid PDF/A. I converted the PDF using our commercial pdfPilot program. (PDF/A-1b)

PDFBox writes this to the PDF file (-> invalid in veraPDF and pdfPilot):

<?xpacket begin="
" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
         <dc:title>
            <rdf:Alt>
               <rdf:li lang="x-default">THE DOCUMENT TITLE</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>THE AUTHOR</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
      <rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="">
         <pdfaid:part>1</pdfaid:part>
         <pdfaid:conformance>B</pdfaid:conformance>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

pdfPilot writes this to the PDF file (-> valid in veraPDF and pdfPilot):

<?xpacket begin="
" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 81.159809, 2016/11/11-01:42:16        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
            xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
            xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
            xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
            xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">
         <dc:format>application/pdf</dc:format>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>AUTOR</rdf:li>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">TITEL</rdf:li>
            </rdf:Alt>
         </dc:title>
         <xmp:ModifyDate>2019-01-11T11:42:22+01:00</xmp:ModifyDate>
         <xmp:CreateDate>2019-01-11T11:42:21+01:00</xmp:CreateDate>
         <xmp:MetadataDate>2019-01-11T11:42:22+01:00</xmp:MetadataDate>
         <xmpMM:DocumentID>uuid:b60f88c2-aa89-11b2-0a00-104bbf060000</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:b61148b9-aa89-11b2-0a00-60d9faa0ff7f</xmpMM:InstanceID>
         <xmpMM:RenditionClass>default</xmpMM:RenditionClass>
         <xmpMM:VersionID>1</xmpMM:VersionID>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>converted</stEvt:action>
                  <stEvt:instanceID>uuid:b60f88c3-aa89-11b2-0a00-902dfba0ff7f</stEvt:instanceID>
                  <stEvt:parameters>converted to PDF/A-1b</stEvt:parameters>
                  <stEvt:softwareAgent>pdfaPilot</stEvt:softwareAgent>
                  <stEvt:when>2019-01-11T11:42:22+01:00</stEvt:when>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
         <pdfaid:part>1</pdfaid:part>
         <pdfaid:conformance>B</pdfaid:conformance>
         <pdfaExtension:schemas>
            <rdf:Bag>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://ns.adobe.com/xap/1.0/mm/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>xmpMM</pdfaSchema:prefix>
                  <pdfaSchema:schema>XMP Media Management Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>UUID based identifier for specific incarnation of a document</pdfaProperty:description>
                           <pdfaProperty:name>InstanceID</pdfaProperty:name>
                           <pdfaProperty:valueType>URI</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>The common identifier for all versions and renditions of a document.</pdfaProperty:description>
                           <pdfaProperty:name>OriginalDocumentID</pdfaProperty:name>
                           <pdfaProperty:valueType>URI</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://www.aiim.org/pdfa/ns/id/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>pdfaid</pdfaSchema:prefix>
                  <pdfaSchema:schema>PDF/A ID Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Part of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>part</pdfaProperty:name>
                           <pdfaProperty:valueType>Integer</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Amendment of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>amd</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Conformance level of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>conformance</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
            </rdf:Bag>
         </pdfaExtension:schemas>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

And if I statically write this to the PDF file it produces a valid PDF/A file:

String xmpData = "<?xpacket ......";
PDMetadata metadata = new PDMetadata(document);
metadata.importXMPMetadata(xmpData.getBytes());

EDIT 3:

Adding this is valid and short:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" >
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
            <dc:format>application/pdf</dc:format>
            <dc:creator>
            <rdf:Seq>
                <rdf:li>AUTOR</rdf:li>
            </rdf:Seq>
            </dc:creator>
            <dc:title>
            <rdf:Alt>
                <rdf:li xml:lang="x-default">TITEL</rdf:li>
            </rdf:Alt>
            </dc:title>
            <pdfaid:part>1</pdfaid:part>
            <pdfaid:conformance>B</pdfaid:conformance>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

Answer:

There is a difference between the XML produced by the CreatePDFA example

<rdf:li xml:lang="x-default">THE DOCUMENT TITLE</rdf:li> 

to what you got

<rdf:li lang="x-default">THE DOCUMENT TITLE</rdf:li>

and this reminded me of a problem we had 1 1/2 years ago and that was discussed here.

So to quote from my answer from 2017: This code

Transformer transformer = TransformerFactory.newInstance().newTransformer();

should return a com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl class. If not, then call

Transformer transformer =
    TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer(); 

or set a system property:

System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");

What I can't answer (because you didn't tell) is how you ended up having this transformer, and what will happen to the rest of your application if you change it.
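
To see which transformer an application actually picks up, and how it writes the xml:lang attribute, a small JDK-only check along these lines may help. It is a sketch rather than part of PDFBox: the class name TransformerCheck and its methods are made up for illustration, and the output depends on which XSLT implementations are on the classpath.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TransformerCheck {

    /** Class name of the Transformer that the default factory lookup returns. */
    static String transformerClassName() {
        try {
            return TransformerFactory.newInstance().newTransformer().getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a Transformer", e);
        }
    }

    /** Serializes a single rdf:li element carrying xml:lang with the default Transformer. */
    static String serializeLangAttribute() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element li = doc.createElementNS(
                    "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "rdf:li");
            li.setAttributeNS(XMLConstants.XML_NS_URI, "xml:lang", "x-default");
            li.setTextContent("THE DOCUMENT TITLE");
            doc.appendChild(li);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transformerClassName());
        System.out.println(serializeLangAttribute());
    }
}
```

If the second line of output shows lang="x-default" without the xml: prefix, the transformer in use is a likely cause of the invalid XMP described in the question.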

Question:

The Preflight tool (version 2.0.15) validates the generated PDF file (created with PDFBox 2.0.15) as correct, but the online pdf-tools validator (e.g. https://www.pdf-online.com/osa/validate.aspx) rejects it. I am getting the error below:

Compliance pdfa-1b Result Document does not conform to PDF/A. Details Validating file "file.pdf" for conformance level pdfa-1b

Anonymous RDF resources (rdf:Description without rdf:about attribute) are not allowed in XMP Metadata.

The appearance dictionary doesn't contain an entry.

The appearance dictionary doesn't contain an entry.

The appearance dictionary doesn't contain an entry.

The appearance dictionary doesn't contain an entry.

The appearance dictionary doesn't contain an entry.

The document does not conform to the requested standard.

The document contains annotations or form fields with ambigous or without appropriate appearances.

The document's meta data is either missing or inconsistent or corrupt. The document does not conform to the PDF/A-1b standard.

Done.

In order to generate metadata I use below code:

private void addMetadata(PDDocument pdDocument,final String zzz,final String yyy) {

    PDDocumentCatalog catalog = pdDocument.getDocumentCatalog();
    PDDocumentInformation info = pdDocument.getDocumentInformation();
    info.setCreationDate(Calendar.getInstance());
    info.setModificationDate(Calendar.getInstance());
    info.setAuthor(metadataAuthor);
    info.setProducer(metadataProducer);
    info.setTitle(zzz + "_" + yyy);
    info.setKeywords("aaa");
    info.setCreator("aaa");
    info.setSubject("aaa");

    PDMarkInfo markInfo = new PDMarkInfo();
    markInfo.setMarked(true);
    catalog.setMarkInfo(markInfo);

    try {
        PDMetadata metadataStream = new PDMetadata(pdDocument);
        catalog.setMetadata( metadataStream );

        XMPMetadata xmp = new XMPMetadata();
        XMPSchemaPDFAId pdfaid = new XMPSchemaPDFAId(xmp);
        xmp.addSchema(pdfaid);
        pdfaid.setConformance("B");
        pdfaid.setPart(1);
        pdfaid.setAbout("");

        XMPSchemaDublinCore dcSchema = xmp.addDublinCoreSchema();
        dcSchema.setTitle( info.getTitle() );
        dcSchema.addCreator("aaa");
        dcSchema.setDescription( info.getSubject() );

        XMPSchemaPDF pdfSchema = xmp.addPDFSchema();
        pdfSchema.setKeywords( info.getKeywords() );
        pdfSchema.setProducer( info.getProducer() );

        XMPSchemaBasic basicSchema = xmp.addBasicSchema();
        basicSchema.setModifyDate( info.getModificationDate() );
        basicSchema.setCreateDate( info.getCreationDate() );
        basicSchema.setCreatorTool( info.getCreator() );

        metadataStream.importXMPMetadata(xmp.asByteArray());

        InputStream colorProfile = getClass().getClassLoader().getResourceAsStream("icm/sRGB Color Space Profile.icm");
        // create output intent

        PDOutputIntent oi = new PDOutputIntent(pdDocument, colorProfile); 
        String value = "sRGB IEC61966-2.1";
        oi.setInfo(value); 
        oi.setOutputCondition(value); 
        oi.setOutputConditionIdentifier(value); 
        oi.setRegistryName("http://www.color.org"); 
        catalog.addOutputIntent(oi);

    } catch (Exception e) {
        e.printStackTrace();
    }

}

Any suggestions?


Answer:

As discussed in the comments:

1) The failure to report "The appearance dictionary doesn't contain an entry" is a bug in PDFBox preflight that will be fixed in 2.0.17, see PDFBOX-4586. According to this document:

An ISO 19005-1 validator shall FAIL otherwise conforming files in which a widget annotation lacks an appearance dictionary

2) The "rdf:Description without rdf:about attribute" may or may not be a bug. VeraPDF doesn't consider it to be one. Your code uses a 1.8.* version. For these, you can call dcSchema.setAbout("") to fix this. In 2.0.* the problem doesn't occur if you create the schema with metadata.createAndAddDublinCoreSchema().

I have created an issue in the VeraPDF project and they will bring this question for discussion at the next meeting of the Validation technical working group.

3) The widgets didn't contain an entry because, at the time setValue() was called, not enough information was present (e.g. the rectangle). That is why you got the message "widget of field aa has no rectangle, no appearance stream created".
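
The rdf:about complaint from point 2 can be checked on a serialized XMP packet before it is written into the PDF. The following JDK-only sketch counts rdf:Description elements without an rdf:about attribute; the class and method names are hypothetical, and it only looks at the namespace-qualified form of the attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RdfAboutCheck {

    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns how many rdf:Description elements lack an rdf:about attribute. */
    static int countMissingAbout(String xmpXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xmpXml.getBytes(StandardCharsets.UTF_8)));
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            int missing = 0;
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                // An empty value (rdf:about="") still counts as present.
                if (!description.hasAttributeNS(RDF_NS, "about")) {
                    missing++;
                }
            }
            return missing;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"\"/>"
                + "<rdf:Description/>"
                + "</rdf:RDF></x:xmpmeta>";
        System.out.println(countMissingAbout(xmp)); // prints 1
    }
}
```

With the code from the question, one could pass new String(xmp.asByteArray(), "UTF-8") to countMissingAbout before calling importXMPMetadata.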

Question:

I have written a Java program that uses the PDFBox library to build PDFs from images and other PDFs. This works fine, but I want to generate PDF/A-1, and the problem is that I can't embed a color space.

I've tried the code of CreatePDFA.java that is given by PDFBox

// Create output intent
InputStream colorProfile = CreatePDFA.class.getResourceAsStream("colorSpacePath");
PDOutputIntent oi = new PDOutputIntent(doc, colorProfile); 
oi.setInfo("sRGB IEC61966-2.1"); 
oi.setOutputCondition("sRGB IEC61966-2.1"); 
oi.setOutputConditionIdentifier("sRGB IEC61966-2.1"); 
oi.setRegistryName("http://www.color.org"); 
doc.getDocumentCatalog().addOutputIntent(oi);

I get a NullPointerException at the line: PDOutputIntent oi = new PDOutputIntent(doc, colorProfile);

Exception:

Exception in thread "main" java.lang.NullPointerException
at java.desktop/java.awt.color.ICC_Profile.getProfileDataFromStream(ICC_Profile.java:1034)
at java.desktop/java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1016)
at org.apache.pdfbox.pdmodel.graphics.color.PDOutputIntent.configureOutputProfile(PDOutputIntent.java:112)
at org.apache.pdfbox.pdmodel.graphics.color.PDOutputIntent.<init>(PDOutputIntent.java:49)
at src.Kairos.CreatePDFA.doIt(CreatePDFA.java:124)
at src.Kairos.CreatePDFA.main(CreatePDFA.java:153)

Answer:

Here is the code that works. It reads the color profile from the file system instead of the classpath:

InputStream colorProfile = new FileInputStream(colorSpacePath);
PDOutputIntent oi = new PDOutputIntent(doc, colorProfile);
oi.setInfo("sRGB IEC61966-2.1");
oi.setOutputCondition("sRGB IEC61966-2.1");
oi.setOutputConditionIdentifier("sRGB IEC61966-2.1");
oi.setRegistryName("http://www.color.org");
doc.getDocumentCatalog().addOutputIntent(oi);
colorProfile.close();
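
The stack trace in the question is consistent with Class.getResourceAsStream() returning null, which it does when the named resource is not on the classpath; the null stream then fails inside ICC_Profile. If the profile should stay on the classpath, a small guard like the following makes the cause obvious. This is a sketch, and RequiredResource is a hypothetical helper, not part of PDFBox.

```java
import java.io.InputStream;

public class RequiredResource {

    /** Opens a classpath resource, failing with a clear message instead of returning null. */
    static InputStream open(Class<?> anchor, String name) {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            throw new IllegalArgumentException("Classpath resource not found: " + name);
        }
        return in;
    }

    public static void main(String[] args) {
        try {
            // "/no-such-profile.icc" is a deliberately missing, made-up resource name.
            open(RequiredResource.class, "/no-such-profile.icc");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The returned stream could then be handed to new PDOutputIntent(doc, colorProfile) as in the code above.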