Hot questions for using PDFBox in Maven

Question:

Greets,

I'm currently trying to read the text of a PDF document. After trying more than 15 different solutions, the code still throws this error when I run it with the command "java -jar pdfx.jar":

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument
        at com.test.pdf.Main.main(Main.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.pdfbox.pdmodel.PDDocument
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 1 more

Main.java

package com.test.pdf;

import java.io.*;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.text.PDFTextStripper;


public class Main {

    public static void main(String[] args) {
        PDDocument pd;
        BufferedWriter wr;
        try {
            File input = new File("C:/Users/Test/Desktop/check.pdf");
            File output = new File("C:/Users/Test/Desktop/Ergebnis.txt");
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("Copy.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(1);
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
            if (pd != null) {
                pd.close();
            }
            wr.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

And the pom

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>PDFReader</groupId>
  <artifactId>PDFReader</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
          <archive>
            <manifest>
              <mainClass>com.test.pdf.Main</mainClass>
            </manifest>          
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>  
    <!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.2</version>
    </dependency>
    <dependency> 
      <groupId>org.apache.pdfbox</groupId> 
      <artifactId>pdfbox</artifactId> 
      <version>2.0.11</version> 
    </dependency> 
  </dependencies> 
</project>

The jars for PDFBox and commons-logging are already added to the classpath. The build runs normally without an error. The PDF file is located on my desktop, where I move the jar after the build and run it with cmd.


Answer:

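The jar built by maven-jar-plugin contains only your own classes, not your dependencies, and java -jar ignores any other classpath setting, so PDFBox cannot be found at run time. Put this plugin in your pom, rebuild with mvn package, and execute java -jar PDFReader-0.0.1-SNAPSHOT-jar-with-dependencies.jar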

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                    <configuration>
                        <archive>
                            <manifest>
                                <mainClass>
                                    com.test.pdf.Main
                                </mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                </execution>
            </executions>
        </plugin>
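
To see whether a given jar really has PDFBox available at run time, a small diagnostic along these lines can help. This is only a sketch: the class name ClasspathCheck and the method isAvailable are made up for illustration, and it relies on nothing but the standard Class.forName call.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names may be passed as arguments; the default is the class from the stack trace.
        String[] names = args.length > 0
                ? args
                : new String[] {"org.apache.pdfbox.pdmodel.PDDocument"};
        for (String name : names) {
            System.out.println(name + " -> " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Calling isAvailable at the start of your own main, before touching any PDFBox class, should tell you whether the jar you launched carries its dependencies or not.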

Question:

I am writing a plugin for Bitbucket Server, in which I have to deal with rendering PDFs to images. I use PDFBox for this purpose. I have a pdfToPng method that will be called to do the processing and I have modified the PDF renderer similarly to what the Apache examples suggest.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.rendering.PageDrawer;
import org.apache.pdfbox.rendering.PageDrawerParameters;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class PDFProcessor {

    private static Logger logger = LoggerFactory.getLogger(PDFProcessor.class);

    private static int _colorMode;

    /**
     * Convert a page of a PDF document into a colored PNG
     * @param color color to use, 0 - black, 1- red
     * @param fileName output file
     * @param pdfFile PDF to process
     * @throws IOException
     */
    public static void pdfToPng(int color, String fileName, File pdfFile) throws IOException{
        _colorMode = color;
        try (PDDocument pdfDoc = PDDocument.load(pdfFile)){
            logger.info("Begin PDF to PNG render for " + fileName);
            PDFRenderer renderer = new CustomPdfRenderer(pdfDoc);
            BufferedImage image = renderer.renderImageWithDPI(0,600);
            logger.info("Making PNG transparent...");
            BufferedImage transparentImage = ImageProcessor.makeTransparent(image,new Color(image.getRGB(0,0)));
            ImageIO.write(transparentImage, "PNG",new File(fileName+".png"));
            logger.info("Image processed successfully, writing to " + fileName + ".png");
        }
    }

    private static class CustomPdfRenderer extends PDFRenderer{
        CustomPdfRenderer(PDDocument document){
            super(document);
        }

        @Override
        protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException{
            return new CustomPageDrawer(params);
        }
    }

    private static class CustomPageDrawer extends PageDrawer{
        CustomPageDrawer(PageDrawerParameters params) throws IOException{
            super(params);
        }

        @Override
        protected Paint getPaint(PDColor color) throws IOException{
            if ((color.toRGB() == (Color.BLACK.getRGB() & 0x00FFFFFF)) && (_colorMode == 1)){
                return Color.RED;
            }
            return super.getPaint(color);
        }
    }
}

The project is built using Maven, as this is what Bitbucket uses for plugin development. However, when the method is actually called, I get a ClassNotFoundException stating:

[INFO] Caused by: java.lang.NoClassDefFoundError: org/apache/pdfbox/rendering/PDFRenderer
[INFO]  at com.my-plugin.DiffManager.prepareDiff(DiffManager.java:99)
[INFO]  at com.my-plugin.DiffManager.doGet(DiffManager.java:67)
[INFO]  at com.my-plugin.DiffManager.doPost(DiffManager.java:84)
[INFO]  ... 33 common frames omitted
[INFO] Caused by: java.lang.ClassNotFoundException: org.apache.pdfbox.rendering.PDFRenderer not found by com.my-plugin.integrationPlugin [219]
[INFO]  at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1532)
[INFO]  ... 36 common frames omitted

As you can see, I have imported the PDFRenderer class. My pom.xml also contains the correct dependency definition:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.14</version>
    <scope>provided</scope>
</dependency>

Why can't Java find the PDFRenderer class? The other dependencies that are declared in this way never have this problem.


Answer:

You are using the dependency of pdfbox with <scope>provided</scope>. Who do you expect to provide this dependency when it is needed? An application container like Tomcat?

I suggest that you delete the scope line, so that the dependency implicitly gets compile scope, which should solve the problem for you.

Please also consult the Maven scope docs, which say:

  • compile – This is the default scope, used if none is specified. Compile dependencies are available in all classpaths of a project.

  • provided – This is much like compile, but indicates you expect the JDK or a container to provide the dependency at runtime.
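
If you are unsure who actually provides a class at run time, printing where it was loaded from can make that visible. The following is a minimal sketch using only JDK calls; ClassOrigin and describe are hypothetical names, and the demo uses JDK classes so that it runs anywhere. In a plugin you would pass one of the classes from your other provided dependencies instead.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClassOrigin {

    /** Describes which location and class loader a class was loaded from. */
    public static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        ClassLoader loader = type.getClassLoader();
        return type.getName()
                + " loaded from " + (location == null ? "<no code source>" : location)
                + " by " + (loader == null ? "<bootstrap loader>" : loader);
    }

    public static void main(String[] args) {
        // A JDK class: loaded by the bootstrap loader.
        System.out.println(describe(String.class));
        // An application class: loaded from the directory or jar it was started from.
        System.out.println(describe(ClassOrigin.class));
    }
}
```

A provided dependency that works will typically show a location belonging to the host application rather than to your own artifact; a class that nobody provides fails with the NoClassDefFoundError shown in the question before anything can be printed.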

Question:

I'm trying to generate PDFs on the fly with PDFBox. When I try to load a new font into PDFBox I get an exception stating that "head is mandatory". It seems that this is normal:

PDType0Font pdfFont = PDType0Font.load(doc, fontFile);

https://issues.apache.org/jira/browse/PDFBOX-3260

What I'm having trouble finding out is: do some fonts just not come with this 'head'? I've tried downloading from a couple of other sources with the same result. Is there a (free) way that I can modify a TTF file so that it will meet this requirement?

The client will be satisfied with nothing but Calibri.

Thanks

EDIT:

As suggested in the link above I tried using a Resource and an InputStream to get the font

Resource fontResource = appContext.getResource("classpath:/WEB-INF/classes/reports/calibri/calibri.ttf");

and

InputStream fontFile = new FileInputStream(new File(pathToFile));

and I've also added the following to my maven resources plugin:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-resources-plugin</artifactId>
    <version>2.4.3</version>
    <configuration>
        <resources>
            <resource>
                <directory>src/main/resources/reports/calibri</directory>
                <filtering>false</filtering>
            </resource>
        </resources>
        <encoding>${project.encoding}</encoding>
    </configuration>
</plugin>

These maven changes were based on this: https://maven.apache.org/plugins/maven-resources-plugin/examples/filter.html

The result is the same: "head is mandatory".

Admittedly, I don't know too much about maven (or fonts or java...)


Answer:

So, it turned out I was not correctly excluding the font from resource filtering in Maven. The following did the trick.

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-resources-plugin</artifactId>
            <version>2.4.3</version>
            <configuration>
                <encoding>${project.encoding}</encoding>
                <nonFilteredFileExtensions>
                    <nonFilteredFileExtension>ttf</nonFilteredFileExtension>
                </nonFilteredFileExtensions>
            </configuration>
        </plugin>

Following suggestions from @Tilman Hausherr, I tried loading the font in a bare-bones Java main method and confirmed that it worked OK there, and then went back to look at Maven again.
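
A check of this kind does not even need PDFBox. A TrueType file starts with a 12-byte header followed by one 16-byte record per table, each beginning with a 4-byte tag such as "head". The sketch below lists those tags using only the JDK; FontTableCheck and readTableTags are hypothetical names, and it assumes a plain .ttf file rather than a font collection. Running it on the font in src/main/resources and on the copy Maven placed in the build output shows whether the copy was damaged.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FontTableCheck {

    /** Reads the table tags listed in the table directory of a TrueType font stream. */
    public static List<String> readTableTags(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        data.readInt();                            // sfnt version
        int numTables = data.readUnsignedShort();  // number of table records
        data.readUnsignedShort();                  // searchRange
        data.readUnsignedShort();                  // entrySelector
        data.readUnsignedShort();                  // rangeShift
        List<String> tags = new ArrayList<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            data.readFully(tag);                   // 4-byte tag such as "head" or "glyf"
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            data.readInt();                        // checksum
            data.readInt();                        // offset
            data.readInt();                        // length
        }
        return tags;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FontTableCheck <path-to-ttf>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> tags = readTableTags(in);
            System.out.println("tables: " + tags);
            System.out.println(tags.contains("head") ? "head table present" : "head table MISSING");
        }
    }
}
```

If the build-output copy prints garbage tags, a missing head table, or fails with an EOFException while the source file looks fine, the file was most likely altered on its way through the build.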

Thanks

Question:

I have tried the PDFTextStripperByArea and PDPageContentStream classes to extract the number values from my PDF file. They work fine!

But my requirement is to use the PDFTable or PDFTableExtractor class to read the PDF contents. Can you tell me which Maven dependency and jar file I need in order to access the above-mentioned classes? Please also mention the methods required to get the values from a particular position.

I have another doubt. Can we extract table-formatted data from a PDF file as it is? I mean the data with rows and columns, including the table lines. If a page contains some text and a table, can we read only the table headers and the rows? I have uploaded an image of my page to GitHub. From that image, I only need the values of Gross premium, GST and Total Payable. Please let me know whether it's possible.


Answer:

First, don't use classes from the com.lowagie packages. That code is old, obsolete and no longer supported. Furthermore, this code belonged to a very early version of iText.

Afterwards, a thorough investigation was done into the intellectual property rights of all the code (since iText has had a lot of contributors). When you use the old code, you may (unknowingly) be using code for which you do not have the copyright.

Second, if you just want to solve the problem of extracting numbers and tables from a PDF document, have a look at pdf2Data. It's an iText add-on that makes things a lot easier.

It gives you a nice UI, where you can build templates for data extraction. Then you can call a single method to match an existing (XML) template against an input PDF document, and you get a data structure that contains all the information about the match.

http://pdf2data.online/

Question:


Answer:

As the error says, there is a NoClassDefFoundError for org/bouncycastle/jce/provider/BouncyCastleProvider. In that case you can pull in the library through Maven by adding the following inside your <dependencies>:

<!-- https://mvnrepository.com/artifact/org.bouncycastle/bcprov-jdk15on -->
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcprov-jdk15on</artifactId>
    <version>1.54</version>
</dependency>
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcmail-jdk15on</artifactId>
    <version>1.54</version>
</dependency>
<dependency>
    <groupId>org.bouncycastle</groupId>
    <artifactId>bcpkix-jdk15on</artifactId>
    <version>1.54</version>
</dependency>

This should let you import the package and use the required class in your code.

More dependencies may be needed, see here.

Question:

This is my first time using a Maven repository, so apologies if it's a simple resolution.

My code is as follows:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessRead;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class application {

public static void main(String args[]) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("/Users/Desktop/Corporate reports/previous 'fetch' items/ARM2009.pdf");
    try {
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(5);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        System.out.println("Failed to parse : " +file);
    } 
}

}

Essentially, the 19th line, where it says:

PDFParser parser = new PDFParser(new FileInputStream(file));

is giving an error at compile time. It is saying:

The constructor PDFParser(FileInputStream) is undefined

I am not sure how to handle this. My IDE recommends casting the argument to RandomAccessRead, but this just ends up with a different error at run time.

Please help, thank you.


Answer:

If you compare the Javadocs for PDFParser in PDFBox v2 vs. v1.8, you will notice that the constructor signature has changed from

PDFParser(InputStream input)

to

PDFParser(RandomAccessRead source)

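So please make sure you reference the correct version from Maven. If you plan to stick to version 2, make sure to pass something like org.apache.pdfbox.io.RandomAccessFile, not a FileInputStream. In version 2 you can also skip the parser entirely and call PDDocument.load(file), which does the parsing for you.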
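
One plausible reason for the change (an assumption on my part, not something the Javadocs state) is that a PDF is normally read starting from its end: the startxref keyword near the end of the file gives the byte offset of the cross-reference table, so a parser wants to seek rather than read sequentially. The sketch below illustrates that idea with the JDK's own java.io.RandomAccessFile, which is not the PDFBox class of the same simple name. PdfTailReader, readTail and findStartXref are illustrative names only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PdfTailReader {

    /** Returns up to the last maxBytes bytes of the file, decoded as ISO-8859-1. */
    public static String readTail(String path, int maxBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            int n = (int) Math.min(maxBytes, length);
            raf.seek(length - n);              // jump straight to the tail, no sequential read
            byte[] buf = new byte[n];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.ISO_8859_1);
        }
    }

    /** Returns the number that follows the last "startxref" keyword, or -1 if there is none. */
    public static long findStartXref(String tail) {
        int idx = tail.lastIndexOf("startxref");
        if (idx < 0) {
            return -1;
        }
        int i = idx + "startxref".length();
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;                               // skip the line break after the keyword
        }
        int start = i;
        while (i < tail.length() && Character.isDigit(tail.charAt(i))) {
            i++;                               // collect the decimal offset
        }
        return start == i ? -1 : Long.parseLong(tail.substring(start, i));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfTailReader <path-to-pdf>");
            return;
        }
        System.out.println("startxref offset: " + findStartXref(readTail(args[0], 1024)));
    }
}
```

A FileInputStream can only deliver bytes front to back, which is why it no longer fits a constructor that expects random access.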