Hot questions for Using PDFBox in html

Question:

Input PDF document with comment

I have a PDF document with a highlight and a comment on the highlight ("my comment") (download).

Desired output

I want to convert the PDF into text in which the commented passage is wrapped in tags, something like this:

ONE TWO THREE    
FOUR <b id="my comment">FIVE</b> SIX SEVEN

Question

Can anyone help me implement the method

private double getDistance(PDAnnotation ann, TextPosition firstProsition) {...}

or the method

private boolean isTextAnnotated()

to determine whether the annotation ann is at the position of the text? If possible, it would also be nice to determine the text position of the comment.

JAVA code

I got lost trying to determine whether an annotation relates to the text currently being processed. I also do not know whether it is possible to identify the exact part of the text that is annotated.

                PDFParser parser = new PDFParser(new FileInputStream(file));
                parser.parse();
                cosDoc = parser.getDocument();

                pdfStripper = new PDFTextStripper()
                {
                    List<PDAnnotation> la;
                    private boolean closeWithEnd;
                    @Override
                    protected void startPage(PDPage page) throws IOException
                    {
                        la = page.getAnnotations(); // init pages
                        startOfLine = true;
                        super.startPage(page);
                    }

                    @Override
                    protected void writeLineSeparator() throws IOException
                    {
                        startOfLine = true;
                        super.writeLineSeparator();
                        if(closeWithEnd) {
                            writeString(" </b> ");
                        }
                    }

                    @Override
                    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
                    {
                        if (startOfLine)
                        {
                            TextPosition firstProsition = textPositions.get(0);
                            PDAnnotation ann;
                            if((ann = isTextAnnotated(firstProsition, text)) != null) {
                                writeString(" <b id='"+ann.getAnnotationName()+"'> ");
                                closeWithEnd = true;
                            } else {
                                closeWithEnd = false;
                            }
                            startOfLine = false;
                        }
                        super.writeString(text+" ", textPositions);
                    }
                    private PDAnnotation isTextAnnotated(TextPosition firstProsition, String text) {
                        for (PDAnnotation ann : la) {
                            System.out.println(text+" ------------- "+getDistance(ann, firstProsition));
                        }
                        return null;
                    }
                    private double getDistance(PDAnnotation ann, TextPosition firstProsition) {
                        // TODO - how to get distance
                        return 0.0;
                    }
                    boolean startOfLine = true;
                };

                pdDoc = new PDDocument(cosDoc);
                pdfStripper.setStartPage(0);
                pdfStripper.setEndPage(pdDoc.getNumberOfPages());
                String parsedText = pdfStripper.getText(pdDoc);

Maven dependencies

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>3.8.1</version>
  <scope>test</scope>
</dependency>

<!-- http://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.10</version>
</dependency>

<!-- http://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
</dependency>

<!-- http://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>


<!-- http://mvnrepository.com/artifact/log4j/log4j -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>

<dependency>
    <groupId>info.debatty</groupId>
    <artifactId>java-string-similarity</artifactId>
    <version>RELEASE</version>
</dependency>

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.6.0</version>
</dependency>


Answer:

You can get the annotation rectangle and check whether it contains two opposite corners of each text position's bounding box. Since a single writeString call receives several characters, check each character individually, because the annotation may cover only a subset of them.

The annotation may also wrap lines, so check at the end of the page (not at the end of each line) whether you still need to close your HTML tag.

Note that the rectangle you get from the annotation is in PDF space (origin at the bottom left), whereas the coordinates you get from the TextPosition are in Java space (origin at the top left). So before calling Rectangle.contains you need to translate the text position coordinates to PDF space.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class MyPDFTextStripper extends PDFTextStripper
{
    public MyPDFTextStripper() throws IOException
    {
        super();
        // TODO Auto-generated constructor stub
    }

    PDPage currentPage;
    List<PDAnnotation> pageAnnotations;
    private boolean needsEndTag;
    boolean startOfLine = true;

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        currentPage = page;
        pageAnnotations = currentPage.getAnnotations();
        super.startPage(page);
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder newText = new StringBuilder();
        PDAnnotation currentAnnot = null;
        for (TextPosition textPosition : textPositions)
        {
            PDAnnotation annotation = getAnnotation(textPosition);
            if (annotation != null)
            {
                if (currentAnnot == null)
                {
                    // if the currentAnnot is null, start a new annotation
                    newText.append("<b id='" + annotation.getAnnotationName() + "'>");
                }
                else if (!currentAnnot.getAnnotationName().equals(annotation.getAnnotationName()))
                {
                    // if the current Annot is different, end it and start a new
                    // one
                    newText.append("</b><b id='" + annotation.getAnnotationName() + "'>");
                }
                // remember this in case the annotation wraps lines
                needsEndTag = true;
                currentAnnot = annotation;
            }
            else if (currentAnnot != null)
            {
                // if no new annotation is associated with the text, but there used to be, close the tag
                newText.append("</b>");
                currentAnnot = null;
                needsEndTag = false;
            }
            newText.append(textPosition.getCharacter());
        }
        super.writeString(newText.toString(), textPositions);
    }

    private PDAnnotation getAnnotation(TextPosition textPosition)
    {
        float textX1 = textPosition.getX();
        // Translate the y coordinate to PDF Space
        float textY1 = currentPage.findMediaBox().getHeight() - textPosition.getY();
        float textX2 = textX1 + textPosition.getWidth();
        float textY2 = textY1 + textPosition.getHeight();

        for (PDAnnotation annotation : pageAnnotations)
        {
            if (annotation.getRectangle().contains(textX1, textY1) && annotation.getRectangle().contains(textX2, textY2))
            {
                return annotation;
            }
        }
        return null;
    }

    @Override
    public String getPageEnd()
    {
        // if the annotation wraps lines and extends to the end of the document, need to add the end tag
        if (needsEndTag)
        {
            return "</b>" + super.getPageEnd();
        }
        return super.getPageEnd();
    }

    public static void main(String[] args) throws Exception
    {
        File file = new File(args[0]);
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        COSDocument cosDoc = parser.getDocument();

        MyPDFTextStripper pdfStripper = new MyPDFTextStripper();

        PDDocument pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(0);
        pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    }
}
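
The coordinate translation used in getAnnotation can be tried out in isolation. The following is a minimal sketch that uses only JDK classes: java.awt.geom.Rectangle2D stands in for the annotation rectangle, and the class name AnnotationHitTest, its method names, and the sample numbers are all hypothetical, not PDFBox API. Note that Rectangle2D.contains treats the right and top edges as exclusive, which may differ from the rectangle class PDFBox itself uses.

```java
import java.awt.geom.Rectangle2D;

/**
 * Sketch of the hit test from the answer, using only JDK geometry classes.
 * All names here are illustrative assumptions; nothing below is PDFBox API.
 */
class AnnotationHitTest {

    /** Converts a y value measured from the top of the page into PDF space (origin at the bottom). */
    static float toPdfY(float pageHeight, float yFromTop) {
        return pageHeight - yFromTop;
    }

    /**
     * Returns true when two opposite corners of a glyph box lie inside the
     * annotation rectangle. The rectangle is given in PDF space; the glyph
     * position is given with y measured from the top of the page.
     */
    static boolean isCovered(Rectangle2D.Float annotation, float pageHeight,
                             float glyphX, float glyphYFromTop,
                             float glyphWidth, float glyphHeight) {
        float x1 = glyphX;
        float y1 = toPdfY(pageHeight, glyphYFromTop); // flip the y axis
        float x2 = x1 + glyphWidth;
        float y2 = y1 + glyphHeight;
        return annotation.contains(x1, y1) && annotation.contains(x2, y2);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // US Letter height in points (sample value)
        // Annotation covering x 100..200 and y 690..710 in PDF space
        Rectangle2D.Float highlight = new Rectangle2D.Float(100f, 690f, 100f, 20f);
        System.out.println(isCovered(highlight, pageHeight, 120f, 98f, 6f, 10f));
        System.out.println(isCovered(highlight, pageHeight, 120f, 300f, 6f, 10f));
    }
}
```

The first glyph sits 98 points below the top edge, which is y = 694 in PDF space and therefore inside the highlight; the second sits far lower on the page and is not covered.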

Question:

I want to print HTML text in my PDF. How can I do that? For example, if my text is <b>Hai</b>, the PDF should show Hai in bold rather than the raw HTML string.


Answer:

This is not available in PDFBox. Use a tool on top of PDFBox like openhtmltopdf.

Question:

I create a PDF from HTML with Jsoup and OpenHTMLToPDF. I have to use a different font in my PDF so that non-Latin glyphs are covered (see here). How can I embed my font correctly?

Simplified program reproducing the issue:

src/main/resources/test.html

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8" />
        <title>Font Test</title>
        <style>
            @font-face {
                font-family: 'source-sans';
                font-style: normal;
                font-weight: 400;
                src: url(fonts/SourceSansPro-Regular.ttf);
            }
        </style>
    </head>
    <body>    
        <p style="font-family: 'source-sans',serif">Latin Script</p>
        <p style="font-family: 'source-sans',serif">Είμαι ελληνικό κείμενο.</p>
    </body>
</html>
  • This file is to be rendered as a PDF.
  • In a browser it renders correctly and uses the Source Sans font.

src/main/java/main.java:

import com.openhtmltopdf.extend.FSSupplier;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

public class main {
    public static void main(String[] args) {
        System.out.println("Starting");

        try {

            final W3CDom w3cDom = new W3CDom();
            final Document w3cDoc = w3cDom.fromJsoup(Jsoup.parse(readFile()));
            final OutputStream outStream = new FileOutputStream("test.pdf");

            final PdfRendererBuilder pdfBuilder = new PdfRendererBuilder();
            pdfBuilder.useFastMode();
            pdfBuilder.withW3cDocument(w3cDoc, "/");
            pdfBuilder.useFont(new File(main.class.getClassLoader().getResource("fonts/SourceSansPro-Regular.ttf").getFile()), "source-sans");
            pdfBuilder.toStream(outStream);

            pdfBuilder.run();
            outStream.close();

        } catch (Exception e) {
            System.out.println("PDF could not be created: " + e.getMessage());
        }

        System.out.println("Finish.");
    }


    private static String readFile() throws IOException {
        final ClassLoader classLoader = main.class.getClassLoader();
        final InputStream inputStream = classLoader.getResourceAsStream("test.html");
        final StringBuilder sb = new StringBuilder();
        final Reader r = new InputStreamReader(Objects.requireNonNull(inputStream), StandardCharsets.UTF_8);
        char[] buf = new char[1024];
        int amt = r.read(buf);
        while(amt > 0) {
            sb.append(buf, 0, amt);
            amt = r.read(buf);
        }
        return sb.toString();
    }
}
  • Ignore the second method; it just reads the HTML file and is included only to make the program complete.

src/main/resources/fonts/SourceSansPro-regular.ttf

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>paf</groupId>
    <artifactId>test</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>7</source>
                    <target>7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-pdfbox</artifactId>
            <version>0.0.1-RC18</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.2</version>
        </dependency>
    </dependencies>
</project>
Program output:
Starting
com.openhtmltopdf.load INFO:: TIME: parse stylesheets  148ms
com.openhtmltopdf.match INFO:: media = print
com.openhtmltopdf.match INFO:: Matcher created with 147 selectors
com.openhtmltopdf.load INFO:: Loading font(source-sans) from InputStream supplier now.
com.openhtmltopdf.exception WARNING:: bad URL given: /fonts/SourceSansPro-Regular.ttf
com.openhtmltopdf.exception WARNING:: Could not load @font-face font: /fonts/SourceSansPro-Regular.ttf
com.openhtmltopdf.exception WARNING:: Font metrics not available. Probably a bug.
com.openhtmltopdf.exception WARNING:: Font metrics not available. Probably a bug.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
com.openhtmltopdf.render WARNING:: Font is null.
Finish.
Resulting PDF
Latin Script
##### ######## #######.
  • Rendered in a serif font.

Edit 1: Made various changes according to the pages linked in the comments and updated to RC18. The output is different now, but the font in the PDF is still not right.


Edit 2: Tried the fast renderer.


Answer:

Okay. Thanks to the comments of @Tilman Hausherr, I asked in the GitHub issue tracker of openhtmltopdf and got some help.

These changes made it work, in case someone landing here is interested:

src/main/java/main.java (changed part only, see rest above):

    public static void main(String[] args) {
        System.out.println("Starting");

        try {

            final W3CDom w3cDom = new W3CDom();
            final Document w3cDoc = w3cDom.fromJsoup(Jsoup.parse(readFile()));
            final OutputStream outStream = new FileOutputStream("test.pdf");

            final PdfRendererBuilder pdfBuilder = new PdfRendererBuilder();
            pdfBuilder.useFastMode();
            pdfBuilder.withW3cDocument(w3cDoc, "/");
            pdfBuilder.useFont(new File(main.class.getClassLoader().getResource("fonts/SourceSansPro-Regular.ttf").getFile()), "source-sans");
            pdfBuilder.toStream(outStream);

            pdfBuilder.run();
            outStream.close();

        } catch (Exception e) {
            System.out.println("PDF could not be created: " + e.getMessage());
        }

        System.out.println("Finish.");
    }

src/main/resources/fonts/SourceSansPro-regular.ttf

from src/main/resources/test.html (changed part only, see rest above)

        @font-face {
            font-family: 'source-sans';
            font-style: normal;
            font-weight: 400;
            src: url(fonts/SourceSansPro-Regular.ttf);
            -fs-font-subset: complete-font;
        }

Question:

While converting a PDF file to HTML using the code below...

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.fit.pdfdom.PDFDomTree;
import org.fit.pdfdom.PDFDomTreeConfig;
import org.fit.pdfdom.resource.HtmlResourceHandler;
import org.fit.pdfdom.resource.SaveResourceToDirHandler;

public class PdfToHtmlConverter {

    public String pdfToHtmlFileWriter(File file, String outputFilePath, String outputFileName) throws InvalidPasswordException, IOException, ParserConfigurationException {
        // load the PDF file using PDFBox
        PDDocument pdf = PDDocument.load(file);
        PDFDomTreeConfig config = PDFDomTreeConfig.createDefaultConfig();
        HtmlResourceHandler fontHandler = new SaveResourceToDirHandler();
        config.setFontHandler(fontHandler);

        HtmlResourceHandler imageHandler = new SaveResourceToDirHandler();
        config.setImageHandler(imageHandler);


        // create the DOM parser
        PDFDomTree parser = new PDFDomTree();
        // parse the file and get the DOM Document
        String outputFile = outputFilePath + File.separator + outputFileName + ".html";
        try (Writer woutput = new PrintWriter(new BufferedWriter(new FileWriter(outputFile)))) {
            parser.writeText(pdf, woutput);
        } catch(Exception e) {
            e.printStackTrace();
        }

        pdf.close();
        return outputFile;
    }
}

And the build.gradle file has the following dependency list...

dependencies {
    compile fileTree(dir: 'lib', include: ['*.jar'])
    compile group: 'org.apache.pdfbox',             name: 'pdfbox',         version: '2.0.6'
    compile group: 'org.apache.pdfbox',             name: 'pdfbox-tools',   version: '2.0.6'
    compile group: 'org.apache.logging.log4j',      name: 'log4j',          version: '2.11.0'
    compile group: 'org.apache.logging.log4j',      name: 'log4j-api',      version: '2.6.1'
    compile group: 'org.apache.logging.log4j',      name: 'log4j-core',     version: '2.6.1'
    compile group: 'javax.mail',                    name: 'mail',           version: '1.4.1'
    compile group: 'org.bouncycastle',              name: 'bcmail-jdk15',   version: '1.46' 
    compile group: 'org.bouncycastle',              name: 'bcprov-jdk15on', version: '1.47'
    compile group: 'net.sf.ehcache',                name: 'ehcache-core',   version: '2.4.6'
    compile group: 'com.google.guava',              name: 'guava',          version: '11.0.2'
    compile group: 'redis.clients',                 name: 'jedis',          version: '2.9.0'
    compile group: 'org.apache.poi',                name: 'poi-ooxml',      version: '3.17'
    compile group: 'org.apache.poi',                name: 'poi',            version: '3.17'
    compile group: 'net.sf.cssbox',                 name: 'pdf2dom',        version: '1.7'
    compile group: 'com.levigo.jbig2',              name: 'levigo-jbig2-imageio', version: '1.6.5'

    compile 'com.google.code.gson:gson:2.8.2'
    compile 'org.json:json:20180130'
}

Aw Snap! Got the following message from JDK...

[org.glassfish.jersey.server.ContainerException: java.util.ServiceConfigurationError: com.levigo.jbig2.util.log.LoggerBridge: Provider com.levigo.jbig2.util.log.JDKLoggerBridge not a subtype] with root cause
java.util.ServiceConfigurationError: com.levigo.jbig2.util.log.LoggerBridge: Provider com.levigo.jbig2.util.log.JDKLoggerBridge not a subtype
    at java.util.ServiceLoader.fail(Unknown Source)
    at java.util.ServiceLoader.access$300(Unknown Source)
    at java.util.ServiceLoader$LazyIterator.nextService(Unknown Source)
    at java.util.ServiceLoader$LazyIterator.next(Unknown Source)
    at java.util.ServiceLoader$1.next(Unknown Source)
    at com.levigo.jbig2.util.log.LoggerFactory.getLogger(LoggerFactory.java:42)
    at com.levigo.jbig2.util.log.LoggerFactory.getLogger(LoggerFactory.java:48)
    at com.levigo.jbig2.JBIG2ImageReader.<clinit>(JBIG2ImageReader.java:45)
    at com.levigo.jbig2.JBIG2ImageReaderSpi.createReaderInstance(JBIG2ImageReaderSpi.java:116)
    at javax.imageio.spi.ImageReaderSpi.createReaderInstance(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:133)
    at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:54)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:167)
    at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:125)
    at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70)
    at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:409)
    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:397)
    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
    at com.pype.html.converter.PdfToHtmlConverter.pdfToHtmlFileWriter(PdfToHtmlConverter.java:91)
    at com.pype.drawings.slicing.VerticalSlicer.convertCompleteSinglePagePdftoHtml(VerticalSlicer.java:540)
    at com.pype.drawings.slicing.VerticalSlicer.convertCompletePdfPageToHtml(VerticalSlicer.java:104)
    at com.pype.pdf.schedules.extractor.ExtractSchedules.generateHtmlFiles(ExtractSchedules.java:344)
    at com.pype.pdf.schedules.extractor.ExtractSchedules.getIdentifiedSchedulesUsingElements(ExtractSchedules.java:218)
    at com.pype.solr.rest.api.ExtractPDFDrawing.processUploadedPDFFile(ExtractPDFDrawing.java:511)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:502)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:651)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:501)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:754)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1376)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Unknown Source)

I searched for more information about this error but found no clue. If anybody has an idea, please share some suggestions.

Thanks


Answer:

Please update to the latest version of the jbig2 decoder, which is 3.0.2. The jbig2 decoder is now a part of Apache PDFBox, thanks to levigo solutions GmbH. For maven, use this:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.2</version>
    </dependency>

Or use the direct download.

Question:

I have been using PDFBox and EasyTable, which extends PDFBox, to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that needs to be added to the PDF using PDFBox. Digging through the documentation has not borne fruit.

The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.

// Create a document and add a page to it
        PDDocument document = new PDDocument();
        PDPage page = new PDPage();
        document.addPage( page );

// Create a new font object selecting one of the PDF base fonts
        PDFont font = PDType1Font.HELVETICA_BOLD;

// Start a new content stream which will "hold" the to be created content
        PDPageContentStream contentStream = new PDPageContentStream(document, page);

// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
        contentStream.beginText();
        contentStream.setFont( font, 12 );
        contentStream.moveTextPositionByAmount( 100, 700 );
        contentStream.drawString( "<h1>HelloWorld</h1>" );
        contentStream.endText();

// Make sure that the content stream is closed:
        contentStream.close();

// Save the results and ensure that the document is properly closed:
        document.save( "Hello World.pdf");
        document.close();

    }

Answer:

PDFBox does not know HTML, at least not for creating content.

Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.

E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the fact that it is in an h1 tag to select an appropriate top-level heading font and font size for drawing that "HelloWorld".

Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
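
To make the "parse it yourself" route more concrete, here is a minimal sketch, assuming a tiny HTML subset of non-nested h1, h2, b and p elements. It uses only the JDK; the class TinyHtmlStyler, the StyledRun type, and the chosen font sizes are hypothetical, and the output is only a list of text runs with drawing attributes that a PDFBox drawing loop might consume. Real-world HTML needs a proper parser or a library such as Open HTML to PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: turn a very small HTML subset into "styled runs".
 * Names and font sizes are illustrative assumptions, not PDFBox API.
 */
class TinyHtmlStyler {

    /** One piece of text plus the drawing attributes derived from its tag. */
    static final class StyledRun {
        final String text;
        final float fontSize;
        final boolean bold;

        StyledRun(String text, float fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    // Matches <tag>text</tag> for a few non-nested tags only.
    private static final Pattern ELEMENT = Pattern.compile(
            "<(h1|h2|b|p)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<StyledRun> parse(String html) {
        List<StyledRun> runs = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            String tag = m.group(1).toLowerCase(Locale.ROOT);
            String text = m.group(2).trim();
            switch (tag) {
                case "h1": runs.add(new StyledRun(text, 24f, true)); break;
                case "h2": runs.add(new StyledRun(text, 18f, true)); break;
                case "b":  runs.add(new StyledRun(text, 12f, true)); break;
                default:   runs.add(new StyledRun(text, 12f, false)); break;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        for (StyledRun run : parse("<h1>HelloWorld</h1><p>Body text</p>")) {
            System.out.println(run.text + " size=" + run.fontSize + " bold=" + run.bold);
        }
    }
}
```

Each run would then be drawn with a bold or regular font at the given size, in place of passing the raw "<h1>HelloWorld</h1>" string to the content stream.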

Question:

I'm trying to fetch a remote PDF file and perform some actions on it. For this, I use PDFBox. I can't get hold of the tools package in order to import ImageIOUtil and PDFText2HTML.

I added pdfbox 2.0.6 as a Maven dependency. After searching the web, I also tried changing it to 2.0.7 and 2.0.5.

I used these imports:

import org.apache.pdfbox.tools.PDFText2HTML;

import org.apache.pdfbox.tools.imageio.ImageIOUtil;

Which are specified in the Javadoc here: https://pdfbox.apache.org/docs/2.0.5/javadocs/org/apache/pdfbox/tools/imageio/ImageIOUtil.html https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/tools/PDFText2HTML.html

But I get 'Cannot resolve symbol "tools"'

my pom:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.7</version>
</dependency>
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>fontbox</artifactId>
  <version>2.0.7</version>
</dependency>

my class imports:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;
import org.apache.pdfbox.tools.PDFText2HTML;
import org.apache.pdfbox.text.PDFTextStripper;

Answer:

Use pdfbox-tools:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox-tools</artifactId>
  <version>2.0.15</version>
</dependency>

And update all (also pdfbox and fontbox) to the current version, which is 2.0.15. Lots of bugs fixed (including a few security issues) and improvements made.

Question:

I am reading from a PDF using PDFBox and, at least on Windows, the line break appears as the character sequence &#13;&#10; (carriage return plus line feed).

How can I prevent these line-break characters from being concatenated to the string in the code below?

tokenizer =new StringTokenizer(Text,"\\.");
while(tokenizer.hasMoreTokens())
{
    String x= tokenizer.nextToken();
    flag=0;
    for(final String s :x.split(" ")) {
       if(flag==1)
          break;
       if(Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
          sum+=x+"."; //here need first to check for "&#13;&#10"
                      // before concatenating the String "x" to String "sum"
          flag=1;
       }
   }
}

Answer:

You should discard the line separators when you split; e.g.

for (final String s : x.split("\\s+")) {

That makes the word separator one or more whitespace characters, which includes \r and \n.

(Using trim() won't work in all cases. Suppose that x contains "word\r\nword". You won't split between the two words, and s will be "word\r\nword" at some point. Then s.trim() won't remove the line break characters because they are not at the ends of the string.)


UPDATE

I just spotted that you are actually appending x not s. So you also need to do something like this:

sum += x.replaceAll("\\s+", " ") + ".";

That does a bit more than you asked for. It replaces each whitespace sequence with a single space.


By the way, your code would be simpler and more efficient if you used a break to get out of the loop rather than messing around with a flag. (And Java has a boolean type ... for heaven's sake!)

   if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
       sum += ....
       break;
   }
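
Putting the pieces above together, here is a minimal sketch of the whole loop. It is not the asker's exact code: the class and method names (KeywordSentences, collect) are made up for illustration, it uses String.split instead of StringTokenizer, and it additionally trims each sentence before appending it.

```java
import java.util.regex.Pattern;

public class KeywordSentences {
    // One or more whitespace characters, including \r and \n.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns every sentence of text that contains keyword, with line breaks collapsed to single spaces. */
    static String collect(String text, String keyword) {
        StringBuilder sum = new StringBuilder();
        // Split on literal periods (split takes a regex, so the dot is escaped).
        for (String sentence : text.split("\\.")) {
            String normalized = WHITESPACE.matcher(sentence).replaceAll(" ").trim();
            for (String word : normalized.split(" ")) {
                if (!word.isEmpty() && word.equalsIgnoreCase(keyword)) {
                    sum.append(normalized).append('.');
                    break; // one match per sentence is enough
                }
            }
        }
        return sum.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("one two\r\nthree. four five. three\r\nsix.", "three"));
    }
}
```

Normalizing the sentence once, before looking for the keyword, means the same cleaned-up string is used both for matching and for appending, so no line break can slip through.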

Question:

I have this code:

private void convert(ByteArrayInputStream byteArrayInputStream) {
        try {
            PDDocument pdf = PDDocument.load(byteArrayInputStream);
            PDFDomTree parser = new PDFDomTree();
            Document dom = parser.createDOM(pdf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

It throws a NoSuchMethodError at this line:

Document dom = parser.createDOM(pdf);

Error :

NoSuchMethodError: com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;

My jar list :...

activation-1.1
animal-sniffer-annotations-1.17
animal-sniffer-annotations-1.18
ant-1.6.5
antlr-2.7.7
antlr4-runtime-4.5.3
aopalliance-1.0
asm-5.0.3
asm-commons-5.0.3
asm-tree-5.0.3
asm-util-5.0.3
atmosphere-runtime-2.4.24.vaadin1
autocomplete-0.2.4
bcprov-jdk15-1.45
calendar-component-2.0.1
cas-client-core-3.3.3
checker-qual-2.5.2
checker-qual-2.8.1
classmate-1.3.0
colt-1.2.0
commons-codec-1.6
commons-collections-3.1
commons-collections-3.2.2
commons-collections4-4.1
commons-httpclient-3.1
commons-io-2.5
commons-lang-2.1
commons-lang3-3.0
commons-logging-1.1.3
commons-net-3.6
core-2.4.0
core-3.3.0
cssbox-4.15
cssparser-0.9.18
curvesapi-1.04
dom4j-1.6.1
ehcache-2.10.3
EnlilWebClient-1
error_prone_annotations-2.2.0
error_prone_annotations-2.3.2
esapi-2.0GA
exporter-2.0.0
failureaccess-1.0.1
failureaccess-1.0
FastInfoset-1.2.13
flute-1.3.0.gg2
fontbox-2.0.4
FontVerter-1.2.22
gelfclient-1.4.0
gentyref-1.2.0.vaadin1
gmbal-3.1.0-b001
gmbal-api-only-3.1.0-b001
gson-2.8.1
guava-28.1-jre
gwt-dev-2.8.2
gwt-elemental-2.8.2
gwt-user-2.8.2
ha-api-3.1.9
hamcrest-core-1.3
hibernate-commons-annotations-5.0.1.Final
hibernate-core-5.2.12.Final
hibernate-ehcache-5.2.12.Final
hibernate-jpa-2.1-api-1.0.0.Final
hibernate-validator-5.1.0.CR1
htmlunit-2.19
htmlunit-core-js-2.17
httpclient-4.3.1
httpcore-4.3
httpmime-4.5.1
icu4j-50.1.1
inputmask-1.0.2
istack-commons-runtime-2.19
itextpdf-5.5.10
j2objc-annotations-1.1
j2objc-annotations-1.3
jackson-annotations-2.5.0
jackson-core-2.4.1
jackson-databind-2.5.2
jackson-datatype-hibernate4-2.4.1
jai-imageio-core-1.3.1
jandex-2.0.3.Final
javaee-api-8.0
javaee-web-api-8.0
javase-2.4.0
javase-3.3.0
javassist-3.20.0-GA
javax.annotation-api-1.2
javax.mail-1.6.0
javax.servlet-api-3.1.0
javax.xml.rpc-api-1.1
javax.xml.soap-api-1.3.7
jaxb1-impl-2.2.4-1
jaxb-api-2.2.12-b140109.1041
jaxb-core-2.2.10-b140802.1033
jaxb-impl-2.2.10-b140802.1033
jaxrpc-api-1.1
jaxrpc-impl-1.1.3_01
jaxrpc-spi-1.1.3_01
jaxws-api-2.2.11
jaxws-rt-2.2.10
jboss-logging-3.3.0.Final
jboss-transaction-api_1.2_spec-1.0.1.Final
jcip-annotations-1.0
jcommander-1.48
jcommon-1.0.23
jersey-bundle-1.19
jersey-core-1.19
jersey-multipart-1.19
jersey-server-1.19
jetty-continuation-9.4.8.v20171121
jfreechart-1.0.19
jmac-1.0-rev-1
joda-time-1.6.2
jsinterop-annotations-1.0.2
jsinterop-annotations-1.0.2-sources
json-simple-1.1.1
jsoup-1.11.2
jsr105-api-1.0.1
jsr105-impl-1.0.2
jsr181-api-1.0-MR1
jsr305-1.3.9
jsr305-3.0.2
jsr311-api-1.1.1
jstyleparser-3.3
jul-to-slf4j-1.6.1
junit-4.11
kerberos-wss-extension-1.0
log4j-1.2.17
log4j-over-slf4j-1.6.1
mail-1.4.5
management-api-3.2.1-b001
messagebox-4.0.21
metro-cm-api-2.3.1
metro-commons-2.3.1
metro-config-api-2.3.1
metro-runtime-api-2.3.1
mimepull-1.9.4
nekohtml-1.9.22
netty-all-4.0.29.Final
not-yet-commons-ssl-0.3.9
ojdbc-8
opensaml-2.5.1-1
openws-1.4.2-1
PasswordEncoder-0.0.1
pdf2dom-1.8
pdfbox-2.0.4
poi-3.16
poi-ooxml-3.16
poi-ooxml-schemas-3.16
policy-2.4
popupbutton-3.0.0
popupextension-1.0.1
product-tour-0.5
reflections-0.9.11
resolver-20050927
saaj-impl-1.3.25
sac-1.3
saml-jaxb10-bindings-1.0
serializer-2.7.1
slf4j-api-1.7.1
slf4j-simple-1.7.25
soaptcp-api-2.3.1
spring-aop-4.3.9.RELEASE
spring-beans-4.3.9.RELEASE
spring-context-4.3.9.RELEASE
spring-core-4.3.9.RELEASE
spring-expression-4.3.9.RELEASE
spring-security-cas-4.2.3.RELEASE
spring-security-config-4.2.3.RELEASE
spring-security-core-4.2.3.RELEASE
spring-security-web-4.2.3.RELEASE
spring-web-4.3.9.RELEASE
stax2-api-3.1.1
stax-api-1.0.1
stax-ex-1.7.7
streambuffer-1.5.3
tapestry-4.0.2
timerextension-0.2.1
unbescape-1.1.6.RELEASE
vaadin-autocomplete-1.1.1
vaadin-charts-4.0.0
vaadin-charts-model-4.0.0
vaadin-client-8.5.1
vaadin-client-compiled-8.5.1
vaadin-client-compiler-8.5.1
vaadin-combobox-multiselect-2.6
vaadin-context-menu-2.1.0
vaadin-onoffswitch-1.1.0
vaadin-push-8.5.1
vaadin-sass-compiler-0.9.13
vaadin-scrollable-panel-2.0
vaadin-server-8.5.1
vaadin-shared-8.5.1
vaadin-slf4j-jdk14-1.6.1
vaadin-sliderpanel-2.2.0
vaadin-spring-3.1.0
vaadin-themes-8.5.1
validation-api-1.0.0.GA-sources
validation-api-1.1.0.Final
velocity-1.5
webservices-api-2.3.1
webservices-rt-2.3.1
webservices-tools-2.3.1
websocket-api-9.2.13.v20150730
websocket-client-9.2.13.v20150730
websocket-common-9.2.13.v20150730
woodstox-core-asl-4.2.0
wsit-api-2.3.1
wsit-impl-2.3.1
wsmc-api-2.3.1
wsrm-api-2.3.1
wssx-api-2.3.1
wstx-api-2.3.1
xalan-2.7.1
xercesImpl-2.10.0
xercesImpl-2.12.0
xml-apis-1.4.01
xmlbeans-2.6.0
xml-resolver-1.2
xmlsec-1.4.4
xmltooling-1.3.2-1

I have a lot of libraries, but I don't know which of them might be in conflict. My project is large. I hope someone recognizes this problem. Thank you.


Answer:

As far as I can tell, when Maven finds dependencies on the same library in different versions, it keeps only one of them (the one declared nearest to your project in the dependency tree, or the first one declared on a tie) and ignores the others. In your case, you probably declared a dependency on Guava 28.1 explicitly. However, pdfbox requires a different version, which you can't see among your libraries because it is ignored. I suggest that you:

  1. Remove as many dependencies from your project as you can, leaving only the dependency on pdfbox.
  2. Check which Guava version it needs. The command mvn dependency:tree -Dverbose might help you.
  3. Restore your dependencies to their original state and set the Guava dependency to that version.
  4. Pray that no other library needs yet another version of Guava.

Edit: You might find this thread useful. It discusses how the other versions of the same jar are ignored.

maven dependency plugin ignores dependency versions?

Edit 2: In my case, Guava 15.0 was needed.
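
When the build is not a clean Maven build (for example a folder of jars like the list in the question), it can also help to ask the running JVM which jar a class was actually loaded from. The following is only a diagnostic sketch using standard JDK calls; the class name WhichJar and the helper locationOf are made up for illustration.

```java
import java.net.URL;
import java.security.CodeSource;

public class WhichJar {
    /** Hypothetical helper: the jar or directory a class was loaded from, or null if unknown (e.g. JDK classes). */
    static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) {
        // Defaults to the class named in the NoSuchMethodError; pass another class name as the first argument.
        String name = args.length > 0 ? args[0] : "com.google.common.collect.Sets";
        try {
            System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

Run it with the same classpath as the application. If the printed jar is not the Guava version you expected, another copy of the library is shadowing it.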

Question:

We're trying to add bookmark or table of contents metadata to a PDF that is generated from HTML. How do you signal PDFBox/OpenHTMLtoPDF to create the bookmarks/TOC?

<div class="bkmrk0">Header One</div>
<div class="bkmrk1">Header Two for List</div>
...

<div class="bkmrk2">Header Three Text</div>
...
Bookmark example on left, TOC on right.


Answer:

To create bookmarks, add a bookmarks element to the head of the HTML document:

<bookmarks>
 <bookmark name="Font Support" href="#fonts-feature-group"/>
 <bookmark name="RTL &amp; BIDI Text Support" href="#rtl-feature-group"/>
 <bookmark name="Forms Support" href="#forms-feature-group"/>
 <bookmark name="List Support" href="#lists-feature-group"/>
 <bookmark name="Z-Index Support" href="#z-index-feature-group"/>
 <bookmark name="SVG Support (Experimental)" href="#svg-feature-group"/>
</bookmarks>

The href refers to the id of the element that the bookmark should link to, so each target element needs an id attribute.

https://github.com/danfickle/openhtmltopdf/blob/14aef95364684fe3c7b7207bcb1246e5c3af0335/openhtmltopdf-examples/src/main/resources/visualtest/html/bookmark-head-nested.html
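
If the HTML is generated by code, the bookmarks element can be generated along with it. Below is a minimal sketch that only builds the markup as a string; it does not call OpenHTMLtoPDF itself. The names BookmarkBlock, build and escape, and the ids in main, are hypothetical, and the headers in the question would need matching id attributes (the question only shows class attributes).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BookmarkBlock {
    /** Escapes the characters that are special inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds a bookmarks element from (element id -> bookmark title) pairs, in iteration order. */
    static String build(Map<String, String> titlesById) {
        StringBuilder sb = new StringBuilder("<bookmarks>\n");
        for (Map.Entry<String, String> e : titlesById.entrySet()) {
            sb.append(" <bookmark name=\"").append(escape(e.getValue()))
              .append("\" href=\"#").append(escape(e.getKey())).append("\"/>\n");
        }
        return sb.append("</bookmarks>").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the bookmarks in document order.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("header-one", "Header One");          // hypothetical id
        headers.put("header-two", "Header Two for List"); // hypothetical id
        System.out.println(build(headers));
    }
}
```

Escaping the title matters because bookmark names such as "RTL & BIDI Text Support" would otherwise produce invalid XHTML.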

Question:

public class ExtractText {

/**
 * private constructor.
*/
private ExtractText()
{
    //static class
}


public static void main( String[] args ) throws Exception
{


     if(l!=null)
     {
         System.out.println("HERE"+l.length);
         deleteSubs(op);
         System.out.println("Then"+l.length);
     }
     else
     {
         System.out.println("WHERE");
     }

    File y=new File(imgDes);

    if(!y.exists())
    {
        y.mkdirs();
    }
   File z=new File(imgDestination);

   if(!z.exists())
   {
       z.mkdirs();
   }
  File fr=new File(outputFile);

  if(!fr.isDirectory())
  {
      fr.delete();
  }
    // Defaults to text files
    String ext = ".txt";
    int startPage = 1;
    int endPage = Integer.MAX_VALUE;
     Writer output = null;
     PDDocument document =null;       
    try
    {
        try
        {
            URL url = new URL( pdfFile );

            document = PDDocument.load(url, force);

            String fileName = url.getFile();
            if( outputFile == null && fileName.length() >4)
            {
                outputFile = new File( fileName.substring( 0, fileName.length() -4 ) + ext ).getName();
            }
        }
        catch( MalformedURLException e)
        {
            document = PDDocument.load(pdfFile, force);

            if( outputFile == null && pdfFile.length() >4 )
            {
                outputFile = pdfFile.substring( 0, pdfFile.length() -4 ) + ext;
            }
        }

            //document.print();
        if( document.isEncrypted() )
        {
            StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
            document.openProtection(sdm);
            AccessPermission ap = document.getCurrentAccessPermission();
            if( ! ap.canExtractContent() )
            {
                throw new IOException("You do not have permission to extract text" );
            }
        }

        if ((encoding == null) && (toHTML))
        {
            encoding = "UTF-8";
        }
        if( toConsole )
        {
            output = new OutputStreamWriter(System.out);                                                  
        }
        else
        {
          if( encoding != null )
           {
                output = new OutputStreamWriter(new FileOutputStream( outputFile ), encoding );
           }
         else
            {
                    //use default encoding
                output = new OutputStreamWriter(new FileOutputStream( outputFile ) );
            }
        }

        PDFTextStripper4 stripper = null;

        if(toHTML)
        {
            stripper = new PDFText2HTML(encoding);
        }

        else
        {
            stripper = new PDFTextStripper4(encoding);
        }
        File f= new File(imgDestination);
        PDDocument pd;

        int i=0;
        if(f.exists())
        {
            pd=PDDocument.load(pdfFile);
            PDFontDescriptor fd;
            fd = new PDFontDescriptorDictionary();

            List<PDPage> li=pd.getDocumentCatalog().getAllPages();
            for(PDPage page:li)
            {
              PDResources pdr=page.getResources();

              Map<String, PDFont> m=pdr.getFonts();
              PDStream pst;
            for(PDFont pdd:m.values())
            {
                   System.out.println("----------"+pdd.getBaseFont());
                   pdd.getFontDescriptor();
                    fd = pdd.getFontDescriptor();

                   pdd.setFontDescriptor((PDFontDescriptorDictionary)fd);
                   System.out.println("tititititi"+pdd.getFontEncoding());
                   if(pdd.isType1Font())
                   {
                    pst=((PDFontDescriptorDictionary) fd).getFontFile3();
                    System.out.println("In If "+pst);
                   if(pst!= null)
                   {
                       FileOutputStream fos = new FileOutputStream(new File(imgDestination+pdd.getBaseFont().toString()+".pfb"));
                       IOUtils.copy(pst.createInputStream(), fos);
                       i++;
                       System.out.println(i);
                       fos.close();
                    }
                   }
                   else 
                       if(pdd.isTrueTypeFont())
                       {
                           pst= ((PDFontDescriptorDictionary) fd).getFontFile2();
                           System.out.println("In Else-if"+pst);
                           if (pst!= null)
                           {
                               FileOutputStream fos = new FileOutputStream(new File(imgDestination+pdd.getBaseFont().toString()+".ttf"));
                               IOUtils.copy(pst.createInputStream(), fos);
                               i++;
                               System.out.println(i);
                               fos.close();
                           }
                       }
                       else
                           if(pdd.isSymbolicFont())
                           {
                               System.out.println("Symbol.......");
                           }
                   else
                   {

                       System.out.println("In Else");



                   }
               }

            }

        int pageCount = document.getDocumentCatalog().getAllPages().size();
        for (int p = 0; p < pageCount; ++p)
        {
            System.out.println("I am in for loop");
             stripper.setForceParsing( force );
             stripper.setSortByPosition( true );
             stripper.setShouldSeparateByBeads(separateBeads);
            stripper.setStartPage( p);
            stripper.setEndPage( p);
            stripper.writeText( document, output );
            FileOutputStream fos = new FileOutputStream(new File(f5+(p+1)+".html"));
            output.close();


        }

        PDDocumentInformation info = document.getDocumentInformation();
        System.out.println( "Page Count=" + document.getNumberOfPages());
        System.out.println( "Title=" + info.getTitle());
        System.out.println( "Author=" + info.getAuthor());
        System.out.println( "Subject=" + info.getSubject() );
        System.out.println( "Keywords=" + info.getKeywords() );
        System.out.println( "Creator=" + info.getCreator() );
        System.out.println( "Producer=" + info.getProducer() );
        System.out.println( "Creation Date=" + info.getCreationDate() );
        System.out.println( "Modification Date=" + info.getModificationDate());
        System.out.println( "Trapped=" + info.getTrapped());


   }
    }catch(Exception e)
     {
        e.printStackTrace();
     }
    finally
    {
        if( output != null)
        {
            output.close();
        }
        if( document != null )
        {
            document.close();
        }
    }
}


private static void deleteSubs(File op) 
{
    // TODO Auto-generated method stub
     File[] files = op.listFiles();
     System.out.print("In delete folder");
        if(files!=null) 
        {
            //some JVMs return null for empty dirs
            for(File f: files) 
            {
                if(f.isDirectory()) 
                {
                    deleteSubs(f);
                } 
                else 
                {
                    f.delete();
                }
            }
        }
        op.delete();
}

}

Now I am able to convert the entire PDF into one HTML file (I am extracting text only, not images), but I want each page of the PDF in its own HTML file. Any solution for this would be very helpful. Thank you.


Answer:

The answer is in your question: just set

    stripper.setStartPage( p );
    stripper.setEndPage( p );

accordingly. Note that these page numbers are 1-based and that writeText expects a Writer. So you would loop somewhat like this:

int pageCount = document.getDocumentCatalog().getAllPages().size();
for (int p = 1; p <= pageCount; ++p)
{
    //... your options
    stripper.setStartPage(p);
    stripper.setEndPage(p);
    // one output file per page: f5 + "1.html", f5 + "2.html", ...
    Writer out = new OutputStreamWriter(new FileOutputStream(new File(f5+p+".html")), "UTF-8");
    stripper.writeText(document, out);
    out.close();
}

Btw if you get an exception relating to the sorting comparator, use setSortByPosition(false), or wait for version 1.8.8 where this problem is fixed.