Hot questions for Using PDFBox in text

Question:

I am trying to extract the hyperlink information from a PDF using PDFBox but I am unsure how to get

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

I want to extract the url of the hyperlink destination and the text of the hyperlink. How can one do this?

Thanks


Answer:

Use this code from the PrintURLs sample code from the source code download:

for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

It works in two parts: getting the URL, which is easy, and getting the link text, which is done by extracting the text inside the annotation's rectangle.
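
The only tricky step is the rectangle conversion: the annotation rectangle is in PDF user space (origin at the bottom left, y pointing up), while the region passed to addRegion has its origin at the top left. As a minimal sketch of that step alone, in plain Java without PDFBox types (the class LinkRegions and its method are made up for illustration, and an unrotated page whose media box starts at the origin is assumed):

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class LinkRegions {
    // Converts an annotation rectangle (bottom-left origin, y up) into the
    // top-left-origin rectangle used for text extraction regions.
    static Rectangle2D.Float toRegion(float lowerLeftX, float upperRightY,
                                      float width, float height, float pageHeight) {
        // The top edge of the link, measured downwards from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // A link whose top edge is at y=700 on a 792 pt high page.
        Rectangle2D.Float r = toRegion(100f, 700f, 50f, 12f, 792f);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
    }
}
```

In the sample above the inputs would come from link.getRectangle() and page.getMediaBox().getHeight().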

Question:

I've just started using Apache PDFBox and been experimenting with various examples I've found.

However, I haven't been able to find an easy way to move to the next line when adding text.

E.g.

PDPageContentStream content = new PDPageContentStream(document, page);
PDFont font = PDType1Font.HELVETICA;
content.beginText();
content.setFont(font, 12);
content.moveTextPositionByAmount(x, y);
content.drawString("Some text.");
content.endText();

To add another line of text underneath I had to repeatedly experiment with the value of y in moveTextPositionByAmount until it wasn't overwriting the previous line.

Is there a more intuitive way to figure out what the coordinates of the next line are?

TIA


Answer:

The PDFBox API allows low-level content generation. This means that you have to do (but also that you are able to do) much of the layout work yourself, including deciding how far to move down to get to the next baseline.

That distance (called leading in this context) depends on a number of factors:

  • the font size used (obviously)
  • how tightly or loosely spaced the text shall appear
  • the presence of elements on the lines involved positioned outside the regular line, e.g. superscripts, subscripts, formulas, ...

The standard is arranged so that the nominal height of tightly spaced lines of text is 1 unit for a font drawn at size 1. Thus, usually you will use a leading of 1..1.5 times the font size unless there is material on the line reaching beyond it.
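
To make the arithmetic concrete, here is a small plain-Java sketch (the class name Leading and the factor parameter are my own, not PDFBox API) that computes the baseline of each line from the first baseline, the font size and a leading factor in the 1..1.5 range mentioned above:

```java
// Hypothetical helper, not part of PDFBox.
final class Leading {
    // Baseline y of the line with the given 0-based index. PDF y coordinates
    // grow upwards, so each further line has a smaller y.
    static float baseline(float firstBaselineY, float fontSize, float factor, int lineIndex) {
        float leading = fontSize * factor; // distance between two baselines
        return firstBaselineY - lineIndex * leading;
    }

    public static void main(String[] args) {
        // 12 pt font with a leading factor of 1.25, first baseline at y=700.
        for (int i = 0; i < 3; i++) {
            System.out.println("line " + i + " at y=" + baseline(700f, 12f, 1.25f, i));
        }
    }
}
```

Each computed y can then be used as the y position of a separate text block, or the leading itself as the negative y offset between consecutive lines.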

BTW, if you have to advance to the next line by the same amount very often, you can use the combination of the PDPageContentStream methods setLeading and newLine instead of moveTextPositionByAmount:

content.setFont(font, 12);
content.setLeading(14.5f);
content.moveTextPositionByAmount(x, y);
content.drawString("Some text.");
content.newLine();
content.drawString("Some more text.");
content.newLine();
content.drawString("Still some more text.");

PS: It looks like moveTextPositionByAmount will be deprecated in the 2.0.0 version and be replaced by newLineAtOffset.

PPS: As the OP indicates in a comment,

There is no PDPageContentStream method called setLeading. I'm using PDFBox version 1.8.8.

Indeed, I was looking at the current 2.0.0-SNAPSHOT development version. There, the two methods are currently implemented like this:

/**
 * Sets the text leading.
 *
 * @param leading The leading in unscaled text units.
 * @throws IOException If there is an error writing to the stream.
 */
public void setLeading(double leading) throws IOException
{
    writeOperand((float) leading);
    writeOperator("TL");
}

/**
 * Move to the start of the next line of text. Requires the leading to have been set.
 *
 * @throws IOException If there is an error writing to the stream.
 */
public void newLine() throws IOException
{
    if (!inTextMode)
    {
        throw new IllegalStateException("Must call beginText() before newLine()");
    }
    writeOperator("T*");
}

One can easily implement external helper methods doing the equivalent using appendRawCommands((float) leading); appendRawCommands(" TL"); and appendRawCommands("T*");
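
A minimal sketch of such helpers, assuming you pass the resulting strings to appendRawCommands(String) on a 1.8.x content stream (the class RawTextOps and its method names are hypothetical; only the TL and T* operators come from the code above):

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: builds the raw operator strings.
final class RawTextOps {
    // "<leading> TL" sets the text leading. Locale.ROOT guarantees a '.' as
    // decimal separator, which PDF syntax requires.
    static String setLeading(float leading) {
        return String.format(Locale.ROOT, "%.2f TL\n", leading);
    }

    // "T*" moves to the start of the next line, using the current leading.
    static String newLine() {
        return "T*\n";
    }

    public static void main(String[] args) {
        System.out.print(setLeading(14.5f));
        System.out.print(newLine());
    }
}
```

Usage would then be along the lines of content.appendRawCommands(RawTextOps.setLeading(14.5f)); between beginText() and endText().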

Question:

I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line. I can't find anything related to how to get that information though. I know pdfbox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a pdf?


Answer:

In general

To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

(There are a number of PDFTextStripper attributes allowing you to restrict the pages text is extracted from.)

In the course of the execution of getText the content streams of the pages in question (and those of form XObjects referenced from those pages) are parsed and text drawing commands are processed.

If you want to change the text extraction behavior, you have to change this text drawing command processing, which you should most often do by overriding this method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

If you additionally need to know when a new line starts, you may also want to override

/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator( ) throws IOException
{
    output.write(getLineSeparator());
}

writeString can be overridden to channel the text information into separate members (e.g. if you might want a result in a more structured format than a mere String) or it can be overridden to simply add some extra information into the result String.

writeLineSeparator can be overridden to trigger some specific output between lines.

There are more methods which can be overridden but you are less likely to need them in general.


In the case at hand

I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line.

This can be implemented as follows (simply adding the information at the start of each line):

PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void startPage(PDPage page) throws IOException
    {
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine)
        {
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
    boolean startOfLine = true;
};

text = stripper.getText(document);

(ExtractText.java method extractLineStart tested by testExtractLineStartFromSampleFile)
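
If you go with the "add the information to the result String" variant, you eventually have to split the x value off again. A rough sketch of that step in plain Java (LineStartParser and LineStart are names I made up; it assumes each line of the result starts with the [x] prefix written above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class LineStartParser {
    // "[<x>]" at the very start of the line, followed by the line's text.
    private static final Pattern PREFIXED = Pattern.compile("^\\[([^\\]]+)\\](.*)$");

    static final class LineStart {
        final float x;
        final String text;

        LineStart(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // Returns null when the line does not carry a parsable "[x]" prefix.
    static LineStart parse(String line) {
        Matcher m = PREFIXED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        try {
            return new LineStart(Float.parseFloat(m.group(1)), m.group(2));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        LineStart ls = parse("[72.0]Some text.");
        System.out.println(ls.x + " -> " + ls.text);
    }
}
```

Collecting the positions in a member of the stripper subclass avoids this round trip altogether, so treat the parser as a convenience for quick experiments.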

Question:

I'm trying to add underlined text to a blank pdf page using PDFBox, but I haven't been able to find any examples online. All questions on stackoverflow point to extracting underlined text, but not creating it. Has this function not been implemented for PDFBox? Looking at the PDFBox documentation, it seems that fonts are pre-rendered as bold, italic, and regular.

For example, Times New Roman Regular is denoted as:

PDFont font = PDType1Font.TIMES_ROMAN. 

Times New Roman Bold is denoted as:

PDFont font = PDType1Font.TIMES_BOLD

Italicized is denoted as:

PDFont font = PDType1Font.TIMES_ITALIC

There seems to be no underlined option. Is there any way to underline text, or is this not a feature?


Answer:

I'm not sure if this is a better alternative or not, but I followed Tilman Hausherr's suggestion and drew a line positioned relative to my text. For instance, I have the following:

public void processPDF(int xOne, int yOne, int xTwo, int yTwo)
{
    //create pdf and its contents for one page
    PDDocument document = new PDDocument();
    File file = new File("hello.pdf");
    PDPage page = new PDPage();
    document.addPage(page);
    PDFont font = PDType1Font.HELVETICA_BOLD;
    //font size; not defined in the original snippet, so this value is an assumption
    float largeTitle = 24;

    try {
        //create content stream
        PDPageContentStream contentStream = new PDPageContentStream(document, page);

        //begin to create our text for our page
        contentStream.beginText();
        contentStream.setFont(font, largeTitle);

        //position of text (the method takes only an x and a y offset)
        contentStream.moveTextPositionByAmount(xOne, yOne);
        contentStream.drawString("Hello");
        contentStream.endText();

        //draw our line slightly below the text position
        contentStream.drawLine(xOne, yOne - .5f, xTwo, yTwo - .5f);

        //close the stream, then save and close the document
        contentStream.close();
        document.save(file);
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

where our parameters xOne, yOne, xTwo, and yTwo are our locations of the text. The line has us subtract .5 from yOne and yTwo to move it a pinch below our text location, ultimately setting it to look like underlined text.

There may be better ways, but this was the route I went.
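
One refinement, as a sketch only: instead of passing xTwo by hand, the end of the line can be derived from the text width. PDFont.getStringWidth(text) returns the width in 1/1000 text units, so the arithmetic looks like this in plain Java (the class Underline and the offset parameter are hypothetical):

```java
// Hypothetical helper, not part of PDFBox.
final class Underline {
    // stringWidthUnits: the raw value of PDFont.getStringWidth(text), in 1/1000 units.
    // Returns {xStart, yStart, xEnd, yEnd} for the underline segment.
    static float[] segment(float x, float baselineY, float stringWidthUnits,
                           float fontSize, float offset) {
        float width = stringWidthUnits / 1000f * fontSize; // width in user space units
        float y = baselineY - offset;                      // a little below the baseline
        return new float[] { x, y, x + width, y };
    }

    public static void main(String[] args) {
        float[] s = segment(100f, 700f, 2500f, 12f, 1.5f);
        System.out.println(s[0] + "," + s[1] + " -> " + s[2] + "," + s[3]);
    }
}
```

The four values map directly onto the arguments of drawLine in the code above.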

Question:

I'm trying to extract text with coordinates from a PDF file using PDFBox.

I mixed some methods/info found on the internet (Stack Overflow too), but the problem is that the coordinates don't seem to be right. When I try to use the coordinates to draw a rectangle on top of the text, for example, the rectangle is painted elsewhere.

This is my code (please don't judge the style, it was written very quickly just to test)

TextLine.java

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    /**
     *
     * @author samue
     */
    public class TextLine {
        public List<TextPosition> textPositions = null;
        public String text = "";
    }

myStripper.java

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    /*
     * To change this license header, choose License Headers in Project Properties.
     * To change this template file, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author samue
     */
    public class myStripper extends PDFTextStripper {
        public myStripper() throws IOException
        {
        }

        @Override
        protected void startPage(PDPage page) throws IOException
        {
            startOfLine = true;
            super.startPage(page);
        }

        @Override
        protected void writeLineSeparator() throws IOException
        {
            startOfLine = true;
            super.writeLineSeparator();
        }

        @Override
        public String getText(PDDocument doc) throws IOException
        {
            lines = new ArrayList<TextLine>();
            return super.getText(doc);
        }

        @Override
        protected void writeWordSeparator() throws IOException
        {
            TextLine tmpline = null;

            tmpline = lines.get(lines.size() - 1);
            tmpline.text += getWordSeparator();

            super.writeWordSeparator();
        }


        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextLine tmpline = null;

            if (startOfLine) {
                tmpline = new TextLine();
                tmpline.text = text;
                tmpline.textPositions = textPositions;
                lines.add(tmpline);
            } else {
                tmpline = lines.get(lines.size() - 1);
                tmpline.text += text;
                tmpline.textPositions.addAll(textPositions);
            }

            if (startOfLine)
            {
                startOfLine = false;
            }
            super.writeString(text, textPositions);
        }

        boolean startOfLine = true;
        public ArrayList<TextLine> lines = null;

    }

click event on AWT button

 private void jButton1MouseClicked(java.awt.event.MouseEvent evt) {                                      
    // TODO add your handling code here:
    try {
        File file = new File("C:\\Users\\samue\\Desktop\\mwb_I_201711.pdf");
        PDDocument doc = PDDocument.load(file);

        myStripper stripper = new myStripper();

        stripper.setStartPage(1); // fix it to first page just to test it
        stripper.setEndPage(1);
        stripper.getText(doc);

        TextLine line = stripper.lines.get(1); // the line i want to paint on

        float minx = -1;
        float maxx = -1;

        for (TextPosition pos: line.textPositions)
        {
            if (pos == null)
                continue;

            if (minx == -1 || pos.getTextMatrix().getTranslateX() < minx) {
                minx = pos.getTextMatrix().getTranslateX();
            }
            if (maxx == -1 || pos.getTextMatrix().getTranslateX() > maxx) {
                maxx = pos.getTextMatrix().getTranslateX();
            }
        }

        TextPosition firstPosition = line.textPositions.get(0);
        TextPosition lastPosition = line.textPositions.get(line.textPositions.size() - 1);

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();
        float w = (maxx - minx) + lastPosition.getWidth();
        float h = lastPosition.getHeightDir();

        PDPageContentStream contentStream = new PDPageContentStream(doc, doc.getPage(0), PDPageContentStream.AppendMode.APPEND, false);

        contentStream.setNonStrokingColor(Color.RED);
        contentStream.addRect(x, y, w, h);
        contentStream.fill();
        contentStream.close();

        File fileout = new File("C:\\Users\\samue\\Desktop\\pdfbox.pdf");
        doc.save(fileout);
        doc.close();
    } catch (Exception ex) {

    }
}                                     

Any suggestions? What am I doing wrong?


Answer:

This is just another case of the excessive PDFTextStripper coordinate normalization. Like you, I had thought that by using TextPosition.getTextMatrix() (instead of getX() and getY()) one would get the actual coordinates, but no: even these matrix values have to be corrected (at least in PDFBox 2.0.x, I haven't checked 1.8.x) because the matrix is multiplied by a translation that makes the lower left corner of the crop box the origin.

Thus, in your case (in which the lower left of the crop box is not the origin), you have to correct the values, e.g. by replacing

        float x = minx;
        float y = firstPosition.getTextMatrix().getTranslateY();

by

        PDRectangle cropBox = doc.getPage(0).getCropBox();

        float x = minx + cropBox.getLowerLeftX();
        float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();

Obviously, though, you will also have to correct the height somewhat. This is due to the way the PDFTextStripper determines the text height:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

(from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)

While the font bounding box indeed usually is too large, half of it often is not enough.
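
Putting the position part together as a sketch in plain Java (LineBox is a made-up name; the glyph x values and widths stand in for what you read from the TextPosition objects, and the height is simply passed in because of the caveat above):

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class LineBox {
    // glyphX[i] / glyphWidth[i]: x position and width of the i-th glyph as
    // reported by the stripper (crop-box-relative). The result is shifted by
    // the lower left corner of the crop box to get page coordinates.
    static Rectangle2D.Float lineBox(float[] glyphX, float[] glyphWidth,
                                     float baselineY, float height,
                                     float cropLowerLeftX, float cropLowerLeftY) {
        float minX = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < glyphX.length; i++) {
            minX = Math.min(minX, glyphX[i]);
            // Use the right edge of every glyph, so the last list entry need
            // not be the rightmost one.
            maxX = Math.max(maxX, glyphX[i] + glyphWidth[i]);
        }
        return new Rectangle2D.Float(minX + cropLowerLeftX, baselineY + cropLowerLeftY,
                                     maxX - minX, height);
    }

    public static void main(String[] args) {
        Rectangle2D.Float r = lineBox(new float[] { 10f, 20f, 30f },
                                      new float[] { 5f, 5f, 6f },
                                      100f, 8f, 15f, 25f);
        System.out.println(r);
    }
}
```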

Question:

I have PDF files which have text in different orientations, such as horizontally aligned, vertically aligned and inversely aligned. While using the PDFBox API to read the text from the PDF, I get good output for horizontally aligned text but not in the other cases. For example, if the word "italic" is horizontally aligned, the output is "italic". If it is vertically aligned, the output is split across lines as "it a li c" (here "it", "a", "li", "c" are on different lines). I want to know whether there is any way to get good output for vertically and inversely aligned text as well.


Answer:

You can override the processTextPosition() method of PDFTextStripper and add logic to get the direction and the x and y values of each character. By grouping the characters by direction, you can process each group separately.
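
The grouping step could look roughly like this in plain Java. Glyph is a hypothetical stand-in for the values you would copy out of each TextPosition (getUnicode() and getDir(), the latter being 0, 90, 180 or 270); sorting each group by x or y before joining is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox.
final class DirectionGroups {
    // Minimal stand-in for the TextPosition data needed here.
    static final class Glyph {
        final String text;
        final float dir;

        Glyph(String text, float dir) {
            this.text = text;
            this.dir = dir;
        }
    }

    // Collects the glyph texts per direction, keeping the encounter order.
    static Map<Integer, StringBuilder> groupByDirection(List<Glyph> glyphs) {
        Map<Integer, StringBuilder> groups = new TreeMap<>();
        for (Glyph g : glyphs) {
            int key = Math.round(g.dir);
            groups.computeIfAbsent(key, k -> new StringBuilder()).append(g.text);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("i", 90f));
        glyphs.add(new Glyph("H", 0f));
        glyphs.add(new Glyph("t", 90f));
        glyphs.add(new Glyph("i", 0f));
        System.out.println(groupByDirection(glyphs));
    }
}
```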

Question:

I am using the Apache PDFBox library to extract the highlighted text (i.e., with yellow background) from a PDF file. I am totally new to this library and don't know which of its classes to use for this purpose. So far I have extracted the text from comments using the code below.

PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0);
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}

Now I need to get the highlighted text, any code example will be highly appreciated.


Answer:

I hope this answer helps everyone who is facing the same problem.

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if(annotations.get(i).getSubtype().equals("Highlight")) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                COSFloat ULX = (COSFloat) quadsArray.get(0+k);
                COSFloat ULY = (COSFloat) quadsArray.get(1+k);
                COSFloat URX = (COSFloat) quadsArray.get(2+k);
                COSFloat URY = (COSFloat) quadsArray.get(3+k);
                COSFloat LLX = (COSFloat) quadsArray.get(4+k);
                COSFloat LLY = (COSFloat) quadsArray.get(5+k);
                COSFloat LRX = (COSFloat) quadsArray.get(6+k);
                COSFloat LRY = (COSFloat) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}
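
Note that the code above relies on the quad points arriving in the order upper left, upper right, lower left, lower right. As a hedged alternative for the rectangle step only, a bounding box over the four points does not depend on the order. A plain-Java sketch (QuadBounds is a made-up name; an unrotated page whose media box starts at the origin is assumed):

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class QuadBounds {
    // quad: 8 values x1,y1,...,x4,y4 of one quadrilateral in PDF user space.
    // Returns the top-left-origin rectangle to use as extraction region.
    static Rectangle2D.Float toRegion(float[] quad, float pageHeight) {
        float minX = quad[0], maxX = quad[0];
        float minY = quad[1], maxY = quad[1];
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        // Flip the y axis: the region's y is measured from the top of the page.
        return new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY);
    }

    public static void main(String[] args) {
        float[] quad = { 100f, 700f, 200f, 700f, 100f, 688f, 200f, 688f };
        System.out.println(toRegion(quad, 792f));
    }
}
```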

Question:

I have just moved from PDFBox 1.8 to 2.0.0 and there are quite significant differences. Previously, to write text on an existing PDF page I used drawString. In 2.0.0 drawString is deprecated, but showText does not work in a text block.

My code in 1.8:

 contentStream.beginText()
 contentStream.moveTextPositionByAmount(250, 665)
 contentStream.drawString("1  2 3 4 5 6    7  8  9   1 0")
 contentStream.endText()

My code in 2.0

  PDDocument newPdf=null
  newPdf=PDDocument.load(sourcePdfFile)
  PDPage firstPage=newPdf.getPage(0)
  PDPageContentStream contentStream = new PDPageContentStream(newPdf, firstPage, PDPageContentStream.AppendMode.APPEND,true,true)
  contentStream.setFont(pdfFont, fontSize)
  contentStream.beginText()
  contentStream.lineTo(200,685)
  contentStream.showText("John")
  contentStream.endText()

But it does not work...

Does anyone have any idea how I can write text as in 1.8?


Answer:

lineTo is for drawing a line. What you want is newLineAtOffset (the deprecation notice of moveTextPositionByAmount says so), so your code should look like this:

  PDDocument newPdf = PDDocument.load(sourcePdfFile);
  PDPage firstPage=newPdf.getPage(0);
  PDFont pdfFont= PDType1Font.HELVETICA_BOLD;
  int fontSize = 14;
  PDPageContentStream contentStream = new PDPageContentStream(newPdf, firstPage, PDPageContentStream.AppendMode.APPEND,true,true);
  contentStream.setFont(pdfFont, fontSize);
  contentStream.beginText();
  contentStream.newLineAtOffset(200,685);
  contentStream.showText("John");
  contentStream.endText();
  contentStream.close(); // don't forget that one!

Question:

I have an existing PDF file with form fields, which can be filled in by a user. These form fields have a font and text alignment which were defined when the PDF file was created.

I use Apache PDFBox to find the form field in the pdf:

PDDocument document = PDDocument.load(pdfFile);
PDAcroForm form = document.getDocumentCatalog().getAcroForm();

PDTextField textField = (PDTextField)form.getField("anyFieldName");
if (textField == null) {
  textField = (PDTextField)form.getField("fieldsContainer.anyFieldName");
}

List<PDAnnotationWidget> widgets = textField.getWidgets();
PDAnnotationWidget annotation = null;
if (widgets != null && !widgets.isEmpty()) {
  annotation = widgets.get(0);

  /* font and alignment needed here */
}

If I set the content of the form field with

textField.setValue("This is the text");

then the text in the form field has the same font and alignment as predefined for this field.

But I need the alignment and the font for a second field (which is not a form field btw.).

How can I find out which alignment (left, center, right) and which font (I need a PDType1Font and its size in points) is defined for this form field? Something like font = annotation.getFont() and alignment = annotation.getAlignment(), neither of which exists.

How to get font and alignment?

Edit:

Where I need the font is this:

PDPageContentStream content = new PDPageContentStream(document, page, AppendMode.APPEND, false);
content.setFont(font, size); /* Here I need font and size from the text field above */
content.beginText();
content.showText("My very nice text");
content.endText();

I need the font for the setFont() call.


Answer:

To get the PDFont, do this:

String defaultAppearance = textField.getDefaultAppearance(); // usually like "/Helv 12 Tf 0 0 1 rg"
Pattern p = Pattern.compile("\\/(\\w+)\\s(\\d+)\\s.*");
Matcher m = p.matcher(defaultAppearance);
if (!m.find() || m.groupCount() < 2)
{
    // oh-oh
}
String fontName = m.group(1);
int fontSize = Integer.parseInt(m.group(2));
PDAnnotationWidget widget = textField.getWidgets().get(0);
PDResources res = widget.getAppearance().getNormalAppearance().getAppearanceStream().getResources();
PDFont fieldFont = res.getFont(COSName.getPDFName(fontName));
if (fieldFont == null)
{
    // "form" is the PDAcroForm retrieved in the question's code
    fieldFont = form.getDefaultResources().getFont(COSName.getPDFName(fontName));
}
System.out.println(fieldFont + "; " + fontSize);

This retrieves the font object from the resource dictionary of the normal appearance stream of the first widget of your field. If the font isn't there, the default resource dictionary of the AcroForm is checked. Note that there are no null checks; you need to add them. At the end of the code you'll have a PDFont object and a number.

For the alignment, call getQ() on the text field; it returns the quadding value (0 = left, 1 = centered, 2 = right).
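
One caveat about the pattern above: it only accepts integer font sizes, while a default appearance string may well contain something like "/Helv 9.5 Tf". A slightly more tolerant parsing sketch in plain Java (the class DefaultAppearance and its methods are my own naming, not PDFBox API):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class DefaultAppearance {
    // "/<fontName> <size> Tf", where the size may have a fractional part.
    private static final Pattern FONT_OP =
            Pattern.compile("/(\\S+)\\s+(\\d*\\.?\\d+)\\s+Tf");

    // Returns the font resource name, or null if there is no Tf operator.
    static String fontName(String da) {
        Matcher m = FONT_OP.matcher(da);
        return m.find() ? m.group(1) : null;
    }

    // Returns the font size; 0 conventionally means "auto-size" in form fields.
    static float fontSize(String da) {
        Matcher m = FONT_OP.matcher(da);
        if (!m.find()) {
            throw new IllegalArgumentException("No Tf operator in: " + da);
        }
        return Float.parseFloat(m.group(2));
    }

    // Maps the quadding value (the Q entry) to a readable alignment.
    static String alignment(int q) {
        switch (q) {
            case 0: return "left";
            case 1: return "center";
            case 2: return "right";
            default: throw new IllegalArgumentException("Unknown quadding: " + q);
        }
    }

    public static void main(String[] args) {
        String da = "/Helv 12 Tf 0 0 1 rg";
        System.out.println(fontName(da) + " " + fontSize(da) + " " + alignment(0));
    }
}
```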

Question:


Answer:

You are starting at (0,0) after calling contents.beginText();. Thus if you want to work with absolute positions only, then put only one (absolute) positioning in a contents.beginText();contents.endText(); segment.
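
If you would rather keep everything in a single text block, the offsets have to be relative, assuming the positioning calls in question are moveTextPositionByAmount / newLineAtOffset, which move relative to the start of the current line. A plain-Java sketch of that conversion (the class Offsets is hypothetical):

```java
// Hypothetical helper, not part of PDFBox.
final class Offsets {
    // absolute: rows of {x, y} in page coordinates, in drawing order.
    // Returns rows of {tx, ty} to pass to successive relative positioning
    // calls inside ONE beginText()/endText() block.
    static float[][] toRelative(float[][] absolute) {
        float[][] relative = new float[absolute.length][2];
        // The text position starts at (0,0) right after beginText().
        float prevX = 0f;
        float prevY = 0f;
        for (int i = 0; i < absolute.length; i++) {
            relative[i][0] = absolute[i][0] - prevX;
            relative[i][1] = absolute[i][1] - prevY;
            prevX = absolute[i][0];
            prevY = absolute[i][1];
        }
        return relative;
    }

    public static void main(String[] args) {
        float[][] rel = toRelative(new float[][] { { 100f, 700f }, { 100f, 680f }, { 150f, 680f } });
        for (float[] r : rel) {
            System.out.println(r[0] + ", " + r[1]);
        }
    }
}
```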

Question:

I'm using PDFBox 2.0.1 to parse PDF documents like this.

        for (int i = 0; i < 5; i ++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    InputStream in = new ByteArrayInputStream(fileContent);
                    PDDocument document = null;
                    PDFTextStripper stripper;
                    String content;

                    try {
                        document = PDDocument.load(in);

                        stripper = new PDFTextStripper();
                        content = stripper.getText(document).trim();
                    } finally {
                        if (document != null) {
                            document.close();
                        }
                        if (in != null) {
                            in.close();
                        }
                    }
                    System.out.println(content);
                }
            }).start();
        }

Sometimes the CPU runs at 100% while parsing PDFs concurrently. The stack trace is as follows:

java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at org.apache.pdfbox.pdmodel.font.encoding.GlyphList.toUnicode(GlyphList.java:231)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:308)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:273)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:668)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:609)
at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:52)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)

GlyphList.java code is:

// Adobe Glyph List (AGL)
private static final GlyphList DEFAULT = load("glyphlist.txt", 4281);


 /**
     * Returns the Unicode character sequence for the given glyph name, or null if there isn't any.
     *
     * @param name PostScript glyph name
     * @return Unicode character(s), or null.
     */
public String toUnicode(String name)
{
    if (name == null)
    {
        return null;
    }

    String unicode = nameToUnicode.get(name);
    if (unicode != null)
    {
        return unicode;
    }

    // separate read/write cache for thread safety
    unicode = uniNameToUnicodeCache.get(name);
    if (unicode == null)
    {
        // test if we have a suffix and if so remove it
        if (name.indexOf('.') > 0)
        {
            unicode = toUnicode(name.substring(0, name.indexOf('.')));
        }
        else if (name.startsWith("uni") && name.length() == 7)
        {
            // test for Unicode name in the format uniXXXX where X is hex
            int nameLength = name.length();
            StringBuilder uniStr = new StringBuilder();
            try
            {
                for (int chPos = 3; chPos + 4 <= nameLength; chPos += 4)
                {
                    int codePoint = Integer.parseInt(name.substring(chPos, chPos + 4), 16);
                    if (codePoint > 0xD7FF && codePoint < 0xE000)
                    {
                        LOG.warn("Unicode character name with disallowed code area: " + name);
                    }
                    else
                    {
                        uniStr.append((char) codePoint);
                    }
                }
                unicode = uniStr.toString();
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        else if (name.startsWith("u") && name.length() == 5)
        {
            // test for an alternate Unicode name representation uXXXX
            try
            {
                int codePoint = Integer.parseInt(name.substring(1), 16);
                if (codePoint > 0xD7FF && codePoint < 0xE000)
                {
                    LOG.warn("Unicode character name with disallowed code area: " + name);
                }
                else
                {
                    unicode = String.valueOf((char) codePoint);
                }
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        uniNameToUnicodeCache.put(name, unicode);
    }
    return unicode;
}

So when the shared instance is called like this

GlyphList.DEFAULT.toUnicode(code)

from several threads, the concurrency error occurs (note the variable uniNameToUnicodeCache), and PDSimpleFont.toUnicode does exactly that.

However, it seems that nobody else has run into this problem. I don't know whether my analysis above is right or wrong. If it really is a bug, has it been fixed?


Answer:

Reviewing the GlyphList class code it becomes apparent that it has not been prepared for multi-threaded use. On the other hand a DEFAULT instance of it is used as a singleton via getAdobeGlyphList concurrently by text extraction code.

This can become an issue in its toUnicode(String) method if the documents in question use glyph names following the unofficial scheme uniXXXX or uXXXX, because in such a case this method not only reads from the HashMap uniNameToUnicodeCache but also writes to it (adding the resolved unofficial glyph name for quick lookup later).

If such a write happens concurrently with some other thread's read from the map, indeed a ConcurrentModificationException may occur.

I'd propose changing the GlyphList to either

  • not write to uniNameToUnicodeCache anymore, or
  • synchronize toUnicode(String) or more precisely the uniNameToUnicodeCache reads and writes therein, or
  • make uniNameToUnicodeCache a ConcurrentHashMap instead of a HashMap.

I would expect the third option to perform better than the second one.
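
As an illustration of the third option, here is a minimal sketch of a cache backed by a ConcurrentHashMap. The class UniNameCache and its methods are hypothetical stand-ins for the uniXXXX branch of toUnicode, not PDFBox code. One caveat that matters for such a change: ConcurrentHashMap rejects null values, so unlike the HashMap version, failed lookups cannot simply be cached as null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the uniXXXX branch of GlyphList.toUnicode; not PDFBox code.
class UniNameCache {
    // ConcurrentHashMap allows concurrent reads and writes without external locking.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    String toUnicode(String name) {
        if (name == null) {
            return null;
        }
        String cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        String unicode = parseUniName(name);
        if (unicode != null) {
            // ConcurrentHashMap does not accept null values, so only hits are cached.
            cache.put(name, unicode);
        }
        return unicode;
    }

    // Parses names of the form uniXXXX (four hex digits); returns null otherwise.
    private static String parseUniName(String name) {
        if (!name.startsWith("uni") || name.length() != 7) {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(name.substring(3), 16);
            // Surrogate code points are not valid characters on their own.
            if (codePoint > 0xD7FF && codePoint < 0xE000) {
                return null;
            }
            return String.valueOf((char) codePoint);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Two threads may still parse the same name at the same time, but both compute the same value, so the duplicate put is harmless.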

Question:

I'm trying to extract text from a PDF which is full of tables. In some cases a column is empty. When I extract the text from the PDF, the empty columns are skipped and replaced by a whitespace, so my regular expressions cannot tell that there was a column with no information at this spot.

Image for better understanding:

We can see that the columns aren't respected in the extracted text.

Sample of my code that extracts the text from the PDF:

PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));

How can I maintain the full structure of the original PDF when extracting text from it?

Thanks a lot.


Answer:

(This originally was the answer (dated Feb 6 '15) to another question which the OP deleted including all answers. Due to the age, the code in the answer was still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)

In comments the OP showed interest in a solution to extend the PDFBox PDFTextStripper to return text lines which attempt to reflect the PDF file layout which might help in case of the question at hand.

A proof-of-concept for that would be this class:

public class LayoutTextStripper extends PDFTextStripper
{
    public LayoutTextStripper() throws IOException
    {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        float recentEnd = 0;
        for (TextPosition textPosition: textPositions)
        {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
            {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);

                for (; spacesToInsert > 0; spacesToInsert--)
                {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try
            {
                recentEnd = getEndX(textPosition);
            }
            catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
            {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        needsWS = true;
    }

    void beginLine()
    {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
    {
        int indexNow = charsInLineAlready;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;

        return spacesToInsert;
    }

    float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
    {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

It is used like this:

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In my sample documents values from 3..6 were of interest.

It essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, as iText text extraction forwards text chunks and PDFBox text extraction forwards individual characters.

Please be aware that this is merely a proof-of-concept. In particular, it does not take any rotation into account.
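
To see how the choice of fixedCharWidth changes the output, the arithmetic of insertSpaces can be tried in isolation. The following is only a sketch: the class ColumnMath is a hypothetical helper that mirrors that method, with pageLeft and fixedCharWidth passed in as parameters, and it clamps negative results to zero (the original loop simply does not run for them).

```java
// Hypothetical helper mirroring the insertSpaces arithmetic of LayoutTextStripper.
class ColumnMath {
    static int spacesToInsert(int charsInLineAlready, float chunkStart, float pageLeft,
                              float fixedCharWidth, boolean spaceRequired) {
        // Column the chunk should start in, given the assumed character width.
        int indexToBe = (int) ((chunkStart - pageLeft) / fixedCharWidth);
        int spaces = indexToBe - charsInLineAlready;
        if (spaces < 1 && spaceRequired) {
            spaces = 1;
        }
        // Clamp: a chunk that starts left of the current column gets no padding.
        return Math.max(spaces, 0);
    }
}
```

For a chunk starting at x = 150 on a line that already holds 10 characters, a width of 5 yields 20 padding spaces while a width of 3 yields 40. A smaller fixedCharWidth therefore spreads the columns further apart, and a value that is too large makes neighbouring columns collide.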

Question:

I need to read a plan exported by AutoCAD to PDF and place some markers with text on it with PDFBox. Everything works fine, except the calculation of the width of the text, which is written next to the markers.

I skimmed through the whole PDF specification and read in detail the parts which deal with graphics and text, but to no avail. As far as I understand, the glyph coordinate space is set up in 1/1000 of the user coordinate space. Hence the width needs to be scaled up by 1000, but it's still only a fraction of the real width.

This is what I am doing to position the text:

float textWidth = font.getStringWidth(marker.id) * 0.043f;
contentStream.beginText();
contentStream.setTextScaling(1, 1, 0, 0);
contentStream.moveTextPositionByAmount(
  marker.endX + marker.getXTextOffset(textWidth, fontPadding),
  marker.endY + marker.getYTextOffset(fontSize, fontPadding));
contentStream.drawString(marker.id);
contentStream.endText();

The * 0.043f works as an approximation for one document, but fails for the next. Do I need to reset any other transformation matrix except the text matrix?

EDIT: A full idea example project is on github with tests and example pdfs: https://github.com/ascheucher/pdf-stamp-prototype

Thanks for your help!


Answer:

Unfortunately the question and comments merely include (by running the sample project) the actual result for two source documents and the description

The annotating text should be center aligned on the top and bottom marker, aligned to the left on the right marker and aligned to the right on the left marker. The alignment is not working for me, as the font.getSTringWidth( .. ) returns only a fraction of what it seems to be. And the discrepance seems to be different in both PDFs.

but not a concrete sample discrepancy to repair.

There are several issues in the code, though, which may lead to such observations (and other ones, too!). Fixing them should be done first; this may already resolve the issues observed by the OP.

Which box to take

The code of the OP derives several values from the media box:

PDRectangle pageSize = page.findMediaBox();
float pageWidth = pageSize.getWidth();
float pageHeight = pageSize.getHeight();
float lineWidth = Math.max(pageWidth, pageHeight) / 1000;
float markerRadius = lineWidth * 10;
float fontSize = Math.min(pageWidth, pageHeight) / 20;
float fontPadding = Math.max(pageWidth, pageHeight) / 100;

These seem to be chosen to be optically pleasing in relation to the page size. But the media box is not, in general, the final displayed or printed page size, the crop box is. Thus, it should be

PDRectangle pageSize = page.findCropBox();

(Actually the trim box, the intended dimensions of the finished page after trimming, might even be more apropos; the trim box defaults to the crop box. For details read here.)

This is not relevant for the given sample documents as they do not contain explicit crop box definitions, so the crop box defaults to the media box. It might be relevant for other documents, though, e.g. those the OP could not include.

Which PDPageContentStream constructor to use

The code of the OP adds a content stream to the page at hand using this constructor:

PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);

This constructor appends (first true) and compresses (second true) but unfortunately it continues in the graphics state left behind by the pre-existing content.

Details of the graphics state of importance for the observations at hand:

  • Transformation matrix - it may have been changed to scale (or rotate, skew, move ...) any new content added
  • Character spacing - it may have been changed to put any new characters added nearer to or farther from each other
  • Word spacing - it may have been changed to put any new words added nearer to or farther from each other
  • Horizontal scaling - it may have been changed to scale any new characters added
  • Text rise - it may have been changed to displace any new characters added vertically

Thus, a constructor should be chosen which also resets the graphics state:

PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true, true);

The third true tells PDFBox to reset the graphics state, i.e. to surround the former content with a save-state/restore-state operator pair.

This is relevant for the given sample documents: at least the transformation matrix is changed in them.

Setting and using the CalRGB color space

The OP's code sets the stroking and non-stroking color spaces to a calibrated color space:

contentStream.setStrokingColorSpace(new PDCalRGB());
contentStream.setNonStrokingColorSpace(new PDCalRGB());

Unfortunately new PDCalRGB() does not create a valid CalRGB color space object because its required WhitePoint value is missing. Thus, before selecting a calibrated color space, initialize it properly.

Thereafter the OP's code sets the colors using

contentStream.setStrokingColor(marker.color.r, marker.color.g, marker.color.b);
contentStream.setNonStrokingColor(marker.color.r, marker.color.g, marker.color.b);

These (int, int, int) overloads unfortunately use the RG and rg operators implicitly selecting the DeviceRGB color space. To not overwrite the current color space, use the (float[]) overloads with normalized (0..1) values instead.

While this is not relevant for the observed issue, it causes error messages by PDF viewers.

Calculating the width of a drawn string

The OP's code calculates the width of a drawn string using

float textWidth = font.getStringWidth(marker.id) * 0.043f;

and the OP is surprised

The * 0.043f works as an approximation for one document, but fails for the next.

There are two factors building this "magic" number:

  • As the OP has remarked, the glyph coordinate space is set up in 1/1000 of the user coordinate space, and the number returned by getStringWidth is in glyph space; thus a factor of 0.001.

  • What the OP has overlooked is that he wants the width of the string at the font size he selected. The font object has no knowledge of the current font size and returns the width for a font size of 1. As the OP selects the font size dynamically as Math.min(pageWidth, pageHeight) / 20, this factor varies: in the case of the two given sample documents it is about 42, but it is probably totally different in other documents.
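
Putting the two factors together gives a simple conversion. The sketch below is a hypothetical helper, not PDFBox API; glyphSpaceWidth stands for the value returned by font.getStringWidth(text), and the calculation assumes default character spacing, word spacing and horizontal scaling.

```java
// Hypothetical helper: converts a glyph-space string width into text-space units.
class TextWidth {
    // glyphSpaceWidth: result of font.getStringWidth(text), in 1/1000 units.
    // fontSize: the size used when the font is selected in the content stream.
    static float inTextSpace(float glyphSpaceWidth, float fontSize) {
        return glyphSpaceWidth / 1000f * fontSize;
    }
}
```

In the OP's code this would presumably amount to font.getStringWidth(marker.id) / 1000f * fontSize in place of the hard-coded * 0.043f. With a font size of about 42 the combined factor is about 0.042, which explains why the magic number happened to fit the sample documents.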

Positioning text

The OP's code positions the text like this starting from identity text matrices:

contentStream.moveTextPositionByAmount(
    marker.endX + marker.getXTextOffset(textWidth, fontPadding),
    marker.endY + marker.getYTextOffset(fontSize, fontPadding));

using methods getXTextOffset and getYTextOffset:

public float getXTextOffset(float textWidth, float fontPadding) {
    if (getLocation() == Location.TOP)
        return (textWidth / 2 + fontPadding) * -1;
    else if (getLocation() == Location.BOTTOM)
        return (textWidth / 2 + fontPadding) * -1;
    else if (getLocation() == Location.RIGHT)
        return 0 + fontPadding;
    else
        return (textWidth + fontPadding) * -1;
}

public float getYTextOffset(float fontSize, float fontPadding) {
    if (getLocation() == Location.TOP)
        return 0 + fontPadding;
    else if (getLocation() == Location.BOTTOM)
        return (fontSize + fontPadding) * -1f;
    else
        return fontSize / 2 * -1;
}

In case of getXTextOffset I doubt that adding fontPadding for Location.TOP and Location.BOTTOM makes sense, especially in the light of the OP's desire

The annotating text should be center aligned on the top and bottom marker

For the text to be centered it should not be shifted off-center.

The case of getYTextOffset is more difficult. The OP's code is built upon two misunderstandings: It assumes

  • that the text position selected by moveTextPositionByAmount is the lower left, and
  • that the font size is the character height.

Actually the text position is on the base line; the glyph origin of the next drawn glyph will be positioned there.

Thus, the y position either has to be corrected to take the descent into account (for centering on the whole glyph height) or only use the ascent (for centering on the above-baseline glyph height).

And a font size does not denote the actual character height but is arranged so that the nominal height of tightly spaced lines of text is 1 unit for font size 1. "Tightly spaced" implies that some small amount of additional inter-line space is contained in the font size.

In essence for centering vertically one has to decide what to center on, whole height or above-baseline height, first letter only, whole label, or all font glyphs. PDFBox does not readily supply the necessary information for all cases but methods like PDFont.getFontBoundingBox() should help.
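
The two centering variants can be sketched as follows. BaselineMath is a hypothetical helper, not PDFBox API; ascent and descent are assumed to be glyph-space values in 1/1000 units (descent usually negative), taken for example from the font's descriptor or bounding box.

```java
// Hypothetical helper, not PDFBox API. Ascent and descent are glyph-space values
// (1/1000 units); descent is usually negative.
class BaselineMath {
    // Baseline y so that the span from descent to ascent is centered on centerY.
    static float centerOnWholeHeight(float centerY, float ascent, float descent, float fontSize) {
        return centerY - (ascent + descent) / 2f * fontSize / 1000f;
    }

    // Baseline y so that only the above-baseline part (0 to ascent) is centered on centerY.
    static float centerOnAscent(float centerY, float ascent, float fontSize) {
        return centerY - ascent / 2f * fontSize / 1000f;
    }
}
```

Which variant looks better depends on the label text: labels consisting of digits and uppercase letters have no descenders, so centering on the ascent alone is often the more natural choice for them.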

Question:

Black shapes are text that need to be extracted:

So far, I've extracted the text from the columns, but manually, because there are only 5 of them (using the Rectangle class for the regions). My question is: is there a way to do the same for rows? The heights of the rectangles differ, and defining them manually for 50+ rows would be an atrocity. More specifically, can I adjust the rectangle to each row's height using a function? Or is there any other suggestion that may help?


Answer:

As proposed in comments, you can automatically recognize the table cell regions of your example PDF by parsing the vector graphics instructions of the page.

For such a task you can extend the PDFBox PDFGraphicsStreamEngine which provides abstract methods called for path building and drawing instructions.

Beware: The stream engine class I show here is specialized on recognizing table cell frame lines drawn as long, small rectangles filled with black as used in your example document. For a general solution you should at least also recognize frame lines drawn as vector graphics line segments or as stroked rectangles.

The stream engine class PdfBoxFinder

This stream engine class collects the y coordinate ranges of horizontal lines and the x coordinate ranges of vertical lines and afterwards provides the boxes of the grid defined by these coordinate ranges. In particular this means that row spans or column spans are not supported; in the case at hand this is ok as there are no such spans.

public class PdfBoxFinder extends PDFGraphicsStreamEngine {
    /**
     * Supply the page to analyze here; to analyze multiple pages
     * create multiple {@link PdfBoxFinder} instances.
     */
    public PdfBoxFinder(PDPage page) {
        super(page);
    }

    /**
     * The boxes ({@link Rectangle2D} instances with coordinates according to
     * the PDF coordinate system, e.g. for decorating the table cells) the
     * {@link PdfBoxFinder} has recognized on the current page.
     */
    public Map<String, Rectangle2D> getBoxes() {
        consolidateLists();
        Map<String, Rectangle2D> result = new HashMap<>();
        if (!horizontalLines.isEmpty() && !verticalLines.isEmpty())
        {
            Interval top = horizontalLines.get(horizontalLines.size() - 1);
            char rowLetter = 'A';
            for (int i = horizontalLines.size() - 2; i >= 0; i--, rowLetter++) {
                Interval bottom = horizontalLines.get(i);
                Interval left = verticalLines.get(0);
                int column = 1;
                for (int j = 1; j < verticalLines.size(); j++, column++) {
                    Interval right = verticalLines.get(j);
                    String name = String.format("%s%s", rowLetter, column);
                    Rectangle2D rectangle = new Rectangle2D.Float(left.from, bottom.from, right.to - left.from, top.to - bottom.from);
                    result.put(name, rectangle);
                    left = right;
                }
                top = bottom;
            }
        }
        return result;
    }

    /**
     * The regions ({@link Rectangle2D} instances with coordinates according
     * to the PDFBox text extraction API, e.g. for initializing the regions of
     * a {@link PDFTextStripperByArea}) the {@link PdfBoxFinder} has recognized
     * on the current page.
     */
    public Map<String, Rectangle2D> getRegions() {
        PDRectangle cropBox = getPage().getCropBox();
        float xOffset = cropBox.getLowerLeftX();
        float yOffset = cropBox.getUpperRightY();
        Map<String, Rectangle2D> result = getBoxes();
        for (Map.Entry<String, Rectangle2D> entry : result.entrySet()) {
            Rectangle2D box = entry.getValue();
            Rectangle2D region = new Rectangle2D.Float(xOffset + (float)box.getX(), yOffset - (float)(box.getY() + box.getHeight()), (float)box.getWidth(), (float)box.getHeight());
            entry.setValue(region);
        }
        return result;
    }

    /**
     * <p>
     * Processes the path elements currently in the {@link #path} list and
     * eventually clears the list.
     * </p>
     * <p>
     * Currently only elements are considered which 
     * </p>
     * <ul>
     * <li>are {@link Rectangle} instances;
     * <li>are filled fairly black;
     * <li>have a thin and long form; and
     * <li>have sides fairly parallel to the coordinate axis.
     * </ul>
     */
    void processPath() throws IOException {
        PDColor color = getGraphicsState().getNonStrokingColor();
        if (!isBlack(color)) {
            logger.debug("Dropped path due to non-black fill-color.");
            return;
        }

        for (PathElement pathElement : path) {
            if (pathElement instanceof Rectangle) {
                Rectangle rectangle = (Rectangle) pathElement;

                double p0p1 = rectangle.p0.distance(rectangle.p1);
                double p1p2 = rectangle.p1.distance(rectangle.p2);
                boolean p0p1small = p0p1 < 3;
                boolean p1p2small = p1p2 < 3;

                if (p0p1small) {
                    if (p1p2small) {
                        logger.debug("Dropped rectangle too small on both sides.");
                    } else {
                        processThinRectangle(rectangle.p0, rectangle.p1, rectangle.p2, rectangle.p3);
                    }
                } else if (p1p2small) {
                    processThinRectangle(rectangle.p1, rectangle.p2, rectangle.p3, rectangle.p0);
                } else {
                    logger.debug("Dropped rectangle too large on both sides.");
                }
            }
        }
        path.clear();
    }

    /**
     * The argument points shall be sorted to have (p0, p1) and (p2, p3) be the small
     * edges and (p1, p2) and (p3, p0) the long ones.
     */
    void processThinRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        float longXDiff = (float)Math.abs(p2.getX() - p1.getX());
        float longYDiff = (float)Math.abs(p2.getY() - p1.getY());
        boolean longXDiffSmall = longXDiff * 10 < longYDiff;
        boolean longYDiffSmall = longYDiff * 10 < longXDiff;

        if (longXDiffSmall) {
            verticalLines.add(new Interval(p0.getX(), p1.getX(), p2.getX(), p3.getX()));
        } else if (longYDiffSmall) {
            horizontalLines.add(new Interval(p0.getY(), p1.getY(), p2.getY(), p3.getY()));
        } else {
            logger.debug("Dropped rectangle too askew.");
        }
    }

    /**
     * Sorts the {@link #horizontalLines} and {@link #verticalLines} lists and
     * merges fairly identical entries.
     */
    void consolidateLists() {
        for (List<Interval> intervals : Arrays.asList(horizontalLines, verticalLines)) {
            intervals.sort(null);
            for (int i = 1; i < intervals.size();) {
                if (intervals.get(i-1).combinableWith(intervals.get(i))) {
                    Interval interval = intervals.get(i-1).combineWith(intervals.get(i));
                    intervals.set(i-1, interval);
                    intervals.remove(i);
                } else {
                    i++;
                }
            }
        }
    }

    /**
     * Checks whether the given color is black'ish.
     */
    boolean isBlack(PDColor color) throws IOException {
        // toRGB() returns a packed RGB value; check all three components
        int value = color.toRGB();
        for (int i = 0; i < 3; i++) {
            int component = value & 0xff;
            if (component > 5)
                return false;
            value /= 256;
        }
        return true;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
    }

    @Override
    public void strokePath() throws IOException {
        path.clear();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        processPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void moveTo(float x, float y) throws IOException { }
    @Override public void lineTo(float x, float y) throws IOException { }
    @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }
    @Override public Point2D getCurrentPoint() throws IOException { return null; }
    @Override public void closePath() throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // inner classes
    //
    class Interval implements Comparable<Interval> {
        final float from;
        final float to;

        Interval(float... values) {
            Arrays.sort(values);
            this.from = values[0];
            this.to = values[values.length - 1];
        }

        Interval(double... values) {
            Arrays.sort(values);
            this.from = (float) values[0];
            this.to = (float) values[values.length - 1];
        }

        boolean combinableWith(Interval other) {
            if (this.from > other.from)
                return other.combinableWith(this);
            if (this.to < other.from)
                return false;
            float intersectionLength = Math.min(this.to, other.to) - other.from;
            float thisLength = this.to - this.from;
            float otherLength = other.to - other.from;
            return (intersectionLength >= thisLength * .9f) || (intersectionLength >= otherLength * .9f);
        }

        Interval combineWith(Interval other) {
            return new Interval(this.from, this.to, other.from, other.to);
        }

        @Override
        public int compareTo(Interval o) {
            return this.from == o.from ? Float.compare(this.to, o.to) : Float.compare(this.from, o.from);
        }

        @Override
        public String toString() {
            return String.format("[%3.2f, %3.2f]", from, to);
        }
    }

    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();
    final List<Interval> horizontalLines = new ArrayList<>();
    final List<Interval> verticalLines = new ArrayList<>();
    final Logger logger = LoggerFactory.getLogger(PdfBoxFinder.class);
}

(PdfBoxFinder.java)

Example use

You can use the PdfBoxFinder like this to extract text from the table cells of the sample document located at FILE_PATH:

try (   PDDocument document = PDDocument.load(FILE_PATH) ) {
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfBoxFinder boxFinder = new PdfBoxFinder(page);
        boxFinder.processPage(page);

        PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
        for (Map.Entry<String, Rectangle2D> entry : boxFinder.getRegions().entrySet()) {
            stripperByArea.addRegion(entry.getKey(), entry.getValue());
        }

        stripperByArea.extractRegions(page);
        List<String> names = stripperByArea.getRegions();
        names.sort(null);
        for (String name : names) {
            System.out.printf("[%s] %s\n", name, stripperByArea.getTextForRegion(name));
        }
    }
}

(ExtractBoxedText test testExtractBoxedTexts)

The start of the output:

[A1] Nr. 
crt. 

[A2] Nume şi prenume 

[A3] Titlul lucrării 

[A4] Coordonator ştiinţific 

[A5] Ora 

[B1] 1. 

[B2] SFETCU I. JESSICA-
LARISA 

[B3] Analiza fluxurilor de date twitter 

[B4] Conf. univ. dr. Frîncu Marc 
Eduard 


[B5] 8:00 

[C1] 2. 

[C2] TARBA V. IONUȚ-
ADRIAN 

[C3] Test me - rest api folosind java şi 
play framework 

[C4] Conf.univ.dr. Fortiş Teodor 
Florin 


[C5] 8:12 

The first page of the document:

Question:

Hopefully you have an idea of what is going wrong when extracting text from a PDF using pdfbox 2.0.7. The result is very strange:

Using 1.8.13, the command java -jar pdfbox-app-1.8.13.jar ExtractText -sort -nonSeq test.pdf leads to

Deutsche Bank Privat- und Geschäftskunden AG

Bruttoertrag 43,80 USD 37,15 EUR
Kapitalertragsteuer (KESt) - 5,36 USD - 4,55 EUR
Solidaritätszuschlag auf KESt - 0,29 USD - 0,25 EUR
Umrechnungskurs USD zu EUR 1,1791000000
Gutschrift mit Wert 15.08.2017 32,35 EUR

Using 2.0.7, the command java -jar pdfbox-app-2.0.7.jar ExtractText -sort test.pdf leads to

aeutsche Bank mrivat- und deschäftskunden Ad

Bruttoertrag QPIUM rpa PTINR bro
hapitaäertragsteuer EhbptF - RIPS rpa - QIRR bro
poäidaritätszuschäag auf hbpt - MIOV rpa - MIOR bro
rmrechnungskurs rpa zu bro NINTVNMMMMMM
dutschrift mit tert NRKMUKOMNT POIPR bro

The debugger with java -jar pdfbox-app-2.0.7.jar PDFDebugger test.pdf shows the correct text in Root/Pages/Kids/[1]/Contents/[1] so somehow the text is read correctly but not exported correctly.

I have tried to compare the information shown in the two PDFDebugger applications but they seem rather identical to me (although I don't know where/what to look for exactly). Unfortunately, I cannot share the PDF document.

I would be happy for any kind of hint of how to solve or even only attack this problem as otherwise I cannot use the newer version of pdfbox. Thanks in advance for your time!

Here is a screenshot of the Font which is used in the document (extracted with 2.0.7). This is exactly the translation of the letters that apparently is not performed:

The entry ToUnicode says

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /AdHoc-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
68 beginbfchar
<0004> <0021>
<0009> <0026>
<000b> <0028>
<000c> <0029>
<000f> <002c>
<0010> <002d>
<0011> <002e>
<0012> <002f>
<0013> <0030>
<0014> <0031>
<0015> <0032>
<0016> <0033>
<0017> <0034>
<0018> <0035>
<0019> <0036>
<001a> <0037>
<001b> <0038>
<001c> <0039>
<001d> <003a>
<001e> <003b>
<0024> <0041>
<0025> <0042>
<0026> <0043>
<0027> <0044>
<0028> <0045>
<0029> <0046>
<002a> <0047>
<002b> <0048>
<002c> <0049>
<002e> <004b>
<0030> <004d>
<0031> <004e>
<0032> <004f>
<0033> <0050>
<0034> <0051>
<0035> <0052>
<0036> <0053>
<0037> <0054>
<0038> <0055>
<0039> <0056>
<003a> <0057>
<003d> <005a>
<0044> <0061>
<0045> <0062>
<0046> <0063>
<0047> <0064>
<0048> <0065>
<0049> <0066>
<004a> <0067>
<004b> <0068>
<004c> <0069>
<004d> <006a>
<004e> <006b>
<004f> <006c>
<0050> <006d>
<0051> <006e>
<0052> <006f>
<0053> <0070>
<0055> <0072>
<0056> <0073>
<0057> <0074>
<0058> <0075>
<0059> <0076>
<005a> <0077>
<005d> <007a>
<006c> <00e4>
<0081> <00fc>
<0089> <00df>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

The TextView of page 2 of PDF already shows the correct text, but then somehow these replacement tables that are shown above seem to incorrectly modify the text content before it is exported by pdfbox:

Root/Pages/Kids/[1]/Contents/[1]:
=================================
0 Tw
0 Tc
0 0 0 rg
0 0 0 RG
BT
  /F1 10 Tf
  1 0 0 1 69.449 697.11 Tm
  (Wir) Tj
  1 0 0 1 87.199 697.11 Tm
  (\374berweisen) Tj
  1 0 0 1 141.099 697.11 Tm
  (den) Tj
  1 0 0 1 160.549 697.11 Tm
  (Betrag) Tj
  1 0 0 1 192.759 697.11 Tm
  (von) Tj
  1 0 0 1 211.649 697.11 Tm
  (32,35) Tj
  1 0 0 1 239.429 697.11 Tm
  (EUR) Tj
  1 0 0 1 263.299 697.11 Tm
  (auf) Tj
  1 0 0 1 279.959 697.11 Tm
  (Ihr) Tj
  1 0 0 1 294.389 697.11 Tm
  (Konto) Tj
  1 0 0 1 323.269 697.11 Tm
  (XXXXXXX) Tj
  1 0 0 1 364.959 697.11 Tm
  (XX) Tj
  1 0 0 1 376.079 697.11 Tm
  (.) Tj
  0 G
  0 g
ET
69.449 669.448 m
69.449 669.698 l
549.921 669.698 l
549.921 669.448 l
549.921 669.198 l
69.449 669.198 l
h
f
0 0 0 rg
0 0 0 RG
BT
  /F1 6 Tf
  1 0 0 1 249.022 658.948 Tm
  (Kapitalertr\344ge) Tj
  1 0 0 1 288.016 658.948 Tm
  (sind) Tj
  1 0 0 1 300.682 658.948 Tm
  (einkommensteuerpflichtig!) Tj
  1 0 0 1 213.865 652.783 Tm
  (Diese) Tj
  1 0 0 1 230.863 652.783 Tm
  (Mitteilung) Tj
  1 0 0 1 258.187 652.783 Tm
  (wurde) Tj
  1 0 0 1 276.187 652.783 Tm
  (maschinell) Tj
  1 0 0 1 306.187 652.783 Tm
  (erstellt) Tj
  1 0 0 1 325.507 652.783 Tm
  (und) Tj
  1 0 0 1 337.177 652.783 Tm
  (wird) Tj
  1 0 0 1 349.837 652.783 Tm
  (nicht) Tj
  1 0 0 1 364.165 652.783 Tm
  (unterschrieben.) Tj
  0 G
  0 g
ET
q
  1 0 0 1 504.562 772.646 cm
  1 0 0 1 0 0 cm
  q
    0 Tw
    0 Tc
    45.36 0 0 45.36 0 0 cm
    /I0 Do
  Q
Q
0 0 0 rg
0 0 0 RG
BT
  /F1 10.5 Tf
  1 0 0 1 552.756 23.464 Tm
  (2) Tj
  1 0 0 1 558.594 23.464 Tm
  (/) Tj
  1 0 0 1 561.503 23.464 Tm
  (2) Tj
ET
Q
q
0 0 m
0 841.89 l
595.276 841.89 l
595.276 0 l
h
0 0 m
595.276 0 l
595.276 841.89 l
0 841.89 l
h
W
n
Q

1.8.13 shows:

Wir überweisen den Betrag von 32,35 EUR auf Ihr Konto XXXXXXX XX.
Kapitalerträge sind einkommensteuerpflichtig!
Diese Mitteilung wurde maschinell erstellt und wird nicht unterschrieben.
2/2

2.0.7 shows:

tir überweisen den Betrag von POIPR bro auf fhr honto XXXXXXX XX
hapitaäerträge sind einkommensteuerpfäichtig!
aiese jitteiäung wurde maschineää ersteäät und wird nicht unterschriebenK
O/O

This is the file that you were asking for: https://wetransfer.com/downloads/214674449c23713ee481c5a8f529418320170827201941/b2bea6


Answer:

The information about the font in question in your PDF is contradictory and partially broken. Depending on how a given piece of software reacts to that, it may or may not extract the text correctly.


On the one hand the font has an Encoding value WinAnsiEncoding. This is ok and matches what we see in the content stream, a one-byte encoding covering many of the ANSI codes.

On the other hand we have a ToUnicode map which implies that the underlying encoding is some two-byte encoding (it has a code space range <0000> <ffff>), and even if one ignores the two-byte nature, it has mappings which in particular map digit ANSI codes to uppercase letters, uppercase letter ANSI codes to other lowercase letters, and the lowercase 'l' ANSI code to the Unicode value of 'ä'.

When extracting text, PDFBox 2.0.x seems to follow the broken ToUnicode map (interpreting the two-byte codes in the table as one-byte codes, ignoring the upper 00 byte) where possible (resulting in garbage) and otherwise to interpret the character code as ANSI (resulting in proper text). PDFBox 1.8.x seems to have ignored the ToUnicode map, and so does Adobe Reader.
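The effect can be reproduced without PDFBox. The following is only a sketch of the lookup order described above, not PDFBox's actual decoding code: the class and method names are made up for illustration, the table holds just a handful of the bfchar pairs quoted in the question, and the fallback treats unmapped codes as Latin-1, which is close enough to WinAnsi for these strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class BrokenToUnicodeDemo {
    // A few bfchar pairs copied from the ToUnicode CMap in the question,
    // keyed by their low byte only (the upper 00 byte is ignored).
    static final Map<Integer, Integer> TO_UNICODE = new HashMap<>();
    static {
        int[][] pairs = {
            {0x2c, 0x49}, {0x2e, 0x4b}, {0x32, 0x4f}, {0x33, 0x50}, {0x35, 0x52},
            {0x45, 0x62}, {0x49, 0x66}, {0x4b, 0x68}, {0x52, 0x6f}, {0x55, 0x72},
            {0x57, 0x74}, {0x6c, 0xe4}
        };
        for (int[] p : pairs) {
            TO_UNICODE.put(p[0], p[1]);
        }
    }

    /** Uses the ToUnicode entry where one exists, else the character code itself. */
    static String decode(String contentStreamText) {
        byte[] codes = contentStreamText.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xff;
            Integer mapped = TO_UNICODE.get(code);
            out.append((char) (mapped != null ? mapped : code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Wir"));        // tir
        System.out.println(decode("32,35 EUR"));  // POIPR bro
    }
}
```

Uppercase letters, digits and some punctuation have entries in the map and come out wrong, while most lowercase letters have none and survive, which matches the 2.0.7 output quoted in the question.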


Actually it looks like the ToUnicode map has been made for a font using Identity-H encoding.


If you are confronted with such a PDF and need to extract its text, you can pre-process it and remove the ToUnicode entries; thereafter text extraction should return proper text. E.g.

PDDocument document = PDDocument.load(SOURCE);

for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
{
    PDPage page = document.getPage(pageNr);
    PDResources resources = page.getResources();
    removeToUnicodeMaps(resources);
}

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

(ExtractText test method testNoToUnicodeTest2)

using helper methods

void removeToUnicodeMaps(PDResources pdResources) throws IOException
{
    COSDictionary resources = pdResources.getCOSObject();

    COSDictionary fonts = asDictionary(resources, COSName.FONT);
    if (fonts != null)
    {
        for (COSBase object : fonts.getValues())
        {
            while (object instanceof COSObject)
                object = ((COSObject)object).getObject();
            if (object instanceof COSDictionary)
            {
                COSDictionary font = (COSDictionary)object;
                font.removeItem(COSName.TO_UNICODE);
            }
        }
    }

    for (COSName name : pdResources.getXObjectNames())
    {
        PDXObject xobject = pdResources.getXObject(name);
        if (xobject instanceof PDFormXObject)
        {
            PDResources xobjectPdResources = ((PDFormXObject)xobject).getResources();
            removeToUnicodeMaps(xobjectPdResources);
        }
    }
}

COSDictionary asDictionary(COSDictionary dictionary, COSName name)
{
    COSBase object = dictionary.getDictionaryObject(name);
    return object instanceof COSDictionary ? (COSDictionary) object : null;
}

(from ExtractText)

You should execute this pre-processing as early as possible after loading the document, so that the fonts with the wrong ToUnicode mappings are not read into the document font cache first.

Question:

I am using the PDFBox library and currently I don't understand the behavior of the moveTextPositionByAmount(X,Y) method.

Here is the code I am using:

[...]    
int i = 0;
        for (InventoryItem currInvItem : invList) {
            try {
                content.moveTextPositionByAmount(textPositionX, textPositionY);
                content.drawString(currInvItem.toString());
                textPositionY = textPositionY+10;
                i++;
                if (i > 10) {
                    break;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
[...]

This simply goes through a list and prints the elements into my PDF file. I expected the moveTextPositionByAmount(X,Y) method to simply move the cursor to another position by some amount in a linear way.

Unfortunately this is not the case and results in a PDF file that has the text in it but the distance between text elements increases with every iteration of the loop even though I just increase my variable textPositionY by 10. The increase in distance between elements is best described with something like distance = e^x

Here is a simplified example output:


Answer:

moveTextPositionByAmount() does not take absolute coordinates as parameters. It's relative positioning.

Let's say you start at coordinates 0 0 and textPositionY was initialized with 10. Your first call of the method would move the cursor to 0 10.

The next iteration raises textPositionY to 20. You are already at 0 10 but move the cursor by 20, so you are at 0 30.

3rd: 0 60, 4th: 0 100, 5th: 0 150

If you want equal distances then don't increase textPositionY and maybe rename the variable to lineGap as it is not a position.
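The arithmetic can be checked with a few lines of plain Java. This is just an illustration of how relative offsets accumulate, using a made-up helper name; it does not call PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeMoveDemo {
    /** Applies each offset relative to the previous position, as relative text moves do. */
    static List<Integer> absoluteY(int[] offsets) {
        List<Integer> positions = new ArrayList<>();
        int y = 0; // assumed starting position 0
        for (int dy : offsets) {
            y += dy;
            positions.add(y);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Growing offset, as in the question: the gaps get larger each time.
        System.out.println(absoluteY(new int[] {10, 20, 30, 40, 50})); // [10, 30, 60, 100, 150]
        // Constant offset (a line gap): evenly spaced lines.
        System.out.println(absoluteY(new int[] {10, 10, 10, 10, 10})); // [10, 20, 30, 40, 50]
    }
}
```

So the growth is quadratic rather than exponential, and passing the same offset on every iteration gives evenly spaced lines.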

Question:

I'm trying to extract text from a PDF using Apache PDFBox 1.8.4. My code is below:

public static void main(String[] args) throws Exception {

        PDDocument pdfDocument = PDDocument.load(new File("rep.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String s =  stripper.getText(pdfDocument);
        System.out.println(s);
        pdfDocument.close();
    }

The PDF I want to convert: https://www.dropbox.com/s/t35rr23v4383yvt/Form-V-report.pdf?dl=0

but I got these characters:

!"#$%&'()*$+,)!'-,./+/
0+12)3$#'(,,)451#+('1)65+7(,+'(/
!"#$%&'(
)*+,-.##(',/$.0
123.4.5,67,,89:;+
<3$'(=,>:++?,*99%@AB)

Any solutions?

In Advance - Thanks.


Answer:

Adobe has integrated PDF obfuscation which can be enabled by the creator of PDFs. I can't recall exactly how it works, but you will find similar issues if you use any of the online PDF text-extraction tools, or if you try and copy and paste the text.

You likely need to either:

A) Ask for a copy without this enabled

or

B) Reverse engineer how it is done, and use that knowledge to reverse it.

I have a feeling A is the right answer

Question:


Answer:

Instead of

PDPageContentStream contentStream = new PDPageContentStream(document, page);

use

PDPageContentStream contentStream = new PDPageContentStream(document, page, AppendMode.APPEND, true, true);

This way your new content stream is appended.

However, I expect another problem: you may want the "highlight" to be transparent. Have a look at this answer.

Question:

I have a program that creates text fields inside a PDF file so it can be used as a form. I would like the text I enter in those text fields to be centered, though. How is that possible? My code currently looks like this:

PDTextField textBox = new PDTextField(acroForm);
textBox.setPartialName("Field " + j + " " + i);
defaultAppearanceString = "/Helv 8 Tf 0 g"; //Textsize: 8
textBox.setDefaultAppearance(defaultAppearanceString);
acroForm.getFields().add(textBox);

PDAnnotationWidget widget = textBox.getWidgets().get(0);
PDRectangle rect = new PDRectangle(inputField.getX(), inputField.getY(), inputField.getWidth(), inputField.getHeight());
widget.setRectangle(rect);
widget.setPage(page);
widget.setPrinted(true);
page.getAnnotations().add(widget);

and I thought of an easy function to align text like this:

textBox.setAlignment(Alignment.CENTER);

but I didn't find it.


Answer:

Use the Q flag:

textBox.setQ(PDTextField.QUADDING_CENTERED);

Other possible values are QUADDING_RIGHT and QUADDING_LEFT (which is the default).
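For reference, the Q (quadding) entry is a plain integer: 0 is left-justified, 1 is centered and 2 is right-justified. The sketch below only illustrates what each value means for the horizontal start of the text; the class and method names are invented, and it ignores padding and borders, which a real viewer would take into account.

```java
final class QuaddingDemo {
    // Values of the Q entry.
    static final int LEFT = 0;
    static final int CENTERED = 1;
    static final int RIGHT = 2;

    /** Horizontal start of the text inside a field of the given width (no padding). */
    static float startX(int quadding, float boxWidth, float textWidth) {
        switch (quadding) {
            case CENTERED:
                return (boxWidth - textWidth) / 2f;
            case RIGHT:
                return boxWidth - textWidth;
            default:
                return 0f;
        }
    }

    public static void main(String[] args) {
        System.out.println(startX(CENTERED, 100f, 40f)); // 30.0
    }
}
```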

Question:

I'm trying to use PDFBox 2.0 for text extraction. I would like to get information on the font size of specific characters and the position rectangle of that character on the page. I've implemented this in PDFBox 1.6 using a PDFTextStripper:

    PDFParser parser = new PDFParser(is);
    try{
        parser.parse();
    }catch(IOException e){

    }
    COSDocument cosDoc = parser.getDocument();
    PDDocument pdd = new PDDocument(cosDoc);
    final StringBuffer extractedText = new StringBuffer();
    PDFTextStripper textStripper = new PDFTextStripper(){
        @Override
        protected void processTextPosition(TextPosition text) {
            extractedText.append(text.getCharacter());
            logger.debug("text position: "+text.toString());
        }
    };
    textStripper.setSuppressDuplicateOverlappingText(false);
    for(int pageNum = 0;pageNum<pdd.getNumberOfPages();pageNum++){
        PDPage page = (PDPage) pdd.getDocumentCatalog().getAllPages().get(pageNum);
        textStripper.processStream(page, page.findResources(), page.getContents().getStream());
    }
    pdd.close();

But in the 2.0 version of PDFBox, the processStream method has been removed. How can I achieve the same with PDFBox 2.0?

I've tried the following:

        PDDocument pdd = PDDocument.load(inputStream);
        PDFTextStripper textStripper = new PDFTextStripper(){
            @Override
            protected void processTextPosition(TextPosition text){
                int pos = PDFdocument.length();
                String textadded = text.getUnicode();
                Range range = new Range(pos,pos+textadded.length());
                int pagenr = this.getCurrentPageNo();
                Rectangle2D rect = new Rectangle2D.Float(text.getX(),text.getY(),text.getWidth(),text.getHeight());
            }
        };
        textStripper.setSuppressDuplicateOverlappingText(false);
        for(int pageNum = 0;pageNum<pdd.getNumberOfPages();pageNum++){
            PDPage page = (PDPage) pdd.getDocumentCatalog().getPages().get(pageNum);
            textStripper.processPage(page);
        }
        pdd.close();

The processTextPosition(TextPosition text) method does not get called. Any suggestions would be very welcome.


Answer:

The DrawPrintTextLocations example, suggested by @tilmanhausherr, provided the solution to my problem.

The parser is started using the following code (the inputStream is the input stream from the URL of the PDF file):

    PDDocument pdd = null;
    try {
        pdd = PDDocument.load(inputStream);
        PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
        stripper.setSortByPosition(true);
        for (int i=0;i<pdd.getNumberOfPages();i++){
            stripper.stripPage(i);
        }
    } catch (IOException e) {
        // throw error
    } finally {
        if (pdd!=null) {
            try {
                pdd.close();
            } catch (IOException e) {

            }
        }
    }

This code uses a custom subclass of PDFTextStripper:

class PDFParserTextStripper extends PDFTextStripper {

    // The document to strip; kept here because the inherited document field is only set inside writeText.
    private final PDDocument pdd;

    public PDFParserTextStripper(PDDocument pdd) throws IOException {
        super();
        this.pdd = pdd;
    }


    public void stripPage(int pageNr) throws IOException {
        this.setStartPage(pageNr+1);
        this.setEndPage(pageNr+1);
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(pdd,dummy); // This call starts the parsing process and calls writeString repeatedly.
    }



    @Override
    protected void writeString(String string,List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj()+","+text.getYDirAdj()+" fs="+text.getFontSizeInPt()+" xscale="+text.getXScale()+" height="+text.getHeightDir()+" space="+text.getWidthOfSpace()+" width="+text.getWidthDirAdj()+" ] "+text.getUnicode());
        }
    }

}

Question:

import java.io.IOException;

import javax.swing.text.BadLocationException;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAnnotationAdditionalActions;
import org.apache.pdfbox.pdmodel.interactive.action.type.PDActionJavaScript;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDTextbox;
import org.junit.Test;

public class TestPDTextbox {
    @Test
    public void Sample1 () throws IOException, COSVisitorException, BadLocationException {


        PDDocument doc = new PDDocument();
        PDPage page = new PDPage();
        doc.addPage(page);   

        COSDictionary acroFormDict = new COSDictionary(); 
//        acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), true);
        acroFormDict.setItem(COSName.getPDFName("Fields"), new COSArray());

        PDAcroForm acroForm = new PDAcroForm(doc, acroFormDict);
        doc.getDocumentCatalog().setAcroForm(acroForm);

        COSDictionary cosDict1 = new COSDictionary();
        COSArray rect1 = new COSArray();
        rect1.add(new COSFloat(100));
        rect1.add(new COSFloat(700));
        rect1.add(new COSFloat(200));
        rect1.add(new COSFloat(750));

        cosDict1.setItem(COSName.RECT, rect1);
        cosDict1.setItem(COSName.FT, COSName.getPDFName("Tx")); // Field Type
        cosDict1.setItem(COSName.TYPE, COSName.ANNOT);
        cosDict1.setItem(COSName.SUBTYPE, COSName.getPDFName("Widget"));
        cosDict1.setItem(COSName.T, new COSString("tx1"));
        cosDict1.setItem(COSName.DA, new COSString("/Helv 7 Tf 0 g"));
        cosDict1.setItem(COSName.V, new COSString("Test Value1"));

        PDTextbox textbox = new PDTextbox(doc.getDocumentCatalog().getAcroForm(), cosDict1);

//      textbox.setValue("Test Value");

        page.getAnnotations().add(textbox.getWidget());
        acroForm.getFields().add(textbox);

         doc.save("C:\\PDF\\SampleTextbox.pdf");
         doc.close();
    }
}

Issue#1 I have created one text field as shown in the above code and tried to set its value using the textbox.setValue("Test Value"); method, but it gives the error shown below:

java.io.IOException: Error: Don't know how to calculate the position for non-simple fonts
    at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.getTextPosition(PDAppearance.java:1037)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.insertGeneratedAppearance(PDAppearance.java:558)
    at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.setAppearanceValue(PDAppearance.java:338)
    at org.apache.pdfbox.pdmodel.interactive.form.PDVariableText.setValue(PDVariableText.java:131)
    at sample.pdfbox.example.TestPDTextbox.Sample1(TestPDTextbox.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Issue#2

To work around issue#1, I set the value of the text box through the COS dictionary instead, i.e. cosDict1.setItem(COSName.V, new COSString("Test Value1"));

Then, in Adobe Reader, the value of the text box is not displayed properly. I have to click on the text box for the value to appear, and once I leave the field, the value becomes invisible again.

Issue#3

To fix issue#2, I need to set the NeedAppearances flag to true as shown below; after that the field value is displayed properly in the PDF. But with this solution, I'm not able to extract the PDF fields again once the user has changed a field value and we parse the PDF again.

Note: this issue shows up with Adobe Reader, which also displays a message about fixing up form fields when opening the PDF. Once I save the PDF and try to parse the AcroForm fields, all fields are found to be reset or null. None of the field names or field values can be extracted.

So using acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), true); in code seems risky: it creates other issues in PDF parsing, so it cannot be used.

COSDictionary acroFormDict = new COSDictionary(); 
        acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), true);
        acroFormDict.setItem(COSName.getPDFName("Fields"), new COSArray());

        PDAcroForm acroForm = new PDAcroForm(doc, acroFormDict);
        doc.getDocumentCatalog().setAcroForm(acroForm);

I think I need to set a PDAppearanceDictionary for the text fields, but I don't know how to do that, or whether I need to set it for each field or at the AcroForm level.

Please help me resolve this issue. I'm using PDFBox version 1.8.10.


Answer:

I fixed Issue#1 from the question above by adding default resources to the AcroForm and using a proper default appearance string for the text. Now I don't even need to set the NeedAppearances flag to true.

        PDFont font = PDType1Font.HELVETICA;
        PDResources res = new PDResources();
        String fontName = res.addFont(font);
        String defaultAppearance = "/"+fontName+" 7 Tf 0 g";

        COSDictionary acroFormDict = new COSDictionary(); 
        acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), false);
        acroFormDict.setItem(COSName.getPDFName("Fields"), new COSArray());
        acroFormDict.setItem(COSName.DA, new COSString(defaultAppearance));

        PDAcroForm acroForm = new PDAcroForm(doc, acroFormDict);
        acroForm.setDefaultResources(res);

Check entire corrected code below:

import java.io.IOException;

import javax.swing.text.BadLocationException;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDTextbox;
import org.junit.Test;

public class TestPDTextbox {
    @Test
    public void Sample1 () throws IOException, COSVisitorException, BadLocationException {


        PDDocument doc = new PDDocument();
        PDPage page = new PDPage();
        doc.addPage(page);   

        PDFont font = PDType1Font.HELVETICA;
        PDResources res = new PDResources();
        String fontName = res.addFont(font);
        String defaultAppearance = "/"+fontName+" 7 Tf 0 g";

        COSDictionary acroFormDict = new COSDictionary(); 
        acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), false);
        acroFormDict.setItem(COSName.getPDFName("Fields"), new COSArray());
        acroFormDict.setItem(COSName.DA, new COSString(defaultAppearance));

        PDAcroForm acroForm = new PDAcroForm(doc, acroFormDict);
        acroForm.setDefaultResources(res);

        doc.getDocumentCatalog().setAcroForm(acroForm);

        COSDictionary cosDict1 = new COSDictionary();
        COSArray rect1 = new COSArray();
        rect1.add(new COSFloat(100));
        rect1.add(new COSFloat(700));
        rect1.add(new COSFloat(200));
        rect1.add(new COSFloat(750));

        cosDict1.setItem(COSName.RECT, rect1);
        cosDict1.setItem(COSName.FT, COSName.getPDFName("Tx")); // Field Type
        cosDict1.setItem(COSName.TYPE, COSName.ANNOT);
        cosDict1.setItem(COSName.SUBTYPE, COSName.getPDFName("Widget"));
        cosDict1.setItem(COSName.T, new COSString("tx1"));
        cosDict1.setItem(COSName.DA, new COSString(defaultAppearance));
//        cosDict1.setItem(COSName.V, new COSString("Test Value1"));

        PDTextbox textbox = new PDTextbox(doc.getDocumentCatalog().getAcroForm(), cosDict1);

      textbox.setValue("Test Value");

        page.getAnnotations().add(textbox.getWidget());
        acroForm.getFields().add(textbox);

         doc.save("C:\\PDF\\SampleTextbox.pdf");
         doc.close();
    }
}

Question:

I have a text to draw on pdf, something like 500 €/hour. I am using PdfBox-Android library.

I was trying to write above string as follows,

pageContentStream.drawString("500 " + Html.fromHtml(getString(R.string.euro)).toString() + "/hour");

where euro is defined in strings.xml as

<string name="euro">(&#8364;)</string>
<string name="pound">(&#163;)</string>

With the above code PdfBox-Android is writing some gibberish characters.

I found one solution here to write using pdfbox, which is working perfectly.

My question is how to write the text next to the sign in one go?

Do I need to write first, then move text next to it and then write remaining text? I don't feel that would be a correct solution.


Answer:

You need to pass the HTML code of the currency symbol

Html.fromHtml((String) currency_symbol).toString()

Html.fromHtml((String) "&#8364;").toString() //for euro

Html.fromHtml((String) "&#163;").toString()  //for pound
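What these calls produce is simply the Unicode character behind the numeric reference (U+20AC for the euro sign, U+00A3 for the pound sign), so the whole string can be assembled first and drawn in one call. As a rough plain-Java stand-in for Html.fromHtml, limited to decimal numeric references and using a hypothetical helper name, the decoding could look like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDemo {
    private static final Pattern DECIMAL_REFERENCE = Pattern.compile("&#(\\d+);");

    /** Replaces decimal references such as &#8364; with the character they denote. */
    static String decode(String input) {
        Matcher m = DECIMAL_REFERENCE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // One string containing amount, symbol and unit.
        System.out.println(decode("500 &#8364;/hour"));
    }
}
```

Whether the symbol then shows up correctly in the PDF still depends on the font and encoding used for drawing, which this sketch does not address.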

Question:

I've started to work with PDType0Font recently (we used PDType1Font.HELVETICA but needed Unicode support) and I'm facing an error where I'm adding lines to the file using PDPageContentStream but PDFTextStripper.getText doesn't get the updated file contents.

I'm loading the font:

PDType0Font.load(document, fontFile)

And creating the contentStream as follows:

PDPageContentStream(document, pdPage, PDPageContentStream.AppendMode.PREPEND, false)

my function that adds content to the pdf is:

  private fun addTextToContents(contentStream: PDPageContentStream, txtLines: List<String>, x: Float, y: Float, pdfFont: PDFont, fontSize: Float, maxWidth: Float) {
     contentStream.beginText()
     contentStream.setFont(pdfFont, fontSize)
     contentStream.newLineAtOffset(x, y)
     txtLines.forEach { txt ->
       contentStream.showText(txt)
       contentStream.newLineAtOffset(0.0F, -fontSize)
     }
     contentStream.endText()
     contentStream.close()
  }

When I try to read the content of the file using PDFTextStripper.getText, I get the file content from before the changes. However, if I call document.save before passing the document to PDFTextStripper, it works.

      val txt: String = PDFTextStripper().getText(doc) //not working

      doc.save(//File)
      val txt: String = PDFTextStripper().getText(doc) //working

if I'm using PDType1Font.HELVETICA in

contentStream.setFont(pdfFont, fontSize)

Everything is working without any problems and without saving the doc before reading the text.

I'm suspecting that the issue is with the code in PDPageContentStream.showTextInternal():

        // Unicode code points to keep when subsetting
    if (font.willBeSubset())
    {
        int offset = 0;
        while (offset < text.length())
        {
            int codePoint = text.codePointAt(offset);
            font.addToSubset(codePoint);
            offset += Character.charCount(codePoint);
        }
    }

This is the only thing that is not the same when using PDType0Font with embedsubsets and PDType1Font.

Can someone help with this? What am I doing wrong?


Answer:

Your question, in particular the quoted code, already hints at the answer to your question:

When using a font that will be subset (font.willBeSubset() == true), the associated PDF objects are unfinished until the file is saved. Text extraction on the other hand needs the finished PDF objects to properly work. Thus, don't apply text extraction to a document that is still being created and uses fonts that will be subset.

You describe your use case as

for our unit tests, we are adding text (mandatory text for us) to the document and then using PDFTextStripper we are validating that the file has the proper fields.

As Tilman proposes: Then it would make more sense to save the PDF, and then to reload. That would be a more realistic test. Not saving is cutting corners IMHO.

Indeed, in unit tests you should first produce the final PDF as it will be sent out (i.e. saving it, either to the file system or to memory), then reload that file, and test only this reloaded document.

Question:

I know there are many ways to make text bold in a PDF; the most common is a "bold" keyword in TextPosition.getFont(). However, in the attached document I was not able to find out why the bold text is actually bold:

capture of pdf

Visually it looks like each character is drawn twice; however, I don't see that in TextStripper.writeString. Is there anything else that can make text bold? Thanks in advance!


Answer:

Your file uses text rendering mode 2 (fill and stroke, aka RenderingMode.FILL_STROKE). This simulates bold. You can get the current mode by calling getGraphicsState().getTextState().getRenderingMode() in a class that extends the stripper.
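For orientation, the PDF specification defines eight text rendering modes, numbered 0 to 7. The enum below is a stand-alone sketch that mirrors that numbering; it is not PDFBox's own RenderingMode class, and the helper method is an invented name. Treating mode 6 like mode 2 is my own assumption, since it also fills and strokes the glyphs.

```java
enum TextRenderingMode {
    FILL(0),              // normal text
    STROKE(1),            // outline only
    FILL_STROKE(2),       // fill, then stroke: looks bolder
    NEITHER(3),           // invisible text
    FILL_CLIP(4),
    STROKE_CLIP(5),
    FILL_STROKE_CLIP(6),
    NEITHER_CLIP(7);

    final int code;

    TextRenderingMode(int code) {
        this.code = code;
    }

    /** Looks up the mode for the operand of the Tr operator (0 to 7). */
    static TextRenderingMode fromCode(int code) {
        return values()[code];
    }

    /** True for the modes that both fill and stroke the glyph outlines. */
    boolean fillsAndStrokes() {
        return this == FILL_STROKE || this == FILL_STROKE_CLIP;
    }
}
```

A stripper subclass could map the mode it reads from the graphics state onto such a check to flag "fake bold" text.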

Question:

I'm trying to convert a pdf to png file using pdfbox. Unfortunately in the result I get weird red areas in some places of the output. I'm not sure what's the problem. It's a problem with only some of the pdf files.

Here's some of the code that I'm using:

    public static BufferedImage generateFromPdf(String ref, InputStream stream, int pageIndex, PreviewMode mode) throws IOException {
        PDDocument doc = null;
        try (InputStream buffered = new BufferedInputStream(stream)) {
            doc = PDDocument.load(buffered, PDF_LOADING_MEMORY_SETTING);
            if (pageIndex > doc.getNumberOfPages()) {
                return null;
            }
            PDFRenderer renderer = new PDFRenderer(doc);
            return rasterizePdfBox(ref, pageIndex, renderer, mode);
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    }

and then:

    private static BufferedImage rasterizePdfBox(String ref, int pageIndex, PDFRenderer renderer, PreviewMode mode) throws IOException {
        Future<BufferedImage> result = executorService.submit(() -> {
            LOGGER.info(String.format("Generate preview for ref: %s, page: %s, mode: %s ", ref, pageIndex, mode.name()));
            return renderer.renderImageWithDPI(pageIndex - 1, mode.getDpi(), ImageType.RGB);
        });

        try {
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            LOGGER.error(String.format("Error when generating preview: %s", e.getMessage()));
            Thread.currentThread().interrupt();
            throw new IOException(e.getMessage());
        }
    }

So far I've only figured out that the places which are red in the output are blank when I open them in Master PDF editor on linux. They seem normal though when I open them with Document Viewer.

Some hints:
- The PDFs with problems have been scanned. I can select text around the working parts but not at the places that have the red overlay over them. Maybe it has something to do with OCR issues?
- If I use the Linux tool convert (convert not-working-pdf.pdf converted.pdf) and then try to convert this file to PNG, the issue is not there anymore.

Here's an example file: https://ufile.io/3or9l

pdfbox version: 2.0.13


Answer:

This was a PDFBox bug and the cause was a bitonal image with a mask, which is unusual. There is only one color element in the raster so only "R" is applied instead of all 3 of the RGB destination. Because of that, white appeared as red.
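Why a white sample turns red is easy to see with packed RGB values. The following is only an illustration of the symptom in plain Java with invented method names, not the PDFBox code involved: a single-band sample value written to the red channel alone versus copied to all three channels.

```java
final class GrayToRgbDemo {
    /** Faulty expansion: the single sample ends up in the red channel only. */
    static int onlyRed(int sample) {
        return 0xFF000000 | (sample << 16);
    }

    /** Expected expansion: the sample is copied to red, green and blue. */
    static int allChannels(int sample) {
        return 0xFF000000 | (sample << 16) | (sample << 8) | sample;
    }

    public static void main(String[] args) {
        int white = 255;
        System.out.println(Integer.toHexString(onlyRed(white)));     // ffff0000 (opaque red)
        System.out.println(Integer.toHexString(allChannels(white))); // ffffffff (opaque white)
    }
}
```

Black (sample 0) is the same in both cases, which fits the observation that only the areas that should be white were affected.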

More details about this bug can be found in issue PDFBOX-4470; it will be fixed in release 2.0.14. Until then, you can work with a snapshot.

Question:

Requirement (see the attached screenshot): I have to write 2 lines of text in the PDF, then draw a line, and then start writing text again.

Accordingly, my algorithm goes by:

contentStream = new PDPageContentStream(doc, page);
contentStream.setFont(font, fontSize);
contentStream.beginText();

Created a new PDPageContentStream and triggered the function beginText(). I am able to write the upper text portion as displayed in the image attached.

The lines of code for the upper text are given below:

        contentStream.showText("Entry Form – Header");
        yCordinate -= fontHeight;  //This line is to track the yCordinate
        contentStream.newLineAtOffset(0, -leading);
        yCordinate -= leading;
        contentStream.showText("Date Generated: " + dateFormat.format(date));
        yCordinate -= fontHeight;
        contentStream.newLineAtOffset(0, -leading);
        yCordinate -= leading;
        contentStream.endText(); // End of text mode

I had to end this text mode because the below 3 lines of code (which draws a line) won't execute in text mode:

            contentStream.moveTo(startX, yCordinate);
            contentStream.lineTo(endX, yCordinate);
            contentStream.stroke();        

Now, after this line of code, if I write:

contentStream.beginText();
contentStream.showText("Name: XXXXX");

The name is displayed at the bottom left corner of the page. I want this line to come right after the drawn line, as displayed in the image below.

Any help will be appreciated.


Answer:

Unfortunately the code in the question is rather incomplete: in particular, it does not show the initialization of the text matrix in each text object, and it has many undefined variables.

Thus, here a piece of code as an example that results in a text - line - text output:

PDFont font = PDType1Font.HELVETICA;
float fontSize = 14;
float fontHeight = fontSize;
float leading = 20;
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd");
Date date = new Date();

PDDocument doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.setFont(font, fontSize);

float yCordinate = page.getCropBox().getUpperRightY() - 30;
float startX = page.getCropBox().getLowerLeftX() + 30;
float endX = page.getCropBox().getUpperRightX() - 30;

contentStream.beginText();
contentStream.newLineAtOffset(startX, yCordinate);
contentStream.showText("Entry Form – Header");
yCordinate -= fontHeight;  //This line is to track the yCordinate
contentStream.newLineAtOffset(0, -leading);
yCordinate -= leading;
contentStream.showText("Date Generated: " + dateFormat.format(date));
yCordinate -= fontHeight;
contentStream.endText(); // End of text mode

contentStream.moveTo(startX, yCordinate);
contentStream.lineTo(endX, yCordinate);
contentStream.stroke();
yCordinate -= leading;

contentStream.beginText();
contentStream.newLineAtOffset(startX, yCordinate);
contentStream.showText("Name: XXXXX");
contentStream.endText();

contentStream.close();
doc.save("textLineText.pdf");

(TextAndGraphics.java test testDrawTextLineText)

This code results in:

If you want different distances, you'll have to adapt the yCordinate -= ... lines before and after the drawing of the graphical line.
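To see which coordinates the bookkeeping produces, the yCordinate updates can be replayed in isolation. The sketch below uses a hypothetical helper and assumes the default PDPage size of US Letter (crop box height 792), together with the margin of 30, font height of 14 and leading of 20 from the sample code.

```java
final class TextLineTextLayout {
    /**
     * Replays the yCordinate updates of the sample and returns
     * { first text baseline, y of the drawn line, baseline of the text after the line }.
     */
    static float[] positions(float pageTop, float margin, float fontHeight, float leading) {
        float y = pageTop - margin;
        float firstBaseline = y;   // "Entry Form" header
        y -= fontHeight;
        y -= leading;              // move to the "Date Generated" line
        y -= fontHeight;
        float lineY = y;           // where moveTo/lineTo draw the rule
        y -= leading;
        float nextBaseline = y;    // "Name: XXXXX"
        return new float[] {firstBaseline, lineY, nextBaseline};
    }

    public static void main(String[] args) {
        float[] p = positions(792f, 30f, 14f, 20f);
        System.out.println(p[0] + " " + p[1] + " " + p[2]); // 762.0 714.0 694.0
    }
}
```

Changing the amounts subtracted before or after the line moves the rule or the following text accordingly.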

Question:

Link to pdf

When I try to extract the text from the pdf above, I get a mixture of text that was invisible in the evince viewer as well as text that was visible. In addition, some of the desired text is missing characters that were not missing in the viewer, such as, the 'S' in 'FALCONS' and the many missing '½' characters. I believe this is due to interference from the invisible text because when highlighting the pdf in the viewer, the invisible text can be seen overlapping visible text.

Is there a way to remove the invisible text? Or is there another solution?

Code:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;


public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text =  textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }

        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;

        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        System.out.println(text);
    }
}

Output (bold text is the desired text):

145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as shown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as hown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEEL RS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
14
160
41
15715 156154150 153149 152148 51147
142
158
50
146
S lections
Number of Teams
Amount Bet

ark box as sho n 
 denotes home team
PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:0 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:0 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:0 8 CARDINALS17– ½ 3+ ½
9 BU CANEERS  PM1:0 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:0 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:0 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:0 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:0 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO F OTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,0
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
  FALCON      - 9  1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14  1:00p 18 BEARS + ½
 19 PACKERS - 12  1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9  10:20p 32 49ERS  - 4½

Answer:

The invisible text in the OP's sample PDF is mostly made invisible by defining clip paths (with the text outside their bounds) and by filling paths (hiding the text underneath). Thus, we have to consider path-related instructions during text extraction to ignore that invisible text.

Unfortunately, callbacks designed for these instructions are not declared in PDFTextStripper or its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

But they are declared in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.

To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, e.g. like this:

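// Imports for PDFBox 2.0.x. The operator classes (MoveTo, LineTo, ...) are the
// inner classes defined below, so the PDFBox classes of the same name from
// org.apache.pdfbox.contentstream.operator.graphics must NOT be imported.
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {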
    public PDFVisibleTextStripper() throws IOException {
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    private GeneralPath linePath = new GeneralPath();

    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                System.out.println(toRemove.size());
                list.removeAll(toRemove);
            }
        }
    }

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);

            float x1 = x.floatValue();
            float y1 = y.floatValue();

            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;

            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);

            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;

            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D currentPoint = linePath.getCurrentPoint();

            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}

(PDFVisibleTextStripper)

Please make sure you use the inner operator classes in the PDFVisibleTextStripper constructor, not the classes used by PageDrawer with the same name. To make sure simply follow the link under the code.

This reduces the output to

REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
 3 FALCONS     - 9½ 1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14½ 1:00p 18 BEARS + ½
 19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9½ 10:20p 32 49ERS  - 4½

which drops most of the unwanted data.


In the context of another question it became apparent that the way processTextPosition and deleteCharsInPath calculate the end of a character baseline implicitly assumes horizontal text without page rotation. If one loosens the criterion for visibility, though, and considers a character visible if and only if the start of its baseline is visible, the calculated Vector end is no longer needed and the code also works for rotated pages.


In the context of yet another question it became apparent that glyph origin coordinates lying exactly on the clip path border can end up outside the clip path due to floating-point calculation errors. Switching to "fat point" coordinate checks turned out to be an acceptable workaround.
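The idea of a "fat point" check can be illustrated with java.awt.geom alone: instead of asking whether the clip area contains the exact glyph origin, one asks whether it overlaps a small square centered on that origin. This is only a sketch; the class name FatPointCheck, the method isVisible, and the tolerance of 0.5 units are assumptions and not taken from the answer.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

final class FatPointCheck {
    /** True if a square with half-width tolerance around (x, y) overlaps the clip area. */
    static boolean isVisible(Area clip, double x, double y, double tolerance) {
        if (clip == null) {
            return true; // no clip path known: treat the point as visible
        }
        return clip.intersects(x - tolerance, y - tolerance, 2 * tolerance, 2 * tolerance);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(100, 100, 200, 50));
        // A glyph origin nominally on the left clip border, pushed out by a rounding error.
        double x = 100 - 1e-4;
        double y = 120;
        System.out.println(clip.contains(x, y));        // prints false (exact check)
        System.out.println(isVisible(clip, x, y, 0.5)); // prints true (fat point check)
    }
}
```

In processTextPosition, such a check would take the place of the area.contains(...) calls; the tolerance trades a few falsely accepted glyphs just outside the border for robustness against rounding.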

Question:

I have got a byte array of a pdf file and want to get the text out of the file. My code below works, but I need to create an actual file first. Do you know a better way, so I don't have to create this file first?

try {
  File temp = File.createTempFile("temp-pdf", ".tmp");
  OutputStream out = new FileOutputStream(temp);
  out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
  out.close();
  PDDocument document = PDDocument.load(temp);
  PDFTextStripper pdfStripper = new PDFTextStripper();
  String text = pdfStripper.getText(document);
  log.info(text);
} catch(IOException e){

}

Answer:

The answer depends on the version of PDFBox you use.

PDFBox 2.0.x

Whenever you have a byte[] (you appear to get one from Base64.decodeBase64), you can load it directly:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
PDDocument document = PDDocument.load(documentBytes);

PDFBox 1.8.x

Whenever you have a byte[], you can load it via a ByteArrayInputStream:

byte[] documentBytes = Base64.decodeBase64(testObject.getPdfAsDoc().getContent());
InputStream documentStream = new ByteArrayInputStream(documentBytes);
PDDocument document = PDDocument.load(documentStream);

As an aside: when working with PDFBox 1.8.x, you should use a loadNonSeq overload instead of load, because load does not read a PDF the way the specification prescribes and can therefore be fooled into reading it with wrong contents. For broken PDFs, though, you may still try load as a fallback.

Question:

I'm attempting to perform some string validation against individual PDF pages in a file via the use of Apache PDFBox.

I'm going to use PDFTextStripper for most of this, so the first issue to tackle was that all the PDFs I'm going to validate are generated as 2-up, i.e. page 1 of 2 and page 2 of 2 are on the same page, as if you had scanned an open book face down. In addition, they were oriented incorrectly and needed rotating by 90 degrees so PDFTextStripper could read them properly.

Using elements of the questions/solutions below, I have built a method which first crops the page exactly in half, exports the cropped pages in order to a new file, rotates each page to the correct orientation, and then saves the file:

Rotate PDF around its center using PDFBox in java

Split a PDF page in two parts [duplicate]

Visually, my method seems to work as expected until I run PDFTextStripper against the result: it appears to return the text of not just the page I want, but also of the page I cropped out of it.

To confirm the issue, I extracted a single page out of the entire document and saved it as a new file. When running PDFTextStripper, I still get the same results even though all I can see is literally one page. Adobe search doesn't bring up the hidden, legacy data either.

I can only assume that during my transform method I need to redefine the cropped page with only the contents of the cropped page.

My question is, how can I do this?

P.S. I haven't posted my code as it's basically an amalgamation of the solutions provided in the links above; if it is needed, I can provide it.


Answer:

The PDFTextStripper ignores the CropBox you set to crop the pages. It also ignores whether text is covered by some filled rectangle or image, or whether the text is invisible; it extracts all text (except text in patterns or contained in Type 3 font characters).

You might want to try the PDFTextStripperByArea instead. This class (which is derived from PDFTextStripper) restricts itself to regions you can define.

(Unfortunately these regions have to be defined using a different coordinate system than the one used for the CropBox, so usually you will have to transform the coordinates first.)
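As a rough illustration of such a coordinate transformation, the following sketch flips the y axis in the same way the PrintURLs sample does for link rectangles (y = page height minus upper right y). It assumes an unrotated page whose page box has its lower left corner at the origin; rotated pages and other box origins need additional handling. The class and method names are hypothetical.

```java
import java.awt.geom.Rectangle2D;

final class RegionCoordinates {
    /**
     * Converts a box given in PDF user space (origin bottom-left, y growing upward)
     * into a rectangle with a top-left origin and y growing downward.
     */
    static Rectangle2D.Float toRegion(float llx, float lly, float urx, float ury, float pageHeight) {
        float width = urx - llx;
        float height = ury - lly;
        return new Rectangle2D.Float(llx, pageHeight - ury, width, height);
    }

    public static void main(String[] args) {
        // Left half of a landscape 2-up page that is 792 units wide and 612 units high.
        Rectangle2D.Float leftHalf = toRegion(0, 0, 396, 612, 612);
        System.out.println(leftHalf);
    }
}
```

The resulting rectangle is the kind of value one would hand to PDFTextStripperByArea.addRegion before calling extractRegions on the page.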

Question:

I am parsing PDF documents with Apache PDFBox version 2.0.x. I have seen many questions about separating the header/footer from the actual content. My finding is that there is some notion (in my sample PDF) of these sections, since the SortByPosition flag has an effect on the order in which the content is written to text. When I set SortByPosition to false, I get first the header/footer and next the body (and this repeats for every page). When I set SortByPosition to true, I get the content in the order in which it appears on the screen in my PDF reader.

PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true);

String content = textStripper.getText(pdf);
System.out.println(content);

So internally these pieces of text are available as separate "text blocks". My question is: is there a way for me to access those blocks separately?

Below is the output of this piece of code with the sort flag set to true:

Header PDF MIC

Vandaag meer dan 1 pagina Door mij geschreven

1

Header PDF MIC

Dan is dit pagina 2 Met veel meer teksten en woorden.

2

And this is the output with the sort flag set to false:

Header PDF MIC

1

Vandaag meer dan 1 pagina

Door mij geschreven

Header PDF MIC

2

Dan is dit pagina 2

Met veel meer teksten en woorden.


Answer:

Thanks to @mkl for the hints and tips, here is what I did:

The following piece of Java code uses the PDFMarkedContentExtractor. It only processes the first page, but the same can be applied to all pages. Below the code you can find the output of the System.out log.

ExtraMetaData emd = new ExtraMetaData(); // this is my own class to carry the header and footer

PDFMarkedContentExtractor markedContentExtractor = new PDFMarkedContentExtractor();
markedContentExtractor.processPage(document.getPage(0));
List<PDMarkedContent> markedContents = markedContentExtractor.getMarkedContents();
for (PDMarkedContent pdMarkedContent : markedContents) {
    System.out.println(pdMarkedContent.getTag() + " --> " + pdMarkedContent.getContents() + " " + pdMarkedContent.getProperties());
    COSDictionary pdmcProperties = pdMarkedContent.getProperties();
    // artifacts tagged with an explicit Subtype of Header or Footer
    if (pdmcProperties.containsKey("Subtype")) {
        COSBase cosBase = pdmcProperties.getDictionaryObject("Subtype");
        if (((COSName) cosBase).getName().equalsIgnoreCase("Footer")) {
            emd.setFooter(getContentAsString(pdMarkedContent));
        }
        if (((COSName) cosBase).getName().equalsIgnoreCase("Header")) {
            emd.setHeader(getContentAsString(pdMarkedContent));
        }
    }
    // artifacts attached to the top or bottom edge of the page
    if (pdmcProperties.containsKey("Attached")) {
        COSArray cosArray = (COSArray) pdmcProperties.getDictionaryObject("Attached");
        for (COSBase cosBase2 : cosArray) {
            if (((COSName) cosBase2).getName().equalsIgnoreCase("Bottom")) {
                emd.setFooter(getContentAsString(pdMarkedContent));
            }
            if (((COSName) cosBase2).getName().equalsIgnoreCase("Top")) {
                emd.setHeader(getContentAsString(pdMarkedContent));
            }
        }
    }
}

Output

Artifact --> [-, , 1, , -, ] COSDictionary{COSName{Attached}:COSArray{[COSName{Top}]};COSName{Type}:COSName{Pagination};}

Artifact --> [ ] COSDictionary{COSName{Attached}:COSArray{[COSName{Bottom}]};COSName{Type}:COSName{Pagination};}

P --> [M, J, , -, , 2, 0, 1, 6, 2, 1, 7, 2, , /, , B, ] COSDictionary{COSName{MCID}:COSInt{0};}
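The helper getContentAsString used in the code is not shown in the answer. A minimal sketch of what it might look like follows, assuming that the string form of each content item is its text (as the printed lists in the output suggest). To keep the example runnable without PDFBox, this hypothetical version takes a plain list; with PDFBox one would pass pdMarkedContent.getContents().

```java
import java.util.Arrays;
import java.util.List;

final class MarkedContentText {
    /** Concatenates the string form of every content item into one string. */
    static String getContentAsString(List<?> contents) {
        StringBuilder sb = new StringBuilder();
        for (Object item : contents) {
            sb.append(item);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the items of the first artifact in the output above.
        List<String> items = Arrays.asList("-", " ", "1", " ", "-");
        System.out.println(getContentAsString(items)); // prints "- 1 -"
    }
}
```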

Question:

I'm trying to rotate text using PDFBox but I couldn't achieve it. I tried to set the text matrix, but my text is not rotating as intended.

Does someone have an idea how I could rotate my text by 90 degrees?

This is my code :

contentStream.beginText();

 float tx = titleWidth / 2;
 float ty = titleHeight / 2;

contentStream.setTextMatrix(Matrix.getTranslateInstance(tx, ty)); 
contentStream.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(90),tx,ty));
contentStream.setTextMatrix(Matrix.getTranslateInstance(-tx, -ty));

 contentStream.newLineAtOffset(xPos, yPos);

contentStream.setFont(font, fontSize);
contentStream.showText("Tets");
contentStream.endText();

Thank You


Answer:

Here's a solution that draws three pages: one with the text unrotated, one with the text rotated but keeping the coordinates as if planning landscape printing, and one that approximates what you wanted (rotation around the center of the text). My solution is close to that: it rotates around the bottom of the center of the text.

public static void main(String[] args) throws IOException
{
    PDDocument doc = new PDDocument();
    PDPage page1 = new PDPage();
    doc.addPage(page1);
    PDPage page2 = new PDPage();
    doc.addPage(page2);
    PDPage page3 = new PDPage();
    doc.addPage(page3);

    PDFont font = PDType1Font.HELVETICA;
    float fontSize = 20;
    int xPos = 100;
    int yPos = 400;
    float titleWidth = font.getStringWidth("Tets") / 1000;
    float titleHeight = fontSize;
    float tx = titleWidth / 2;
    float ty = titleHeight / 2;

    try (PDPageContentStream contentStream = new PDPageContentStream(doc, page1))
    {
        contentStream.beginText();

        contentStream.newLineAtOffset(xPos, yPos);

        contentStream.setFont(font, fontSize);
        contentStream.showText("Tets");
        contentStream.endText();
    }

    // classic case of rotated page
    try (PDPageContentStream contentStream = new PDPageContentStream(doc, page2))
    {
        contentStream.beginText();

        Matrix matrix = Matrix.getRotateInstance(Math.toRadians(90), 0, 0);
        matrix.translate(0, -page2.getMediaBox().getWidth());

        contentStream.setTextMatrix(matrix);

        contentStream.newLineAtOffset(xPos, yPos);

        contentStream.setFont(font, fontSize);
        contentStream.showText("Tets");
        contentStream.endText();
    }

    // rotation around text
    try (PDPageContentStream contentStream = new PDPageContentStream(doc, page3))
    {
        contentStream.beginText();

        Matrix matrix = Matrix.getRotateInstance(Math.toRadians(90), 0, 0);
        matrix.translate(0, -page3.getMediaBox().getWidth());

        contentStream.setTextMatrix(matrix);

        contentStream.newLineAtOffset(yPos - titleWidth / 2 - fontSize, page3.getMediaBox().getWidth() - xPos - titleWidth / 2 - fontSize);

        contentStream.setFont(font, fontSize);
        contentStream.showText("Tets");
        contentStream.endText();
    }
    doc.save("saved.pdf");
    doc.close();
}
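One reason the code in the question does not rotate the text is that each setTextMatrix call replaces the text matrix instead of concatenating to it, so only the last translation takes effect. A rotation around a point has to be combined into a single matrix first. The following sketch shows that composition with java.awt.geom.AffineTransform as a stand-in for the PDFBox Matrix class; the class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class RotateAroundPoint {
    /** Builds translate(cx, cy) * rotate(theta) * translate(-cx, -cy) as one transform. */
    static AffineTransform rotationAbout(double theta, double cx, double cy) {
        AffineTransform t = new AffineTransform();
        t.translate(cx, cy);   // applied last: move the pivot back to its place
        t.rotate(theta);       // applied second: rotate around the origin
        t.translate(-cx, -cy); // applied first: move the pivot to the origin
        return t;
    }

    public static void main(String[] args) {
        AffineTransform t = rotationAbout(Math.toRadians(90), 100, 400);
        Point2D pivot = t.transform(new Point2D.Double(100, 400), null);
        Point2D right = t.transform(new Point2D.Double(110, 400), null);
        System.out.println(pivot); // the pivot stays at (100, 400)
        System.out.println(right); // a point 10 units to the right ends up 10 units above the pivot
    }
}
```

With PDFBox the combined transform would then be set with a single setTextMatrix call; to my knowledge PDFBox 2.0.x allows building a Matrix from an AffineTransform.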

Question:

File example: file.

Problem: when extracting text using PDFTextStripper, the tokens "9/1/2017" and "387986" appear after "ASSETS" at the start of the page, along with some other hidden tokens; they should be removed.

I have already applied this solution (I do not copy-paste it here because the problem is exactly the same), and the hidden text is still appearing on the page. Could it be hidden by something else except clip path? Thanks!


Answer:

Could it be hidden by something else except clip path?

Yes. In case of your new document the text is written in white on white, e.g. the 387986 after ASSETS is drawn like this:

1 1 1 rg
/TT0 16 Tf
-1011.938 115.993 Td
(@A,BAC)Tj 

The initial 1 1 1 rg sets the fill color to RGB WHITE. (Additionally that text is quite tiny but would still be visible if drawn in e.g. BLACK.)

The solution you refer to was implemented for documents like the sample document presented in that issue in which the invisible text is made invisible by defining clip paths (outside the bounds of which the text is) and by filling paths (hiding the text underneath). Thus, your white text won't be recognized by it as hidden.

Unfortunately, recognizing the invisibility of WHITE on WHITE text is more difficult than recognizing clipped or covered text: one not only needs to know a property of the current graphics state (like the clip path) or remove all text inside a given path, one also needs to know the color of that part of the page right before the text is drawn (to check the "on WHITE" detail).

If, on the other hand, you assume the page background to be essentially WHITE, it is fairly simple to ignore all white text: Simply also detect the current fill color in processTextPosition:

PDColor fillColor = gs.getNonStrokingColor();

and compare it to the flavors of WHITE you want to consider invisible. (Usually it should suffice to compare with RGB, CMYK, and Grayscale WHITE; in rare cases you'll also have to correctly interpret more complex color spaces. Additionally, you might also consider nearly WHITE colors invisible: (.99, .99, .99) RGB can hardly be distinguished from WHITE.)

If you find the current color to be WHITE, ignore the current TextPosition.

Be aware, though, just like the solution you referenced this is not yet the final solution recognizing all WHITE text: For that you'll also have to check the text rendering mode: If it is just filling (the default), the above holds, but if it is (also) stroking, you'll (also) have to consider the stroking color; if it is rendered invisible, there is no color to consider; and if the text rendering mode includes adding to path for clipping, you'll have to wait and determine what will be later drawn in this part of the page as long as the clip path holds, definitely not trivial!
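The color comparison described above might be sketched as follows. This is only an illustration under stated assumptions: it works on raw color components (in PDFBox these could come from fillColor.getComponents()), it guesses the color space from the number of components (1 = Grayscale, 3 = RGB, 4 = CMYK), and the class name WhiteCheck, the method name isWhite and the tolerance value are hypothetical choices, not part of PDFBox.

```java
class WhiteCheck {
    // Hypothetical tolerance: large enough that (.99, .99, .99) RGB counts as white.
    static final float TOLERANCE = 0.02f;

    /** Returns true if the components describe (nearly) WHITE in Gray, RGB or CMYK. */
    static boolean isWhite(float[] components) {
        switch (components.length) {
            case 1: // Grayscale: 1 is white
            case 3: // RGB: (1, 1, 1) is white
                for (float c : components) {
                    if (c < 1f - TOLERANCE) {
                        return false;
                    }
                }
                return true;
            case 4: // CMYK: (0, 0, 0, 0) is white
                for (float c : components) {
                    if (c > TOLERANCE) {
                        return false;
                    }
                }
                return true;
            default: // other color spaces are not interpreted in this sketch
                return false;
        }
    }
}
```

In processTextPosition one could then skip a TextPosition whenever isWhite returns true for the current fill color. More complex color spaces and the text rendering mode are deliberately left out here.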

Question:

I’m using PDFBox 2.0.4 to create PDF documents with acroForms. Here is my test code example:

PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);

PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);

String dir = "../testPdfBox/src/main/resources/fonts/";
PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));

PDResources resources = new PDResources();
String fontName = resources.add(font).getName();
acroForm.setDefaultResources(resources);

String defaultAppearanceString = format("/%s 12 Tf 0 g", fontName);
acroForm.setDefaultAppearance(defaultAppearanceString);

PDTextField field = new PDTextField(acroForm);
field.setPartialName("SampleField");
field.setDefaultAppearance(defaultAppearanceString);
acroForm.getFields().add(field);

PDAnnotationWidget widget = field.getWidgets().get(0);
PDRectangle rect = new PDRectangle(50, 750, 200, 50);
widget.setRectangle(rect);
widget.setPage(page);
widget.setPrinted(true);

page.getAnnotations().add(widget);

field.setValue("Sample field 123456");

acroForm.flatten();

document.save("target/SimpleForm.pdf");
document.close();

Everything works fine. But when I copy text from the created document and paste it into Notepad or Word, it turns into squares.

􀀷􀁅􀁑􀁔􀁐􀁉􀀄􀁊􀁍􀁉􀁐􀁈􀀄􀀕􀀖􀀗􀀘􀀙􀀚

I searched a lot about this problem. The most popular answer is that there is no toUnicode cmap in the created PDF. So I explored my document with CanOpener for Acrobat.

Yes, there is no toUnicode cmap, but everything works properly if acroForm.flatten() is not used. When the form fields are not flattened, I can copy/paste text from the document and it looks correct. Nevertheless, I need all fields to be flattened.

So, I have two questions:

  1. Why is there a problem with copy/pasting text in a flattened form, while everything is OK in a non-flattened one?

  2. What can I do to avoid the problem with copy/pasting text? Is the only solution to create the toUnicode CMap on my own, like in this example?

My test pdf files are available here.


Answer:

Please replace

PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));

with

PDType0Font font = PDType0Font.load(document, new FileInputStream(dir + "Roboto-Regular.ttf"), false);

This makes sure that the font is embedded in full and not just as a subset.

Question:

I'm using the following method to create a PDF file:

private void createPdf() throws IOException {
    PDDocument doc = new PDDocument();
    PDPage page = new PDPage();
    doc.addPage(new PDPage());

    PDPageContentStream content = new PDPageContentStream(doc, page);

    content.beginText();
    content.setFont(PDType1Font.HELVETICA, 26);
    content.showText("Example Text");
    content.endText();

    content.close();

    doc.save("report.pdf");
    doc.close();
}

It creates a new file with a white page, but no text is shown. What's wrong?

I use Apache PDFBox 2.0.7.


Answer:

Change this code

PDPage page = new PDPage();
doc.addPage(new PDPage());

to this

PDPage page = new PDPage();
doc.addPage(page);

Your mistake was adding a new, empty page to the document. The operations you performed were done on another PDPage object, one that was never added to the document.

Your text should now be visible at the bottom of the page. (y = 0 is bottom in PDF)
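Because y = 0 is the bottom edge, placing text a given distance below the top of the page means subtracting from the page height. A tiny sketch of that arithmetic; the class and method names are made up for illustration, and 792 is the height in points of a US Letter page, which is what the no-argument PDPage constructor is assumed to produce here.

```java
class TopDownY {
    /** Converts a distance measured from the top edge into a PDF y coordinate. */
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // 50 points below the top of a 792 point high page
        System.out.println(yFromTop(792f, 50f)); // prints 742.0
    }
}
```

The result could be passed as the y argument of content.newLineAtOffset(x, y) right after beginText(). Note that this positions the text baseline, so the glyphs extend above that coordinate.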

Question:

I'm using Apache PDFBox 2.0.2. I load PDF documents from the web to extract the text inside.

URL u = new URL("url/to/file.pdf");
PDDocument pddDocument = PDDocument.load(u.openStream());

PDFTextStripper textStripper = new PDFTextStripper();
String doc = textStripper.getText(pddDocument);

The problem is that sometimes I get an IllegalArgumentException: "Symbolic fonts must have a built-in encoding" and can't extract text from the PDF.

Please help.


Answer:

As already indicated by @Tilman opening a bug issue in the PDFBox Jira, this behavior is a bug:

The DictionaryEncoding constructor retrieves an Encoding instance for the base encoding of a font using Encoding.getInstance and is well aware that this method may return null:

base = Encoding.getInstance(name); // may be null

If it is null, though, and PDFBox has not been able to determine a built-in encoding of the font, the observed exception is thrown:

throw new IllegalArgumentException("Symbolic fonts must have a built-in " + 
                                   "encoding");

In the case at hand, the base encoding is MacExpertEncoding which is one of the possible base encodings explicitly named by the PDF specification. Unfortunately Encoding.getInstance does not know this encoding and, therefore, returns null which in turn triggers the exception as PDFBox also could not identify a built-in encoding.


Thus, a fix should include the addition of an Encoding class for MacExpertEncoding and extending Encoding.getInstance accordingly.

Furthermore, one should consider not throwing the exception at all: there are numerous situations where there is no need for an implicit or explicit base encoding, e.g. if the Differences explicitly provide a mapping for each character code or (in case of pure text extraction) if the font has a good ToUnicode table.

Question:

I want to automatically verify/assert that a certain text or sentence is present in each of a set of PDF files. I have thousands of PDF files that need to be checked for whether a specific text/sentence is present in them.


Answer:

You can do this with Apache Lucene and Apache PDFBox. Please refer to this post: http://www.programming-free.com/2012/11/simple-word-search-in-pdf-files-using.html
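If a full-text index is more than is needed, a simpler alternative (not the Lucene approach from the linked post) is to extract the text of each file with PDFTextStripper, as shown in other answers on this page, and check the resulting string directly. The sketch below covers only that comparison step and assumes the text has already been extracted. The names PhraseCheck, normalize and containsPhrase are hypothetical, and ignoring case and line breaks is a design choice that may not suit every check.

```java
import java.util.Locale;

class PhraseCheck {
    /** Collapses all whitespace runs (including line breaks) and lower-cases the text. */
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** True if the phrase occurs in the extracted text, ignoring case and line wrapping. */
    static boolean containsPhrase(String extractedText, String phrase) {
        return normalize(extractedText).contains(normalize(phrase));
    }
}
```

Normalizing whitespace matters because extracted text usually contains line breaks wherever the PDF wraps a sentence.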

Question:

//for writing filenames           
PDDocument doc = PDDocument.load(this.getClass().getResourceAsStream("/Vorlagen/Analyze/ReportTemplate.pdf"));
PDPage curFileNamePage = new PDPage(PDRectangle.A4);
doc.addPage(curFileNamePage);

contentStream = new PDPageContentStream(doc, curFileNamePage, PDPageContentStream.AppendMode.APPEND, true, true);
contentStream.setFont(pdfFont, 12);
contentStream.beginText();
float curYVal = 650f;
contentStream.newLineAtOffset(20, curYVal);

for (int idx = 0; idx < 377; idx++) {
    if (curYVal - 15f > 0) {
        curYVal = curYVal - 15f;
        contentStream.newLineAtOffset(0, curYVal);
        contentStream.showText("" + idx);
    } else {
        contentStream.endText();
        contentStream.close(); // close writing area
        curFileNamePage = new PDPage(PDRectangle.A4);
        doc.addPage(curFileNamePage);

        contentStream = new PDPageContentStream(doc, curFileNamePage, PDPageContentStream.AppendMode.APPEND, true, true);
        contentStream.setFont(pdfFont, 12);
        contentStream.beginText();
        curYVal = 650f;
        contentStream.newLineAtOffset(0, curYVal);
        contentStream.showText("" + idx);
    }

}
contentStream.endText();
contentStream.close(); // close writing area

doc.save("C:\\Users\\noname\\Desktop\\765.pdf");//Saving the document
doc.close();

So I have a number of indexes (in this case 377). My aim is to create a PDF where the indexes get printed one after another, each 15f further down the y-axis. If the end of the page is reached, it should create a new page and start from the top.

As you might guess, the code does not behave as it should. After I execute the code, the PDF file gets created with 9 pages, but the funny part is that each page contains only one number (except the first page). The numbers are: 43,87,131,175,219,263,307,351.

What am I doing wrong?

This is what the output should look like:

This is my current output:


Answer:

Change the first (not the second)

contentStream.newLineAtOffset(0, curYVal);

to

contentStream.newLineAtOffset(0, -15f);

This is because the position is relative. A large value therefore makes sense only the first time in a text segment (when it is relative to 0,0). After the text has been positioned the first time, just subtract the offset.
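To see why only one number per page was visible, one can add up the offsets by hand. The following sketch does not use PDFBox at all; it merely simulates how successive newLineAtOffset y values accumulate within one text segment. The class and method names are invented for this illustration.

```java
import java.util.Arrays;

class RelativeOffsets {
    /** Returns the absolute y position after each relative offset, starting at y = 0. */
    static float[] absoluteY(float... offsets) {
        float[] result = new float[offsets.length];
        float y = 0f; // beginText() starts at the origin
        for (int i = 0; i < offsets.length; i++) {
            y += offsets[i]; // each offset is relative to the previous line start
            result[i] = y;
        }
        return result;
    }

    public static void main(String[] args) {
        // Offsets as in the question: every call passes the intended absolute value
        System.out.println(Arrays.toString(absoluteY(650f, 635f, 620f)));
        // prints [650.0, 1285.0, 1905.0]

        // Offsets as in the fix: absolute value once, then -15 per line
        System.out.println(Arrays.toString(absoluteY(650f, -15f, -15f)));
        // prints [650.0, 635.0, 620.0]
    }
}
```

An A4 page is only about 842 points high, so in the first case every line after the first one lands far above the visible page.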

Question:

I have some code that takes a template PDF, creates a new PDF, overlays the new PDF over the template PDF and writes the result to a stream. All this using PDFBox 2.0.4.

The problem is that copy-pasting text from the generated PDF to a text editor results in garbage text.

This happens only for the text that was added by my code, the text in the original template still works fine. The text that I add gets added using a custom font.

How do I fix the generated PDF so that the text can be copy-pasted?

SSCCE:

public class PDFTest {

    private static final String FONT = "/fonts/font.ttf";

    public static void main(final String... args) throws IOException, FontFormatException {
        final Overlay overlay = new Overlay();
        overlay.setInputPDF(newDocument("Input text", 400));
        overlay.setAllPagesOverlayPDF(newDocument("Test text", 200));

        try (final PDDocument document = overlay.overlay(new HashMap<>())) {
            document.save("example.pdf");
        }
    }

    private static PDDocument newDocument(final String text, final int offsetY) throws IOException, FontFormatException {
        final PDDocument document = new PDDocument();
        document.addPage(insertTextInPage(document, text, offsetY));
        return document;
    }

    private static PDPage insertTextInPage(final PDDocument document, final String text, final int offsetY) throws IOException, FontFormatException {
        try (final InputStream fontStream = PDFTest.class.getResourceAsStream(FONT)) {
            final PDFont normalFont = PDType0Font.load(document, fontStream);

            final PDPage page = new PDPage();
            try (final PDPageContentStream contentStream = new PDPageContentStream(document, page, APPEND, false)) {
                addTextBlock(contentStream, normalFont, text, offsetY);
            }
            return page;
        }
    }

    private static void addTextBlock(final PDPageContentStream contentStream, final PDFont font, final String text, final int offsetY)
            throws IOException {
        contentStream.beginText();
        contentStream.setFont(font, 16);
        contentStream.newLineAtOffset(20, offsetY);
        contentStream.showText(text);
        contentStream.endText();
    }
}

Answer:

This is a known issue (PDFBOX-3243): files constructed with subsetted fonts (you are using PDType0Font.load(), which is very efficient) are in an intermediate state until they are saved, which is when the subsetting takes place.

Solution for you: either save and reload, or save to a dummy. On Windows I changed newDocument like this and it worked:

private static PDDocument newDocument(final String text, final int offsetY) throws IOException, FontFormatException
{
    final PDDocument document = new PDDocument();
    document.addPage(insertTextInPage(document, text, offsetY));
    document.save("nul"); // NEW!
    return document;
}

Question:

I am trying to make text selectable in a PDF reading application built with JavaFX. I have PDF files that contain screenshots with text and an OCR layer, so I need the text to be selectable like in a regular viewer. I have already set up getting an image from a page and am now trying to figure out how to highlight text.

I tried the following:

    InputStream is = this.getClass().getResourceAsStream(currentPdf);
    Image convertedImage;
    try {
        PDDocument document = PDDocument.load(is);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        PDPage page = list.get(pageNum);
        List annotations = page.getAnnotations();
        PDAnnotationTextMarkup markup = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
        markup.setRectangle(new PDRectangle(600, 600));
        markup.setQuadPoints(new float[]{100, 100, 200, 100, 100, 500, 200, 500});
        annotations.add(markup);
        page.setAnnotations(annotations);
        BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 128);
        convertedImage = SwingFXUtils.toFXImage(image, null);
        document.close();
        imageView.setImage(convertedImage);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }

but that results in an image without any highlights.

I also tried to find information on Stack Overflow and other resources, but haven't found anything.

I would appreciate a Java code sample that enables text highlighting with the mouse.


Answer:

I used ICEpdf and did the following:

question.getSelectedBounds()
                .stream()
                .map(Shape::getBounds)
                .forEach(bounds -> {
                    SquareAnnotation squareAnnotation = (SquareAnnotation)
                            AnnotationFactory.buildAnnotation(
                                    pdfController.getPageTree().getLibrary(),
                                    Annotation.SUBTYPE_SQUARE,
                                    bounds);
                    squareAnnotation.setFillColor(true);
                    squareAnnotation.setFillColor(new Color(255, 250, 57, 120));
                    squareAnnotation.setRectangle(bounds);
                    squareAnnotation.setBBox(bounds);
                    squareAnnotation.resetAppearanceStream(null);
                    AbstractAnnotationComponent annotationComponent = AnnotationComponentFactory
                            .buildAnnotationComponent(squareAnnotation, pdfController.getDocumentViewController(),
                                    pageViewComponent, pdfController.getDocumentViewController().getDocumentViewModel());
                    pageViewComponent.addAnnotation(annotationComponent);
                });

Question:

I'm stuck with PDFBox-Android (pdfbox-android:1.8.9.0).

I load the first page of a PDF file, write text on it, and import this page as a new page of a final document.

The problem is that when I create new pages, the last page, which still contains the previous text, is reused. So the first page is OK, but on the following pages the texts are superimposed.

private File writeReport() {
    File fileSource = new File(getActivity().getApplicationContext().getExternalCacheDir(), "fileSource.pdf");

    // get file model in assets
    InputStream inputAsset = null;
    try {
        inputAsset = getActivity().getApplicationContext().getResources().getAssets().open("file_model.pdf");
    } catch (IOException e) {
        e.printStackTrace();
    }

    // copy file model in fileSource
    try {
        OutputStream outputStream = new FileOutputStream(fileSource);
        byte buffer[] = new byte[1024];
        int length = 0;

        while ((length = inputAsset.read(buffer)) > 0) {
            outputStream.write(buffer, 0, length);
        }
        outputStream.close();
        inputAsset.close();
    } catch (IOException e) {
        System.out.print("Copy assets : IOException" + e.getMessage());
    }

    File fileTarget = new File(getActivity().getApplicationContext().getCacheDir(), "fileTarget.pdf");


    try {
        PDFBoxResourceLoader.init(getActivity().getApplicationContext());       // init lib

        PDDocument documentSource = PDDocument.load(fileSource);
        PDDocument documentTarget = new PDDocument();

        // iteration == a new page
        for(int i=0 ; i < 3 ; i++)
        {
            PDPage page = documentSource.getDocumentCatalog().getPages().get(0);
            PDPageContentStream contentStream = new PDPageContentStream(documentSource, page, true, true);

            PDFont font1 = PDType1Font.HELVETICA;

            float startY = page.getMediaBox().getUpperRightY();
            float factor = 2.83f;

            page.getStream();

            // add text
            contentStream.beginText();
            contentStream.setFont(font1, 10);
            contentStream.newLineAtOffset(factor * 60, startY - (factor * 53));
            contentStream.showText("test text" + i);
            contentStream.endText();

            contentStream.close();

            // import source page to output file->  Problem here ! new page contain old overlay text... 
            documentTarget.importPage(page);
        }
        documentTarget.save(fileTarget);
        documentTarget.close();
        documentSource.close();

    } catch (IOException e) {
        e.printStackTrace();
    }
    return fileTarget;
}

Is there a way to get a fresh page at every iteration? Thanks!


Answer:

The problem is that you reused the same PDPage object, which resulted in the effect that it contained the text of the previous iteration. Solution: use the result of

documentTarget.importPage(page)

and work with that one, because it is a new PDPage object with everything cloned. Your new code will look like this:

    // iteration == a new page
    for(int i=0 ; i < 3 ; i++)
    {
        PDPage page = documentSource.getDocumentCatalog().getPages().get(0);
        page = documentTarget.importPage(page); // this is now a new PDPage object
        PDPageContentStream contentStream = new PDPageContentStream(documentSource, page, true, true);

        PDFont font1 = PDType1Font.HELVETICA;

        float startY = page.getMediaBox().getUpperRightY();
        float factor = 2.83f;

        page.getStream();

        // add text
        contentStream.beginText();
        contentStream.setFont(font1, 10);
        contentStream.newLineAtOffset(factor * 60, startY - (factor * 53));
        contentStream.showText("test text" + i);
        contentStream.endText();

        contentStream.close();            
    }

Question:

I have looked through all the questions related to this issue on SO, but can't find an answer.

I have a text file which contains Unicode characters like "ā", "š", "ī" and others. The problem is that when I write the text file to a PDF, the PDF does not display them correctly.

How should I set up my code so that I can write these characters to my PDF? Maybe an even better question is: is that even possible? I have been looking for this for a few hours and can't find a solution.

Since this app will be commercial, I can't use iText!

My Code:

TextToPDF pdf = new TextToPDF();
String fileName = "test.txt";
File pdfFile = new File("test.pdf");

BufferedReader reader = new BufferedReader(new FileReader(fileName));

PDSimpleFont courier = PDType1Font.COURIER;
PDSimpleFont testFont = PDTrueTypeFont.loadTTF( document, new File("times.ttf" ));

pdf.setFont(testFont);
pdf.setFontSize(8);

pdf.createPDFFromText(document, reader);

document.save(pdfFile);
document.close();

If someone has done this, please share how you managed to do that. I believe it should be related to font.setFontEncoding(), but since the PDFBox documentation is lacking quite a lot of information, I haven't figured out what I should do or how.

By the way, here is the list of SO questions I have read, so please don't redirect me back to them...

1) Java PDFBOX text encoding

2) Using Java PDFBox library to write Russian PDF

3) Using PDFBox to write UTF-8 encoded strings to a PDF

There were more topics I read, but these were the ones still open in my tabs.

EDITED: Just found this -> Using PDFBox to write unicode strings to a PDF

It seems it's not possible; I need to update to version 2.0.0 and give it a try.

EDITED #2: In the new version of PDFBox, 2.0.0, the class TextToPDF, which let me pass in a text file, has been removed (at least for now). So now either I manually read the text and then write it to the PDF, or I need to find some other solution.


Answer:

Your problem is here:

BufferedReader reader = new BufferedReader(new FileReader(fileName));

As described here: http://docs.oracle.com/javase/7/docs/api/java/io/FileReader.html FileReader reads the file in the system default encoding. Change it to this:

BufferedReader reader = new BufferedReader(
           new InputStreamReader(
                      new FileInputStream(fileName), "UTF-8"));

This reads your file as UTF-8, provided it actually is UTF-8. Special characters like the ones you describe exist in a lot of character encodings (UTF-8, several of the ISO 8859 variants, etc.).

When you know the encoding of your input, make sure to read it in this encoding. Then PDFBox can write the characters in its desired encoding, too.
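The effect of the reader's charset can be seen without any file or PDF involved. The sketch below decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-1; the class name CharsetDemo and the helper readAll are made up for this illustration. It only covers the reading side, so whether the chosen font can actually display these characters in the PDF is a separate matter.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    /** Reads all characters from the bytes, decoding them with the given charset. */
    static String readAll(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u0101\u0161\u012b"; // the characters ā, š, ī
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the right charset restores the three characters
        System.out.println(readAll(utf8, StandardCharsets.UTF_8).equals(original));
        // Decoding with the wrong charset yields one wrong character per byte
        System.out.println(readAll(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream as in the answer above.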

Question:

I'm trying to replace text in a PDF, and it is only partly working. This is my code:

PDDocument doc = null;
    int occurrences = 0;
    try {
        doc = PDDocument.load("test.pdf"); //Input PDF File Name
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    if (op.getOperation().equals("Tj")) {
                        // Tj takes one operand, the string to display,
                        // so let's update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        if (string.contains("Good")) {
                            string = string.replace("Good", "Bad");
                            occurrences++;
                        }
                        //Word you want to change. Currently this code changes word "Good" to "Bad"
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    } else if (op.getOperation().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        COSString temp = new COSString();

                        String tempString = "";
                        for (int t = 0; t < previous.size(); t++) {

                            if (previous.get(t) instanceof COSString) {
                                tempString += ((COSString) previous.get(t)).getString();

                            }
                        }

                        temp.append(tempString.getBytes("ISO-8859-1"));
                        tempString = "";
                        tempString = temp.getString();
                        if (tempString.contains("Good")) {
                            tempString = tempString.replace("Good", "Bad");
                            occurrences++;
                        }
                        previous.clear();

                        String[] stringArray = tempString.split(" ");

                        for (String string : stringArray) {
                            COSString cosString = new COSString();
                            string = string + " ";
                            cosString.append(string.getBytes("ISO-8859-1"));
                            previous.add(cosString);
                        }

                    }
                }
            }
            // now that the tokens are updated we will replace the page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }
        System.out.println("number of matches found: " + occurrences);
        doc.save("a.pdf"); //Output file name
    } catch (IOException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } catch (COSVisitorException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ex) {
                Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }

The issue is that the replaced text is displayed with wrong or missing characters (for example, the word "Bad" shows up as only the character "d"). But if I copy and paste it somewhere else, the expected word is pasted correctly. Also, when I search the generated PDF for the new word, it is not found, but when I search for the old word, it is found at the replaced places.


Answer:

I found Aspose; this link shows how to use it to replace text in PDFs. It's easy and works perfectly, except that it's not free: the free version prints a copyright line at the top of each PDF page. http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document

Question:

I wrote a simple program in Java using PDFBox to extract words from a PDF file. It reads the text from the PDF and extracts it word by word.

public class Main {

    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                String lines[] = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e){
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}

Is there a way to extract the words without duplicates?


Answer:

  1. Split each line by space - line.split(" ")
  2. Maintain a HashSet to hold these words and keep adding all the words to it.

HashSet by its nature will ignore the duplicates.

HashSet<String> uniqueWords = new HashSet<>();

for (String line : lines) {
  String[] words = line.split(" ");

  for (String word : words) {
    uniqueWords.add(word);
  }
}
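A slightly more defensive variant of the same idea, as a sketch: it splits on any run of whitespace rather than a single space, skips empty tokens, and uses a LinkedHashSet so that the words keep the order in which they first appear. The class and method names are made up here, and punctuation and letter case are not normalized, so "Word" and "word," would still count as different words.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    /** Returns the distinct whitespace-separated words of the text in first-seen order. */
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) { // leading whitespace produces an empty first token
                words.add(word);
            }
        }
        return words;
    }
}
```

With the question's code, one could call uniqueWords(pdfFileInText) directly on the extracted text instead of splitting it into lines first.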

Question:

I want to create a PDF using pdfbox (https://pdfbox.apache.org/cookbook/documentcreation.html). However, pdfbox does not seem to provide dynamic text layout mechanisms like those a text editor like OpenOffice provides (automatic text flow using predefined text formattings like block format, centered text, line breaks etc.).

Is there any Java library that provides that functionality on top of pdfbox or separate from it? Or do you have any free code available?


Answer:

I had the same problem, that's why I started PDFBox-Layout. It has support for simple word wrapping, text and paragraph alignment, pagination, vertical and column layout, and markup for easy bold/italic highlighting.

See the Wiki for more information. Maybe you will find it useful :-)

Question:

As the title says, I want to filter out all text from a PDF that is above a certain font size. Currently, I am using the PDFBox library but I am open to using any other free library for Java.

My approach was to use a PDFStreamParser to iterate through the tokens. When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen. However, it has become clear to me that this relatively simple approach will not work because the text may be scaled by the current transformation matrix.

Is there a better approach I could be taking, or a way to make my approach work without it getting too complicated?


Answer:

Your approach

When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen.

is too simple.

On one hand, as you remark yourself,

the text may be scaled by the current transformation matrix.

(Actually not only by the transformation matrix but also by the text matrix!)

Thus, you have to keep track of these matrices.

On the other hand Tf doesn't only set the base font size for the next text drawing instruction seen, it sets it until the size is explicitly changed by some other instruction.

Furthermore, the text font size and the current transformation matrix are part of the graphics state; thus, they are subject to save state and restore state instructions.

To edit a content stream with respect to the current state, therefore, you have to keep track of a lot of information. Fortunately, PDFBox contains classes to do the heavy lifting here, the class hierarchy based on the PDFStreamEngine, allowing you to concentrate on your task. To have as much information as possible available for editing, the PDFGraphicsStreamEngine class appears to be a good choice to build upon.
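To make the scaling point concrete, here is a small sketch of how an effective font size could be derived once the font size, the text matrix and the current transformation matrix are known. It is an illustration using java.awt.geom.AffineTransform rather than PDFBox's own Matrix class, and the class and method names are invented for it. The idea is to push the vector (0, fontSize) through both matrices and measure its length; horizontal scaling and text rise are ignored.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class EffectiveFontSize {
    /** Length of the vector (0, fontSize) after applying the text matrix and then the CTM. */
    static double effectiveSize(float fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix); // the text matrix is applied first, then the CTM
        // deltaTransform ignores translation, which is what we want for a size vector
        Point2D v = combined.deltaTransform(new Point2D.Double(0, fontSize), null);
        return Math.hypot(v.getX(), v.getY());
    }
}
```

For example, a Tf size of 1 combined with a text matrix that scales by 12 gives an effective size of 12, which a check of the Tf operand alone would miss. A pure rotation leaves the size unchanged.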

A generic content stream editor class

Thus, let's derive PdfContentStreamEditor from PDFGraphicsStreamEngine and add some code for generating a replacement content stream.

public class PdfContentStreamEditor extends PDFGraphicsStreamEngine {
    public PdfContentStreamEditor(PDDocument document, PDPage page) {
        super(page);
        this.document = document;
    }

    /**
     * <p>
     * This method retrieves the next operation before its registered
     * listener is called. The default does nothing.
     * </p>
     * <p>
     * Override this method to retrieve state information from before the
     * operation execution.
     * </p> 
     */
    protected void nextOperation(Operator operator, List<COSBase> operands) {

    }

    /**
     * <p>
     * This method writes content stream operations to the target canvas. The default
     * implementation writes them as they come, so it essentially generates identical
     * copies of the original instructions {@link #processOperator(Operator, List)}
     * forwards to it.
     * </p>
     * <p>
     * Override this method to achieve some fancy editing effect.
     * </p> 
     */
    protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
        contentStreamWriter.writeTokens(operands);
        contentStreamWriter.writeToken(operator);
    }

    // stub implementation of PDFGraphicsStreamEngine abstract methods
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

    @Override
    public void drawImage(PDImage pdImage) throws IOException { }

    @Override
    public void clip(int windingRule) throws IOException { }

    @Override
    public void moveTo(float x, float y) throws IOException { }

    @Override
    public void lineTo(float x, float y) throws IOException { }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }

    @Override
    public Point2D getCurrentPoint() throws IOException { return null; }

    @Override
    public void closePath() throws IOException { }

    @Override
    public void endPath() throws IOException { }

    @Override
    public void strokePath() throws IOException { }

    @Override
    public void fillPath(int windingRule) throws IOException { }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException { }

    @Override
    public void shadingFill(COSName shadingName) throws IOException { }

    // PDFStreamEngine overrides to allow editing
    @Override
    public void processPage(PDPage page) throws IOException {
        PDStream stream = new PDStream(document);
        replacement = new ContentStreamWriter(replacementStream = stream.createOutputStream(COSName.FLATE_DECODE));
        super.processPage(page);
        replacementStream.close();
        page.setContents(stream);
        replacement = null;
        replacementStream = null;
    }

    @Override
    public void showForm(PDFormXObject form) throws IOException {
        // DON'T descend into XObjects
        // super.showForm(form);
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        nextOperation(operator, operands);
        super.processOperator(operator, operands);
        write(replacement, operator, operands);
    }

    final PDDocument document;
    OutputStream replacementStream = null;
    ContentStreamWriter replacement = null;
}

(PdfContentStreamEditor class)

This code overrides processPage to create a new page content stream and eventually replace the old one with it. And it overrides processOperator to provide the processed instruction for editing.

For editing one simply overrides write here. The existing implementation simply writes the instructions as they come while you may change the instructions to write. Overriding nextOperation allows you to peek at the graphics state before the current instruction is applied to it.

Applying the editor as is,

PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page);
    identity.processPage(page);
}
document.save(RESULT);

(EditPageContent test testIdentityInput)

therefore, will create a result PDF with equivalent content streams.

Customizing the content stream editor for your use case

You want to

filter out all text from a PDF that is above a certain font size.

Thus, we have to check in write whether the current instruction is a text drawing instruction, and if it is, we have to check the current effective font size, i.e. the base font size transformed by the text matrix and the current transformation matrix. If the effective font size is too large, we have to drop the instruction.

This can be done as follows:

PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page) {
        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                float fs = getGraphicsState().getTextState().getFontSize();
                Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
                Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
                double transformedFs = transformedFsVector.distance(transformedOrigin);
                if (transformedFs > 100)
                    return;
            }

            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    identity.processPage(page);
}
document.save(RESULT);

(EditPageContent test testRemoveBigTextDocument)

Strictly speaking completely dropping the instruction in question may not suffice; instead, one would have to replace it with an instruction to change the text matrix just like the dropped text drawing instructions would have done. Otherwise the following not-dropped text may be moved. Often, though, this does work as is because the text matrix is newly set for the following different text. So let's keep it simple here.
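To make the "effective font size" used in write above a bit more tangible, here is a minimal sketch of the same computation that uses only java.awt.geom instead of the PDFBox Matrix class. The class name EffectiveFontSizeSketch, the method name and the sample matrices are made up for illustration; the only thing taken from the code above is the idea of transforming the vector (0, fontSize) first by the text matrix and then by the current transformation matrix and measuring its length.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class EffectiveFontSizeSketch {

    /**
     * Length of the vector (0, fontSize) after applying first the text
     * matrix and then the current transformation matrix (CTM).
     */
    static double effectiveFontSize(double fontSize, AffineTransform textMatrix, AffineTransform ctm) {
        // concatenate(t) makes t act on a point first, so this is "text matrix, then CTM"
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix);

        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        Point2D tip = combined.transform(new Point2D.Double(0, fontSize), null);
        return origin.distance(tip);
    }

    public static void main(String[] args) {
        // hypothetical values: 12 pt font, text matrix scaling by 2,
        // CTM scaling by 1.5 and rotating the content by 90 degrees
        AffineTransform textMatrix = AffineTransform.getScaleInstance(2, 2);
        AffineTransform ctm = AffineTransform.getRotateInstance(Math.PI / 2);
        ctm.scale(1.5, 1.5);

        System.out.println(effectiveFontSize(12, textMatrix, ctm));
    }
}
```

With these sample values the result is about 36, i.e. 12 × 2 × 1.5; the rotation does not change the length. Measuring a transformed vector rather than reading a single matrix entry is what keeps the check working for rotated text.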

Constraints and remarks

This PdfContentStreamEditor only edits the page content stream. XObjects and patterns used from there are currently not edited by the editor. It should be easy, though, to recursively iterate over the XObjects and patterns after editing the page content stream and edit them in a similar fashion.

This PdfContentStreamEditor essentially is a port of the PdfContentStreamEditor for iText 5 (.Net/Java) from this answer and the PdfCanvasEditor for iText 7 from this answer. The examples for using those editor classes may give some hints on how to use this PdfContentStreamEditor for PDFBox.

A similar (but less generic) approach has been used previously in the HelloSignManipulator class in this answer.

Fixing a bug

In the context of this question a bug in the PdfContentStreamEditor was found which caused some text lines in the example PDF in focus there to be moved.

The background: Some PDF instructions are defined via other ones, e.g. tx ty TD is specified to have the same effect as -ty TL tx ty Td. The corresponding PDFBox OperatorProcessor implementations for simplicity work by feeding the equivalent instructions back into the stream engine.

The PdfContentStreamEditor as implemented above in such a case receives signals for both the replacement instructions and the original instruction and writes them all back into the result stream. Thus, the effect of those instructions is doubled. E.g. in case of the TD instruction the text insertion point is forwarded two lines instead of one...

Thus, we have to ignore the replacement instructions. For this replace the method processOperator above by

@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
    if (inOperator) {
        super.processOperator(operator, operands);
    } else {
        inOperator = true;
        nextOperation(operator, operands);
        super.processOperator(operator, operands);
        write(replacement, operator, operands);
        inOperator = false;
    }
}

boolean inOperator = false;
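The effect of the inOperator flag can be reproduced without PDFBox. The following is only a toy model: operators are plain strings, the class ToyEditor is invented for this illustration, and its defaultProcessing method merely imitates an operator processor that expands TD into TL and Td by feeding them back into processOperator.

```java
import java.util.ArrayList;
import java.util.List;

class ReentrancyGuardSketch {

    /** Toy stand-in for the editor; operators are plain strings here. */
    static class ToyEditor {
        final List<String> written = new ArrayList<>();
        private final boolean guarded;
        private boolean inOperator = false;

        ToyEditor(boolean guarded) {
            this.guarded = guarded;
        }

        /** Imitates an operator processor that expands TD into TL and Td. */
        private void defaultProcessing(String operator) {
            if (operator.equals("TD")) {
                processOperator("TL");
                processOperator("Td");
            }
        }

        void processOperator(String operator) {
            if (guarded && inOperator) {
                // nested call: let the state update happen, but write nothing
                defaultProcessing(operator);
                return;
            }
            inOperator = true;
            defaultProcessing(operator);
            written.add(operator);
            inOperator = false;
        }
    }

    public static void main(String[] args) {
        ToyEditor unguarded = new ToyEditor(false);
        unguarded.processOperator("TD");
        System.out.println(unguarded.written); // [TL, Td, TD]

        ToyEditor guarded = new ToyEditor(true);
        guarded.processOperator("TD");
        System.out.println(guarded.written); // [TD]
    }
}
```

Without the guard the output stream contains the expansion and the original instruction, which is the doubling described above; with the guard only the original instruction is written.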

Question:

I have an InputStream of a PDF file. Now I want to extract all string content from the InputStream. I found the following examples. Should I use the first one or the second one? Is there any performance difference between the two? What is the use of PDFParser in the second one?

snippet 1 :

PDDocument doc = PDDocument.load(is);

PDFTextStripper stripper = new PDFTextStripper();

String result = stripper.getText(doc);

snippet 2:

PDFTextStripper stripper = new PDFTextStripper();

PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(stream));

parser.parse();

doc = parser.getPDDocument();

String content = stripper.getText(doc);

Thanks in Advance !!!


Answer:

Use the first snippet. The second one works too but is outdated and does nothing different: load() calls the parser internally. The speed is the same. You'll get the best results by using a file as parameter, or a byte array. Using a stream will require PDFBox to do some additional buffering. Your code does not tell where stream comes from; if it is a FileInputStream, then you should use File instead.

Question:


Answer:

As you did not supply your line drawing code, I here draw a line myself. You might have to adapt this to your situation.

To rotate text above the line, you have to change the text matrix or the current transformation matrix to rotate following content, e.g. like this:

PDDocument doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);
PDPageContentStream cos = new PDPageContentStream(doc, page);
cos.transform(Matrix.getRotateInstance(-Math.PI / 6, 100, 650));
cos.moveTo(0, 0);
cos.lineTo(125, 0);
cos.stroke();
cos.beginText();
String text = "0.72";
cos.newLineAtOffset(50, 5);
cos.setFont(PDType1Font.HELVETICA_BOLD, 12);
cos.showText(text);
cos.endText();
cos.close();
doc.save("TextOnLine.pdf");
doc.close();

(RotatedTextOnLine test testRotatedTextOnLineForCedrickKapema)

I chose to use the current transformation matrix because that allowed me to rotate a horizontal line together with the text.
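To check where the line and the text end up on the page, one can replay the transformation with java.awt.geom.AffineTransform. This sketch assumes that Matrix.getRotateInstance(theta, tx, ty) rotates about the origin and then translates by (tx, ty), which is my reading of the PDFBox 2.0 implementation; the helper names rotateThenTranslate and toPage are invented here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class RotatedLineSketch {

    /** Rotation about the origin by theta, followed by a translation to (tx, ty). */
    static AffineTransform rotateThenTranslate(double theta, double tx, double ty) {
        AffineTransform at = AffineTransform.getTranslateInstance(tx, ty);
        at.rotate(theta); // appended, so the rotation acts on a point first
        return at;
    }

    /** Maps a point given in the rotated coordinate system to page coordinates. */
    static Point2D toPage(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // same numbers as in the content stream code above
        AffineTransform ctm = rotateThenTranslate(-Math.PI / 6, 100, 650);
        System.out.println("line start:  " + toPage(ctm, 0, 0));
        System.out.println("line end:    " + toPage(ctm, 125, 0));
        System.out.println("text origin: " + toPage(ctm, 50, 5));
    }
}
```

Under that assumption the line runs from (100, 650) to roughly (208.25, 587.5), i.e. it slopes downwards by 30 degrees, and the text starts slightly above it.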


Question:

File example: test

Here in the 2nd row in the table, after "3500 RENT", there are 2 text tokens ("1", "1") returned by PDFTextStripper which are not actually visible in the original PDF. I know that it could be a clip path (like in the post here) or a color issue (like in the post here).

However, it looks like in this case it's hidden by some other means... the clip path does not overlap and the color is black for those tokens.

What else could it be?


Answer:

It is a color issue, the '1's are printed in white.

What makes the situation a bit special is that the ColorSpace in use is not your off-the-shelf DeviceRGB or DeviceGray but a Separation color space, and color values in Separation color spaces are always treated as subtractive colors. Thus, a tint value of 0.0 denotes the lightest color that can be achieved with the given colorant, and 1.0 is the darkest. This convention is the same as for DeviceCMYK color components but opposite to the one for DeviceGray and DeviceRGB.

(cf. ISO 32000-1 section 8.6.6.4 "Separation Colour Spaces")

Inside view

Your content stream starts like this:

/Cs8 cs 1 scn

Cs8 is a Separation color space:

/Cs8 [/Separation /Black [/ICCBased 17 0 R] 18 0 R] 

with an ICCBased alternate space which in turn has DeviceRGB as alternate space

17 0 obj
<<
/Length 2597
/Alternate /DeviceRGB
/Filter /FlateDecode
/N 3
>>
stream
[...ICC profile...]
endstream
endobj 

and a tint transform by samples to the alternate color space

18 0 obj
<<
/Length 779
/BitsPerSample 8
/Decode [0 1 0 1 0 1]
/Domain [0 1]
/Encode [0 254]
/Filter /FlateDecode
/FunctionType 0
/Range [0 1 0 1 0 1]
/Size [255]
>>
stream
[...255 samples from (255,255,255) to (35,31,32)...]
endstream
endobj 

Your content stream continues with operations drawing the headers and the start of the first row and then

/TT2 1 Tf
0 scn
13.559 0 TD
6.8438 Tc
<00140014>Tj
1 scn 

0 scn sets the color to the lightest Cs8 BLACK separation color which is mapped by sample to (255,255,255) on screen which will be pretty white, 6.8438 Tc sets a large character spacing (resulting in the gap between the two '1's), <00140014>Tj draws the two '1's, and 1 scn switches back to the darkest Cs8 BLACK separation color mapped by sample to (35,31,32) on screen which will be a very dark grayish color.
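The subtractive behaviour of the tint value can be illustrated with a few lines of plain Java. Note the assumption: the real tint transform is a sampled function with 255 entries whose intermediate values I have not inspected, so this sketch simply interpolates linearly between the two end points quoted above, (255,255,255) and (35,31,32). The class name and the whiteness threshold of 250 are made up for illustration.

```java
class SeparationTintSketch {

    static final int[] LIGHTEST = {255, 255, 255}; // sample for tint 0.0
    static final int[] DARKEST = {35, 31, 32};     // sample for tint 1.0

    /** Maps a tint in [0, 1] to RGB by linear interpolation (an assumption). */
    static int[] tintToRgb(double tint) {
        double t = Math.max(0.0, Math.min(1.0, tint));
        int[] rgb = new int[3];
        for (int i = 0; i < 3; i++) {
            rgb[i] = (int) Math.round(LIGHTEST[i] + t * (DARKEST[i] - LIGHTEST[i]));
        }
        return rgb;
    }

    /** Crude check for text that vanishes on a white page; the threshold is arbitrary. */
    static boolean isNearlyWhite(int[] rgb) {
        return rgb[0] >= 250 && rgb[1] >= 250 && rgb[2] >= 250;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(tintToRgb(0.0)) + " invisible: " + isNearlyWhite(tintToRgb(0.0)));
        System.out.println(java.util.Arrays.toString(tintToRgb(1.0)) + " invisible: " + isNearlyWhite(tintToRgb(1.0)));
    }
}
```

So a tint of 0, which one might intuitively read as "black" in a color space named Black, is the one that produces white, invisible text.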

With PDFBox

In a comment you say

when I debug it in processTextPosition(TextPosition text), gs.getNonStrokingColor() has same value for those "1" tokens as for others tokens and is actually black

To recognize this with PDFBox, you have to tell its PDFTextStripper to look for the generic color space selection and color selection operators cs and scn and extend processTextPosition like in this proof-of-concept:

PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();
        PDColor color = gs.getNonStrokingColor();
        float[] currentComponents = color.getComponents();
        if (!Arrays.equals(components, currentComponents)) {
            System.out.print(Arrays.toString(currentComponents));
            components = currentComponents;
        }
        System.out.print(text.getUnicode());
        super.processTextPosition(text);
    }

    float[] components;
};

stripper.addOperator(new SetNonStrokingColorSpace());
stripper.addOperator(new SetNonStrokingColorN());

(ExtractText test testTestSeparation)

With these settings in place you get

[1.0]TenantLeaseStart ... 3,500.00RENT[0.0]11[1.0]16,133.33

As you see the color component starts with 1.0, for the two '1's it is 0.0, and thereafter it becomes 1.0 again until the next run of invisible '1's.

Question:

Let's say I managed to cast a PDTerminalField as an instance of PDPushButton. But looking at the APIs provided, I can't guess how to extract the label of said button.

Not adding code due to the verbosity of the application. This is a sample pdf.


Answer:

(Thanks to @Tilman for correcting me here.) There indeed is such an attribute, you can access it via getAppearanceCharacteristics().getNormalCaption(), but this attribute is optional and its contents are not guaranteed to coincide with the visual appearance of the button because the appearance stream may contain different information. Thus a combined strategy of querying the attribute and reading the appearance stream might be called for.

The appearance stream of a button in a PDF can contain any number of graphics and text drawing instructions to paint the button but this stream is not necessarily easy to read or parse. E.g. in case of the sample file provided by the OP, this stream looks like this:

1 0.75 0.666656 rg
0 0 72 20 re
f
q
1 1 70 18 re
W
n
0 g
BT
/HeBo 12 Tf
0 g
6.696 5.857 Td
(My ) Tj
19.992 0 Td
(Button) Tj
ET
Q

Here one can already see the button text, "My Button", but obviously one has to do some parsing to retrieve it. In particular, as the text encoding need not be derived from ASCII as it is in this case, one has to apply text extraction to the stream.

Unfortunately the main text extraction workhorse in PDFBox, the PDFTextStripper class, is very hard to apply to anything other than page content. Thus, I use a base class the text stripper is derived from, add only minimal text arrangement capabilities, and apply it to the button appearance stream.

import java.io.IOException;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.contentstream.operator.text.BeginText;
import org.apache.pdfbox.contentstream.operator.text.EndText;
import org.apache.pdfbox.contentstream.operator.text.MoveText;
import org.apache.pdfbox.contentstream.operator.text.MoveTextSetLeading;
import org.apache.pdfbox.contentstream.operator.text.NextLine;
import org.apache.pdfbox.contentstream.operator.text.SetCharSpacing;
import org.apache.pdfbox.contentstream.operator.text.SetFontAndSize;
import org.apache.pdfbox.contentstream.operator.text.SetTextHorizontalScaling;
import org.apache.pdfbox.contentstream.operator.text.SetTextLeading;
import org.apache.pdfbox.contentstream.operator.text.SetTextRenderingMode;
import org.apache.pdfbox.contentstream.operator.text.SetTextRise;
import org.apache.pdfbox.contentstream.operator.text.SetWordSpacing;
import org.apache.pdfbox.contentstream.operator.text.ShowText;
import org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLine;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLineAndSpace;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class SimpleXObjectTextStripper extends PDFStreamEngine {
    public SimpleXObjectTextStripper() {
        addOperator(new BeginText());
        addOperator(new Concatenate());
        addOperator(new DrawObject()); // special text version
        addOperator(new EndText());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new NextLine());
        addOperator(new SetCharSpacing());
        addOperator(new MoveText());
        addOperator(new MoveTextSetLeading());
        addOperator(new SetFontAndSize());
        addOperator(new ShowText());
        addOperator(new ShowTextAdjusted());
        addOperator(new SetTextLeading());
        addOperator(new SetMatrix());
        addOperator(new SetTextRenderingMode());
        addOperator(new SetTextRise());
        addOperator(new SetWordSpacing());
        addOperator(new SetTextHorizontalScaling());
        addOperator(new ShowTextLine());
        addOperator(new ShowTextLineAndSpace());
    }

    public String getText(PDFormXObject form) throws IOException {
        stringBuilder.setLength(0);

        processChildStream(form, new PDPage()); 

        return stringBuilder.toString();
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        stringBuilder.append(unicode);
    }

    final StringBuilder stringBuilder = new StringBuilder();
}

(SimpleXObjectTextStripper)

(I included the import statements as PDFBox contains several classes of similar names here.)

Using this simple custom stripper class, one can extract text content from field appearances like this:

public void showNormalFieldAppearanceTexts(PDDocument document) throws IOException {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();

    if (acroForm != null) {
        SimpleXObjectTextStripper stripper = new SimpleXObjectTextStripper();

        for (PDField field : acroForm.getFieldTree()) {
            if (field instanceof PDTerminalField) {
                PDTerminalField terminalField = (PDTerminalField) field;
                System.out.println();
                System.out.println("* " + terminalField.getFullyQualifiedName());
                for (PDAnnotationWidget widget : terminalField.getWidgets()) {
                    PDAppearanceDictionary appearance = widget.getAppearance();
                    if (appearance != null) {
                        PDAppearanceEntry normal = appearance.getNormalAppearance();
                        if (normal != null) {
                            Map<COSName, PDAppearanceStream> streams = normal.isSubDictionary() ? normal.getSubDictionary() :
                                Collections.singletonMap(COSName.DEFAULT, normal.getAppearanceStream());
                            for (Map.Entry<COSName, PDAppearanceStream> entry : streams.entrySet()) {
                                String text = stripper.getText(entry.getValue());
                                System.out.printf("  * %s: %s\n", entry.getKey().getName(), text);
                            }
                        }
                    }
                }
            }
        }
    }
}

(ExtractAppearanceText helper method)

Question:

I am trying to extract all text in a PDF along with its coordinates. I am using Apache PDFBox 2.0.8 and following the sample program DrawPrintTextLocations.

It seems to work mostly, but for certain PDFs I get negative values for the x and y coordinates of the bounding boxes. Refer to this PDF file for example.

My app assumes the coordinate system of a normal PDF (x goes from left to right and y goes top to bottom), so these are throwing my computations off.

Below is the relevant piece of code.

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.interactive.pagenavigation.PDThreadBead;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.List;

/**
 * This is an example on how to get some x/y coordinates of text and to show them in a rendered
 * image.
 *
 * @author Ben Litchfield
 * @author Tilman Hausherr
 */
public class DrawPrintTextLocations extends PDFTextStripper {
    private AffineTransform flipAT;
    private AffineTransform rotateAT;
    private AffineTransform transAT;

    private final float DPI = 200.0f;
    private final double PT2PX = DPI / 72.0;
    private final AffineTransform dpiAT = AffineTransform.getScaleInstance(PT2PX, PT2PX);

    private final String filename;
    static final int SCALE = 1;
    private Graphics2D g2d;
    private final PDDocument document;

    /**
     * Instantiate a new PDFTextStripper object.
     *
     * @param document
     * @param filename
     * @throws IOException If there is an error loading the properties.
     */
    public DrawPrintTextLocations(PDDocument document, String filename) throws IOException {
        this.document = document;
        this.filename = filename;
    }

    /**
     * This will print the documents data.
     *
     * @param args The command line arguments.
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String pdfLoc = "/debug/pdfbox/p2_VS008PI.pdf";

        if (args.length == 1) {
            pdfLoc = args[0];
        }

        try (PDDocument document = PDDocument.load(new File(pdfLoc))) {
            DrawPrintTextLocations stripper = new DrawPrintTextLocations(document, pdfLoc);
            stripper.setSortByPosition(true);

            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                stripper.stripPage(page);
            }
        }
    }

    private void stripPage(int page) throws IOException {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        BufferedImage image = pdfRenderer.renderImageWithDPI(page, DPI);

        PDPage pdPage = document.getPage(page);
        PDRectangle cropBox = pdPage.getCropBox();

        // flip y-axis
        flipAT = new AffineTransform();
        flipAT.translate(0, pdPage.getBBox().getHeight());
        flipAT.scale(1, -1);

        // page may be rotated
        rotateAT = new AffineTransform();
        int rotation = pdPage.getRotation();
        if (rotation != 0) {
            PDRectangle mediaBox = pdPage.getMediaBox();
            switch (rotation) {
                case 90:
                    rotateAT.translate(mediaBox.getHeight(), 0);
                    break;
                case 270:
                    rotateAT.translate(0, mediaBox.getWidth());
                    break;
                case 180:
                    rotateAT.translate(mediaBox.getWidth(), mediaBox.getHeight());
                    break;
                default:
                    break;
            }
            rotateAT.rotate(Math.toRadians(rotation));
        }

        // cropbox
        transAT = AffineTransform.getTranslateInstance(-cropBox.getLowerLeftX(), cropBox.getLowerLeftY());

        g2d = image.createGraphics();
        g2d.setStroke(new BasicStroke(0.1f));
        g2d.scale(SCALE, SCALE);

        setStartPage(page + 1);
        setEndPage(page + 1);

        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy);

        g2d.dispose();

        String imageFilename = filename;
        int pt = imageFilename.lastIndexOf('.');
        imageFilename = imageFilename.substring(0, pt) + "-marked-" + (page + 1) + ".png";
        ImageIO.write(image, "png", new File(imageFilename));
    }

    /**
     * Override the default functionality of PDFTextStripper.
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {

        for (TextPosition text : textPositions) {

            AffineTransform at = text.getTextMatrix().createAffineTransform();
            PDFont font = text.getFont();

            BoundingBox bbox = font.getBoundingBox();

            float xadvance = font.getWidth(text.getCharacterCodes()[0]); // todo: should iterate all chars
            Rectangle2D.Float rect1 = new Rectangle2D.Float(0, bbox.getLowerLeftY(), xadvance, bbox.getHeight());

            if (font instanceof PDType3Font) {
                at.concatenate(font.getFontMatrix().createAffineTransform());
            } else {
                at.scale(1 / 1000f, 1 / 1000f);
            }

            Shape s1 = at.createTransformedShape(rect1);
            s1 = flipAT.createTransformedShape(s1);
            s1 = rotateAT.createTransformedShape(s1);
            s1 = dpiAT.createTransformedShape(s1);

            g2d.setColor(Color.blue);
            g2d.draw(s1);

            Rectangle bounds = s1.getBounds();
            if (bounds.getX() < 0 || bounds.getY() < 0) {
                // THIS is where things go wrong
                // i need these coordinates to be +ve
                System.out.println(bounds.toString());
                System.out.println(rect1.toString());
            }
        }
    }
}

And here is some snippet of the output from the first page of the above pdf.

SECTION 10 – INSURANCE & OTHER FINANCIAL RESOURCES
java.awt.Rectangle[x=-3237,y=40,width=19,height=43]
java.awt.Rectangle[x=-3216,y=40,width=20,height=43]
java.awt.Rectangle[x=-3194,y=40,width=23,height=43]
java.awt.Rectangle[x=-3170,y=40,width=22,height=43]


Answer:

The characters with negative coordinates are outside the cropbox (also characters with coordinates bigger than the cropbox height / width). See the cropbox as a cutout from something bigger. To see the whole thing, run this code

pdPage.setCropBox(pdPage.getMediaBox());

for each page of your PDF and then save and view it.

Per your comment

Following your advice of setting the crop box to the media box, actually changed the whole on screen appearance of the pdf, now i got 3 pages collated as one.

This suggests that physically, this is a folded sheet that has 3 pages on each side. The online PDF displays this as 6 pages for easy viewing on a computer.
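If the goal is merely to ignore text outside the visible area, one option is to filter the glyph bounds against the size of the rendered image. The following is a sketch under assumptions: it presumes the bounds are in pixels with the origin in the upper left corner of the rendered crop box, as in the question's code, and the crop box size of 612 x 792 points used in main is a hypothetical value, not taken from the PDF in question.

```java
import java.awt.Rectangle;

class CropBoxFilterSketch {

    /**
     * True if the glyph bounds lie completely inside the rendered image,
     * i.e. inside the crop box at the given resolution.
     */
    static boolean isInsideCropBox(Rectangle glyphBounds, double cropWidthPt, double cropHeightPt, double dpi) {
        double pointsToPixels = dpi / 72.0;
        Rectangle visible = new Rectangle(0, 0,
                (int) Math.ceil(cropWidthPt * pointsToPixels),
                (int) Math.ceil(cropHeightPt * pointsToPixels));
        return visible.contains(glyphBounds);
    }

    public static void main(String[] args) {
        // first rectangle from the output in the question: far left of the visible area
        Rectangle hidden = new Rectangle(-3237, 40, 19, 43);
        // hypothetical rectangle inside the visible area
        Rectangle shown = new Rectangle(100, 40, 19, 43);

        System.out.println(isInsideCropBox(hidden, 612, 792, 200)); // false
        System.out.println(isInsideCropBox(shown, 612, 792, 200));  // true
    }
}
```

Whether partially visible glyphs should count as inside is a design choice; using intersects instead of contains would keep them.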

Question:

I am working with a large document, but I have extracted the page giving trouble here. The y-coordinates I get back for the lines in the table seem to be stretched beyond the coordinates of the text. There seems to be some transformation going on, but I cannot find it. If possible I would like to fix the problem within the scope of the PDFGraphicsStreamEngine as extended below, and not have to go back to the drawing board with the other input streams available in PDFBox.

I have extended PDFTextStripper to acquire the location of every text glyph on the page:

public class MyPDFTextStripper extends PDFTextStripper {

    private List<TextPosition> tps;

    public MyPDFTextStripper() throws IOException {
        tps = new ArrayList<>();
    }

    @Override
    protected void writeString
            (String text,
             List<TextPosition> textPositions)
            throws IOException {
        tps.addAll(textPositions);
    }

    List<TextPosition> getTps() {
        return tps;
    }
}

and I have extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:

public class LineCatcher extends PDFGraphicsStreamEngine
{
    private final GeneralPath linePath = new GeneralPath();
    private List<Line2D> lines;

    LineCatcher(PDPage page)
    {
        super(page);
        lines = new ArrayList<>();
    }

    List<Line2D> getLines() {
        return lines;
    }

    @Override
    public void strokePath() throws IOException
    {
        Rectangle2D rect = linePath.getBounds2D();
        Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                rect.getX() + rect.getWidth(),
                rect.getY() + rect.getHeight());
        lines.add(line);
        linePath.reset();
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {linePath.moveTo(x, y);}
    @Override
    public void lineTo(float x, float y) throws IOException
    {linePath.lineTo(x, y);}
    @Override
    public Point2D getCurrentPoint() throws IOException
    {return linePath.getCurrentPoint();}

    //all other overridden methods can be left empty for the purposes of this problem.
}

I have written a simple program to demonstrate the problem:

public class PageAnalysis {
    public static void main(String[] args) {
        try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
            PDPage page = doc.getPage(0);

            MyPDFTextStripper ts = new MyPDFTextStripper();
            ts.getText(doc);
            List<TextPosition> tps = ts.getTps();

            System.out.println("Y coordinates in text:");
            Set<Integer> ySet = new HashSet<>();
            for (TextPosition tp: tps) {
                ySet.add((int)tp.getY());
            }
            List<Integer> yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();


            System.out.println("Y coordinates in lines:");
            LineCatcher lineCatcher = new LineCatcher(page);
            lineCatcher.processPage(page);
            List<Line2D> lines = lineCatcher.getLines();
            ySet = new HashSet<>();
            for (Line2D line: lines) {
                ySet.add((int)line.getY2());
            }
            yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output from this is:

Y coordinates in text:
66  79  106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780 
Y coordinates in lines:
322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713

The last number in the text list corresponds to the y-coordinate of the page number at the bottom. I cannot find what is going on with the y-coordinates of the lines, though it seems to be those which have been transformed (the media box is the same here as it was for the text, and it fits in with the text positions). The current transformation matrix has 1.0 for yScaling also.


Answer:

Indeed, the PDFTextStripper has the bad habit of transforming coordinates into a very un-PDF'ish coordinate system, one with the origin in the upper left of the page and y coordinates increasing downwards.

For a TextPosition tp, therefore, you should not use

tp.getY()

but instead

tp.getTextMatrix().getTranslateY()

Unfortunately these coordinates still may be translated even though they are nearer to the actual PDF default coordinate system, cf. this answer: These coordinates still are transformed to have the origin in the lower left corner of the crop box.

Thus, you really need something like this:

tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

where cropBox is the PDRectangle retrieved as

PDRectangle cropBox = doc.getPage(n).getCropBox();

where in turn n is the zero-based index of the page with that content.
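If you only have the values from tp.getY() at hand, the relation between the two coordinate systems can be sketched as plain arithmetic. This assumes an unrotated page and that the y value is measured downwards from the top edge of the crop box; the method name toPdfY and the numbers in main are hypothetical.

```java
class TextYConversionSketch {

    /**
     * Converts a y coordinate measured downwards from the top edge of the
     * crop box into PDF default user space (origin bottom left, y upwards).
     */
    static double toPdfY(double yFromTop, double cropLowerLeftY, double cropHeight) {
        double yFromCropBottom = cropHeight - yFromTop;
        return yFromCropBottom + cropLowerLeftY;
    }

    public static void main(String[] args) {
        // hypothetical crop box: lower left y = 0, height = 842
        System.out.println(toPdfY(780, 0, 842));  // 62.0, i.e. near the bottom of the page
        // hypothetical crop box shifted upwards by 50 units
        System.out.println(toPdfY(100, 50, 700)); // 650.0
    }
}
```

Once both the text positions and the line coordinates are expressed in default user space like this, they can be compared directly.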

Question:

I am trying to retrieve some input fields from a PDF with PDFBox (2.0.7) without success.

In detail, I have the following (PDF available here: https://file.io/q56S4r or here http://s000.tinyupload.com/index.php?file_id=38385451581058382678 ). The PDF contains 3 text fields with the same name "Text1". In addition, Acrobat Pro represents those fields as seen in the screenshot from Acrobat Pro:

Instead of retrieving 3 fields, the following code returns a list with just this object: "Text1{type: PDTextField value: null}"

PDDocument pdfDocument = PDDocument.load(input);
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List<PDField> fields = acroForm.getFields();
for (PDField field : fields) {
   ...
}

Is there a way to read all fields even if they have the same name? Or is this bad practice, and the correct answer is to give them unique names?


Answer:

Technically the PDF has only one field defined. If you open the PDF in Acrobat Reader and enter a value in one of the fields, the other two fields are immediately filled with the same value. If you really want three different values, you need to specify a unique name for each of the fields.

Question:

I've been doing some experiments with PDFBox and I'm currently stuck on an issue which I suspect has something to do with the coordinate system. I'm extending PDFTextStripper to get the X and Y of each character in a PDF page. Originally I created an image with ImageIO, printed the text at the positions I received, and put a little mark (rectangles with different colors) at the bottom of each reference I wanted, and everything seemed fine. But now, to avoid losing the style from the PDF, I want to overlay the PDF itself with those marks, and the coordinates I get don't match in PDPageContentStream. How can I match the coordinates I get from PDFTextStripper -> processTextPosition to the visual coordinates?

Using version 1.8.11


Answer:

As discussed in the comments, this is the 1.8 version of the DrawPrintTextLocations tool, which is part of the examples collection of the 2.0 version and is based on the better known PrintTextLocations example. Unlike the 2.0 version, this one does not show the font bounding boxes, only the text extraction sizes, which are about the height of a small glyph (a, e, etc.). This size is used as a heuristic for text extraction, and it is the cause of the "the textpositions i'm getting are halfed" effect here. If you need bounding boxes, better use 2.0 (although those boxes may be too big). To get exact sizes, you would have to calculate the path of each glyph and take its bounds; again, you'd need the 2.0 version for that.

public class DrawPrintTextLocations extends PDFTextStripper
{
    private BufferedImage image;
    private final String filename;
    static final int SCALE = 4;
    private Graphics2D g2d;
    private final PDDocument document;

    /**
     * Instantiate a new PDFTextStripper object.
     *
     * @param document
     * @param filename
     * @throws IOException If there is an error loading the properties.
     */
    public DrawPrintTextLocations(PDDocument document, String filename) throws IOException
    {
        this.document = document;
        this.filename = filename;
    }

    /**
     * This will print the documents data.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            try
            {
                document = PDDocument.load(new File(args[0]));

                DrawPrintTextLocations stripper = new DrawPrintTextLocations(document, args[0]);
                stripper.setSortByPosition(true);

                for (int page = 0; page < document.getNumberOfPages(); ++page)
                {
                    stripper.stripPage(page);
                }
            }
            finally
            {
                if (document != null)
                {
                    document.close();
                }
            }
        }
    }

    private void stripPage(int page) throws IOException
    {
        PDPage pdPage = (PDPage) document.getDocumentCatalog().getAllPages().get(page);
        image = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 72 * SCALE);
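        // findCropBox() also considers inherited values and falls back to the media box,
        // whereas getCropBox() may return null in 1.8 if the page has no own CropBox entry
        PDRectangle cropBox = pdPage.findCropBox();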

        g2d = image.createGraphics();
        g2d.setStroke(new BasicStroke(0.1f));
        g2d.scale(SCALE, SCALE);

        setStartPage(page + 1);
        setEndPage(page + 1);

        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy);

        // beads in green
        g2d.setStroke(new BasicStroke(0.4f));
        List<PDThreadBead> pageArticles = pdPage.getThreadBeads();
        for (PDThreadBead bead : pageArticles)
        {
            PDRectangle r = bead.getRectangle();
            GeneralPath p = transform(r, Matrix.getTranslatingInstance(-cropBox.getLowerLeftX(), cropBox.getLowerLeftY()));
            AffineTransform flip = new AffineTransform();
            flip.translate(0, pdPage.findCropBox().getHeight());
            flip.scale(1, -1);
            Shape s = flip.createTransformedShape(p);
            g2d.setColor(Color.green);
            g2d.draw(s);
        }

        g2d.dispose();

        String imageFilename = filename;
        int pt = imageFilename.lastIndexOf('.');
        imageFilename = imageFilename.substring(0, pt) + "-marked-" + (page + 1) + ".png";
        ImageIO.write(image, "png", new File(imageFilename));
    }

    /**
     * Override the default functionality of PDFTextStripper.
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition text : textPositions)
        {
            System.out.println("String[" + text.getXDirAdj() + ","
                    + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                    + text.getXScale() + " height=" + text.getHeightDir() + " space="
                    + text.getWidthOfSpace() + " width="
                    + text.getWidthDirAdj() + "]" + text.getCharacter());

            // in red:
            // show rectangles with the "height" (not a real height, but used for text extraction 
            // heuristics, it is 1/2 of the bounding box height and starts at y=0)
            Rectangle2D.Float rect = new Rectangle2D.Float(
                    text.getXDirAdj(),
                    (text.getYDirAdj() - text.getHeightDir()),
                    text.getWidthDirAdj(),
                    text.getHeightDir());
            g2d.setColor(Color.red);
            g2d.draw(rect);
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println("Usage: java " + DrawPrintTextLocations.class.getName() + " <input-pdf>");
    }


    /**
     * Transforms the given point by this matrix.
     *
     * @param x x-coordinate
     * @param y y-coordinate
     */
    private Point2D.Float transformPoint(Matrix m, float x, float y)
    {
        float[][] values = m.getValues();
        float a = values[0][0];
        float b = values[0][1];
        float c = values[1][0];
        float d = values[1][1];
        float e = values[2][0];
        float f = values[2][1]; // y translation; [2][2] would be the constant 1 of the matrix
        return new Point2D.Float(x * a + y * c + e, x * b + y * d + f);
    }

    /**
     * Returns a path which represents this rectangle having been transformed by the given matrix.
     * Note that the resulting path need not be rectangular.
     */
    private GeneralPath transform(PDRectangle r, Matrix matrix)
    {
        float x1 = r.getLowerLeftX();
        float y1 = r.getLowerLeftY();
        float x2 = r.getUpperRightX();
        float y2 = r.getUpperRightY();

        Point2D.Float p0 = transformPoint(matrix, x1, y1);
        Point2D.Float p1 = transformPoint(matrix, x2, y1);
        Point2D.Float p2 = transformPoint(matrix, x2, y2);
        Point2D.Float p3 = transformPoint(matrix, x1, y2);

        GeneralPath path = new GeneralPath();
        path.moveTo((float) p0.getX(), (float) p0.getY());
        path.lineTo((float) p1.getX(), (float) p1.getY());
        path.lineTo((float) p2.getX(), (float) p2.getY());
        path.lineTo((float) p3.getX(), (float) p3.getY());
        path.closePath();
        return path;
    }

}

Question:

I'm trying to find location of text elements in PDF. I've extended PDFTextStripper for this purpose. I'm using multi-page LaTeX-produced PDF for testing.

public class TextFinder extends PDFTextStripper {
    private static final Logger logger =
        LoggerFactory.getLogger(TextFinder.class);

    private PDRectangle mediaBox;

    public static class CMProcessor extends OperatorProcessor {

        @Override
        public void process(PDFOperator operator, List<COSBase> arguments)
                throws IOException {

            if ("cm".equals(operator.getOperation())) {
                logger.debug("CM operation");
            }
        }
    }

    private CMProcessor cmProcessor = new CMProcessor();

    public TextFinder() throws IOException {
        this.registerOperatorProcessor("cm", cmProcessor);
    }

    @Override
    protected void startPage(PDPage page) throws IOException {
        super.startPage(page);
        mediaBox = page.findMediaBox();
        logger.debug(String.format("MEDIA (%f,%f) (%f,%f)",
            mediaBox.getLowerLeftX(), mediaBox.getLowerLeftY(),
            mediaBox.getUpperRightX(), mediaBox.getUpperRightY()));
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        for (TextPosition position : textPositions) {
            float x = position.getXDirAdj();
            float y = mediaBox.getHeight() - position.getYDirAdj();
            logger.debug(String.format("(%f,%f) (%f,%f)", x, y,
                x + position.getWidthDirAdj(), y + position.getHeightDir()));
        }
        super.writeString(text, textPositions);
    }
}

The problem I'm facing is that all positions seem to be translated in such a way that (0, 0) is the position of the leftmost, topmost text element:

MEDIA (0.000000,0.000000) (595.270020,841.890015)
(0.000000,0.000000) (11.486961,14.255401)
(11.486961,0.000000) (20.660002,14.255401)
(20.660002,0.000000) (36.733482,14.255401)

Thanks to mkl, I now know that the problem is caused by the custom OperatorProcessor. Without it, everything works just fine. But I need the operator processor, because I use it for finding images. Still, I don't quite understand why adding a custom processor affects the behavior of PDFTextStripper.


Answer:

why adding custom processor affects behavior of PDFTextStripper.

The reason is that your adding

registerOperatorProcessor("cm", cmProcessor);

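actually replaces the existing processor for the cm operator. The original one updates the current transformation matrix, which is needed to determine the correct position on the page.

The solution is to chain processors: let your processor do its own work and then hand the operator on to the one it replaced.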
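
Chaining can be sketched with a small model that needs nothing but the JDK. Everything in it (the Handler interface, the registry map, the method names) is hypothetical and merely stands in for the PDFBox operator registry; it is not PDFBox API. The point is only that registering under an existing key must keep a reference to the previous handler and forward to it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stdlib-only model of an operator registry; all names here are hypothetical. */
final class ChainedProcessorSketch {

    /** Stands in for an operator processor. */
    interface Handler {
        void process(String operator, List<String> out);
    }

    /** Runs the extra work first, then hands the operator on to the handler it replaced. */
    static final class ChainingHandler implements Handler {
        private final Handler extra;
        private final Handler original;

        ChainingHandler(Handler extra, Handler original) {
            this.extra = extra;
            this.original = original;
        }

        @Override
        public void process(String operator, List<String> out) {
            extra.process(operator, out);
            if (original != null) {
                original.process(operator, out);
            }
        }
    }

    /** Registers extra for the operator without losing the handler already registered. */
    static void registerChained(Map<String, Handler> registry, String operator, Handler extra) {
        Handler original = registry.get(operator);
        registry.put(operator, new ChainingHandler(extra, original));
    }

    /** Builds a registry with a default "cm" handler, chains a logger, and runs "cm" once. */
    static List<String> demo() {
        Map<String, Handler> registry = new HashMap<>();
        registry.put("cm", (op, out) -> out.add("default: update transformation matrix"));
        registerChained(registry, "cm", (op, out) -> out.add("custom: saw " + op));

        List<String> log = new ArrayList<>();
        registry.get("cm").process("cm", log);
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In the TextFinder from the question, the same idea would presumably mean that CMProcessor keeps an instance of the stock cm processor and forwards process(...) to it after logging, or extends that stock processor and calls super.process(...).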

Question:

I am using Apache PDFBox for configuration of PDTextField's on a PDF document where I load Lato onto the document using:

font = PDType0Font.load(
    @j_pd_document,
    java.io.FileInputStream.new('/path/to/Lato-Regular.ttf')
) # => Lato-Regular

font_name = pd_default_resources.add(font).get_name # => F4

I then pass the font_name into a default_appearance_string for the PDTextField like so:

j_text_field.set_default_appearance("/#{font_name} 0 Tf 0 g") # where font_name is
                                                              # passed in from above

The issue now occurs when I proceed to invoke setValue on the PDTextField. Because I set the font_size in the defaultAppearanceString to 0, according to the library's example, the text should scale itself to fit in the text box's given area. However, the behaviour of this 'scale-to-fit' is inconsistent for certain fields: it does not always choose the largest font size to fit in the PDTextField. Might there be any further configuration that might allow for this to happen? Below are the PDFs where I've noticed this problem occurring.

Unfilled, with fonts loaded: http://www.filedropper.com/0postfontload

Filled, with inconsistent textbox text sizing: http://www.filedropper.com/file_327

Side Note: I am using PDFBox through JRuby, which is just an integration layer that allows Ruby to invoke Java libraries. All Java methods of the library are available; a Java method like thisExampleMethod has a one-to-one translation into the Ruby method this_example_method.


Updates

In response to comments, the appearances that are incorrect in the second uploaded file example are:

  • 1st page Resident Name field (two text fields that have text that is too small for the given input field size)
  • 2nd page Phone fields (four text fields that have text that overflows the given input field size)

Answer:

Especially the appearances of the Resident Name fields, the Phone fields, and the Care Providers Address fields appear conspicuous. Only the former two are mentioned by the OP.

Let's inspect these fields; all screen shots are made using Adobe Reader DC on MS Windows:

The Resident Name fields

The filled in Resident Name fields look like this

While the height is appropriate, the glyphs are narrower than they should be. Actually this effect can already be seen in the original PDF:

This horizontal compression is caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:

  • The widget rectangles: [ 45.72 601.44 118.924 615.24 ] and [ 119.282 601.127 192.486 614.927 ], i.e. 73.204*13.8 in both cases.
  • The appearance bounding box: [ 0 0 147.24 13.8 ], i.e. 147.24*13.8.

So they have the same height but the appearance bounding box is approximately twice as wide as the widget rectangle. Thus, the text drawn normally in the appearance stream gets compressed to half its width when the appearance is displayed in the widget rectangle.

When setting the value of a field PDFBox unfortunately re-uses the appearance stream as is and only updates details from the default appearance, i.e. font name, font size, and color, and the actual text value, apparently assuming the other properties of the appearance are as they are for a reason. Thus, the PDFBox output also shows this horizontal compression

To make PDFBox create a proper appearance, it is necessary to remove the old appearances before setting the new value.

The Phone fields

The filled in Phone fields look like this

and again there is a similar display in the original file

That only the first two letters are shown even though there is enough space for the whole word, is due to the configuration of these fields: They are configured as comb fields with a maximum length of 2 characters.

To have a value here set with PDFBox displayed completely and not so spaced out, you have to remove the maximum length (or at least have to make it no less than the length of your value) and unset the comb flag.

The Care Providers Address fields

Filled in they look like this:

Originally they look similar:

This vertical compression is again caused by the field widget rectangles having a different aspect ratio than the respectively matching normal appearance stream bounding box:

  • A widget rectangle: [ 278.6 642.928 458.36 657.96 ], i.e. 179.76*15.032.
  • The appearance bounding box: [ 0 0 179.76 58.56 ], i.e. 179.76*58.56.

Just like in the case of the Resident Name fields above it is necessary to remove the old appearances before setting the new value to make PDFBox create a proper appearance.
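
The two mismatches described so far can be put into numbers. The following is a minimal sketch with a made-up helper class; it assumes the appearance stream has no Matrix entry of its own, so that the viewer simply scales the bounding box to fit the widget rectangle. The arrays are the rectangle and bounding box values quoted above.

```java
/** Hypothetical helper: how strongly is an appearance bounding box squeezed into a widget rectangle? */
final class AppearanceScale {
    private AppearanceScale() {
    }

    /** Returns {horizontal factor, vertical factor} for mapping bbox onto rect (both as [llx lly urx ury]). */
    static double[] scale(double[] rect, double[] bbox) {
        double rectWidth = rect[2] - rect[0];
        double rectHeight = rect[3] - rect[1];
        double bboxWidth = bbox[2] - bbox[0];
        double bboxHeight = bbox[3] - bbox[1];
        return new double[] { rectWidth / bboxWidth, rectHeight / bboxHeight };
    }

    public static void main(String[] args) {
        // Resident Name: first widget rectangle and its appearance bounding box.
        double[] name = scale(new double[] { 45.72, 601.44, 118.924, 615.24 },
                              new double[] { 0, 0, 147.24, 13.8 });
        // Care Providers Address: widget rectangle and its appearance bounding box.
        double[] address = scale(new double[] { 278.6, 642.928, 458.36, 657.96 },
                                 new double[] { 0, 0, 179.76, 58.56 });
        System.out.printf("Resident Name: x=%.3f y=%.3f%n", name[0], name[1]);
        System.out.printf("Care Providers Address: x=%.3f y=%.3f%n", address[0], address[1]);
    }
}
```

The horizontal factor for the Resident Name field comes out at roughly 0.5 and the vertical factor for the Care Providers Address field at roughly 0.26, which matches the compressed glyphs in the screen shots.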

A complication

Actually there is an additional issue when filling in the Care Providers Address fields, after removing the old appearances they look like this:

This is due to a shortcoming of PDFBox: these fields are configured as multi line text fields. For single line text fields PDFBox properly calculates the font size based on the content and later makes sure that the text fits well vertically. For multi line fields it proceeds very crudely: it selects a hard coded font size of 12 and does not fine tune the vertical position; see the code of the AppearanceGeneratorHelper methods calculateFontSize(PDFont, PDRectangle) and insertGeneratedAppearance(PDAnnotationWidget, PDAppearanceStream, OutputStream).

As in your form these address fields anyways are only one line high, an obvious solution would be to make these fields single line fields, i.e. clear the Multiline flag.

Example code

Using Java one can implement the solutions explained above like this:

final int FLAG_MULTILINE = 1 << 12;
final int FLAG_COMB = 1 << 24;

PDDocument doc = PDDocument.load(originalStream);
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();

PDType0Font font = PDType0Font.load(doc, fontStream, false);
String font_name = acroForm.getDefaultResources().add(font).getName();

for (PDField field : acroForm.getFieldTree()) {
    if (field instanceof PDTextField) {
        PDTextField textField = (PDTextField) field;
        textField.getCOSObject().removeItem(COSName.MAX_LEN);
        textField.getCOSObject().setFlag(COSName.FF, FLAG_COMB | FLAG_MULTILINE, false);
        textField.setDefaultAppearance(String.format("/%s 0 Tf 0 g", font_name));
        textField.getWidgets().forEach(w -> w.getAppearance().setNormalAppearance((PDAppearanceEntry)null));
        textField.setValue("Test");
    }
}

(FillInForm test testFill0DropOldAppearanceNoCombNoMaxNoMultiLine)
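
What the setFlag call with false does to the field flags can be illustrated with plain bit arithmetic. This is only a sketch using a made-up class, independent of PDFBox; the two constants are the same bit masks as in the code above (bit positions 13 and 25 of the Ff entry, counted from 1).

```java
/** Hypothetical illustration of clearing bits in a field flags value; no PDFBox types involved. */
final class FieldFlagsSketch {
    static final int FLAG_MULTILINE = 1 << 12;
    static final int FLAG_COMB = 1 << 24;

    private FieldFlagsSketch() {
    }

    /** Returns flags with every bit of mask cleared; all other bits stay untouched. */
    static int clear(int flags, int mask) {
        return flags & ~mask;
    }

    /** Returns true if every bit of mask is set in flags. */
    static boolean isSet(int flags, int mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        // Illustrative flags value: a comb, multi line field that is also read-only (bit 1).
        int flags = FLAG_COMB | FLAG_MULTILINE | 1;
        int cleared = clear(flags, FLAG_COMB | FLAG_MULTILINE);
        System.out.println("comb still set: " + isSet(cleared, FLAG_COMB));
        System.out.println("multiline still set: " + isSet(cleared, FLAG_MULTILINE));
        System.out.println("read-only kept: " + isSet(cleared, 1));
    }
}
```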

Screen shots of the output of the example code

The Resident Name field value now is not horizontally compressed anymore:

The Phone and Care Providers Address fields also look appropriate now:

Question:

I'm creating a new PDF using PDFBox, and I want to set a tooltip for a PDTextField that is shown on mouse hover.

In the official documentation there is a getToolTip() method, but I didn't find a corresponding set method.

here is the expected output:


Answer:

You wrote that you create the PDF from scratch, so you also create the text field yourself (rather than adding a tooltip to an existing text field). Since you neither posted code nor mentioned the PDFBox version you are using, I have nothing to go on, but in general you could do it like this (the TU key is the entry Acrobat shows as the tooltip; it is called the alternate field name):

    PDDocument doc = new PDDocument();

    PDTextField textbox = new PDTextField(doc.getDocumentCatalog().getAcroForm());
    textbox.setAlternateFieldName("Your tooltip text");
    textbox.set...  //(set all the other attributes)

This code assumes that you are using PDFBox 2.x. In 1.8.x you have to do a bit more...

Question:

Link to example PDF: click here. Here you can see that many labels in the left are clipped (because of some clipping instructions)

When I use PDFTextStripper, it prints all text, including text which is actually cut off or hidden in the example PDF file. I have already tried the solution described here; however, it makes things even worse because it removes much of the text at the top plus some text at the beginning of each row. Is there any other way to show only visible characters and skip all overlapped ones using PDFBox? Or is there any other tool which could return only visible text? Thanks in advance.


Answer:

The reason why the PDFVisibleTextStripper from this answer the OP referenced does not work is that the calculation of the end of a character baseline in the overridden processTextPosition does not take page rotation into account. If you change that method, though, to only test the start of each character baseline and ignore the end, it works fairly well for the document at hand:

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    Vector start = textMatrix.transform(new Vector(0, 0));

    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    if (area == null || area.contains(start.getX(), start.getY()))
        super.processTextPosition(text);
}
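
The visibility condition used in this override can be tried out in isolation with java.awt.geom alone. The sketch below uses a made-up class and an illustrative rectangular clip; in the stripper the Area comes from the graphics state and the point from the text matrix.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

/** Hypothetical stand-alone version of the clip test; only JDK classes are used. */
final class ClipTestSketch {
    private ClipTestSketch() {
    }

    /** Mirrors the condition above: visible if there is no clip or the baseline start lies inside it. */
    static boolean isVisible(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // Illustrative clip: x from 100 to 300, y from 100 to 150.
        Area clip = new Area(new Rectangle2D.Double(100, 100, 200, 50));
        System.out.println("inside clip:  " + isVisible(clip, 150, 120));
        System.out.println("left of clip: " + isVisible(clip, 50, 120));
        System.out.println("no clip set:  " + isVisible(null, 50, 120));
    }
}
```

Testing only the start point is a deliberate simplification: a glyph that starts inside the clip but extends beyond it is still reported as visible.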

With this processTextPosition override the result of the text extraction (with SortByPosition set to true) is:

Profit & Loss 12 Month Recap
Property: 8151 W. 183rd Street
Monthly recap 05/01/16 - 04/30/17  (cash basis)
MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
INCOME
    4000 RENTAL INCOME
        4001 Base Rent 343,002.59 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 0.00 76,090.22 38,598.49 66,634.74 930,823.06
        4004 Prepaid Rent Inco -165,742.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 38,045.11 -38,045.11 0.00 0.00 -165,742.50
        4000 Total RENTAL INC 177,260.09 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 38,045.11 38,045.11 38,598.49 66,634.74 765,080.56
    4200 INCOME CHARGEB
        4205 Property Tax Reco 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 3,696.62 4,250.00 50,446.62
        4210 CAM Recoveries 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 57,000.00
        4200 Total INCOME CH 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 8,446.62 9,000.00 107,446.62
    4600 OTHER INCOME
        4610 Late / NSF Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
        4600 Total OTHER INC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
TOTAL INCOME 186,260.09 47,045.11 47,045.11 47,045.11 75,081.36 131,153.86 75,081.36 48,439.83 50,873.72 47,045.11 47,045.11 75,634.74 877,750.51
EXPENSE
    6000 PROFESSIONAL FE
        6010 Professional Fees 0.00 0.00 0.00 2,500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00
        6020 Legal Fees 0.00 0.00 0.00 4,592.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 14,822.85
        6000 Total PROFESSIO 0.00 0.00 0.00 7,092.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 17,322.85
    6100 UTILITIES
        6105 Water & Sewer 0.00 0.00 0.00 21.21 0.00 0.00 25.81 0.00 0.00 31.91 0.00 0.00 78.93
        6110 Electricity 1,000.91 358.23 390.43 350.71 353.69 0.00 666.39 381.97 486.85 449.62 480.21 486.81 5,405.82
        6125 Trash Removal 229.54 231.34 232.56 232.78 231.66 240.94 240.94 241.40 241.40 518.97 259.18 0.00 2,900.71
        6100 Total UTILITIES 1,230.45 589.57 622.99 604.70 585.35 240.94 933.14 623.37 728.25 1,000.50 739.39 486.81 8,385.46
    6200 REPAIR & MAINTEN
        6210 Field & Grounds - 3,094.00 0.00 0.00 2,313.84 1,009.50 0.00 1,439.58 1,302.75 600.00 0.00 0.00 1,909.73 11,669.40
        6211 Irrigation / Sprinkle 0.00 0.00 0.00 0.00 0.00 1,121.08 350.00 0.00 0.00 0.00 0.00 0.00 1,471.08
        6215 Landscape / Lawn 565.71 565.71 565.71 565.71 565.71 565.71 1,165.71 0.00 0.00 0.00 0.00 495.00 5,054.97
        6220 Sanitary Sewers 0.00 0.00 0.00 950.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 950.00
        6221 Storm Drains 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00 0.00 2,500.00
        6223 Snow Removal 1,365.00 3,440.00 0.00 0.00 0.00 0.00 0.00 1,350.00 4,440.00 4,106.00 790.00 2,340.00 17,831.00
        6228 Ceiling Tiles 0.00 0.00 0.00 0.00 53.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 53.30
        6231 Building - General 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 634.65 634.65
        6233 Roof / Flashing 1,840.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 764.00 0.00 2,604.00
        6234 Electrical Repairs 0.00 0.00 0.00 395.00 0.00 0.00 960.00 90.00 0.00 0.00 0.00 0.00 1,445.00
        6236 Plumbing Repairs 0.00 0.00 3,316.59 0.00 2,315.95 0.00 930.00 812.17 0.00 0.00 0.00 0.00 7,374.71
        6237 Fire & Life Safety 0.00 0.00 0.00 0.00 0.00 150.00 0.00 0.00 660.00 0.00 0.00 1,550.00 2,360.00
        6238 Lighting Supplies 0.00 0.00 0.00 0.00 0.00 0.00 875.00 193.05 0.00 0.00 0.00 0.00 1,068.05
Profit & Loss 12 Month Recap          05/02/17 11:13 AM Page 1 of rentmanager.com - property management systems   rev.12.180
MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
        6240 Lock & Key 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.59 0.00 0.00 0.00 14.59
        6242 HVAC Expense 4,375.00 0.00 1,370.00 2,043.25 0.00 0.00 0.00 415.00 1,326.00 1,835.00 0.00 0.00 11,364.25
        6251 Pest Control 0.00 71.07 0.00 71.07 0.00 0.00 71.07 71.07 0.00 71.07 0.00 71.07 426.42
        6200 Total REPAIR & M 11,239.71 4,076.78 5,252.30 6,338.87 3,944.46 1,836.79 5,791.36 4,234.04 7,040.59 6,012.07 4,054.00 7,000.45 66,821.42
    6300 JANITORIAL
        6310 Janitorial Services 1,935.00 1,935.00 1,935.00 1,935.00 1,935.00 0.00 3,870.00 1,935.00 1,935.00 1,935.00 1,995.00 1,995.00 23,340.00
        6320 Janitorial Supplies 79.74 260.01 79.74 90.84 113.14 0.00 170.58 0.00 365.61 90.84 0.00 153.01 1,403.51
        6300 Total JANITORIAL 2,014.74 2,195.01 2,014.74 2,025.84 2,048.14 0.00 4,040.58 1,935.00 2,300.61 2,025.84 1,995.00 2,148.01 24,743.51
    6400 PAYROLL
        6410 P/R Salaries - Offi 2,167.72 2,190.43 2,213.14 2,213.14 1,512.40 2,342.28 2,224.93 2,107.58 2,107.58 2,107.58 2,190.78 2,344.16 25,721.72
        6412 P/R Taxes - Office 179.87 167.56 169.30 169.30 115.70 179.18 170.21 161.23 238.16 231.10 199.89 196.42 2,177.92
        6420 Employee Insuran 76.06 76.14 76.22 199.23 104.30 161.06 152.29 137.91 139.14 139.14 143.91 175.02 1,580.42
        6421 Employee Benefit 3.54 2.40 87.37 141.59 35.59 114.13 111.50 110.15 89.47 107.81 114.80 49.60 967.95
        6423 Workers Compens 42.50 42.94 37.74 32.10 21.93 33.96 32.26 30.56 30.56 30.56 31.76 33.98 400.85
        6400 Total PAYROLL 2,469.69 2,479.47 2,583.77 2,755.36 1,789.92 2,830.61 2,691.19 2,547.43 2,604.91 2,616.19 2,681.14 2,799.18 30,848.86
    6500 TAXES INSURANCE
        6510 Real Estate Tax E 69,570.07 0.00 0.00 0.00 0.00 69,570.07 0.00 0.00 0.00 0.00 0.00 0.00 139,140.14
        6520 Insurance Expens 2,078.00 2,704.50 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 12,896.00
        6500 Total TAXES INSU 71,648.07 2,704.50 0.00 2,704.50 0.00 69,570.07 2,704.50 0.00 0.00 2,704.50 0.00 0.00 152,036.14
    6600 Property Manageme 9,575.44 8,381.70 2,117.03 2,117.03 2,117.03 3,378.66 5,901.92 3,378.66 2,179.79 2,000.00 3,829.06 2,117.03 47,093.35
    6650 Receiver Fees 6,625.00 6,125.00 0.00 0.00 6,875.00 0.00 7,062.50 8,375.00 0.00 0.00 8,875.00 0.00 43,937.50
    6700 GENERAL & ADMIN
        6710 PM / Work Order S 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 1,140.00
        6720 Postage / Messen 63.58 0.00 7.59 9.64 20.63 5.98 6.99 0.00 17.38 7.21 14.36 10.98 164.34
        6725 Office Supplies 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 148.88 148.88
        6735 Office Equipment 0.00 0.00 0.00 0.00 0.00 0.00 0.00 218.40 0.00 0.00 0.00 0.00 218.40
        6740 Telephone 21.33 0.00 11.54 15.00 21.12 8.76 9.77 0.00 13.19 11.96 3.14 7.88 123.69
        6760 Auto Mileage & Ex 100.44 0.00 68.75 140.24 104.14 61.29 142.59 29.00 56.04 0.00 23.14 0.00 725.63
        6770 Leasing & Maint. O 0.00 0.00 0.00 0.00 0.00 0.00 75.00 0.00 0.00 0.00 0.00 0.00 75.00
        6780 Bank Fees 129.45 0.00 0.00 105.91 87.62 0.00 53.61 0.00 120.92 56.46 77.49 79.74 711.20
        6700 Total GENERAL & 409.80 95.00 182.88 365.79 328.51 171.03 382.96 342.40 302.53 170.63 213.13 342.48 3,307.14
TOTAL EXPENSE 105,212.90 26,647.03 12,773.71 24,004.80 17,688.41 79,494.43 31,211.50 23,441.90 15,156.68 17,215.69 26,755.22 14,893.96 394,496.23
NOI 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 35,717.04 29,829.42 20,289.89 60,740.78 483,254.28
N/O EXPENSE
    7100 NON-OPERATING E
        7110 Lease Commissio 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 0.00 0.00 0.00 33,203.00
        7130 Professional Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,276.00 0.00 0.00 1,276.00
        7100 Total NON-OPER 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
TOTAL N/O EXPENSE 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
NET INCOME 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 2,514.04 28,553.42 20,289.89 60,740.78 448,775.28
Profit & Loss 12 Month Recap          05/02/17 11:13 AM Page 2 of rentmanager.com - property management systems   rev.12.180

At first glance the only visible text missing is the total number of pages in the footers of both pages.


As said by the OP in a comment

It seems same thing should be applied in deleteCharsInPath()

Indeed, deleteCharsInPath should also be changed to:

void deleteCharsInPath() {
    for (List<TextPosition> list : charactersByArticle) {
        List<TextPosition> toRemove = new ArrayList<>();
        for (TextPosition text : list) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            if (linePath.contains(start.getX(), start.getY())) {
                toRemove.add(text);
            }
        }
        if (toRemove.size() != 0) {
            System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
            list.removeAll(toRemove);
        }
    }
}

The OP presented another document in which even the PDFVisibleTextStripper as corrected above failed to properly recognize the visible characters.

The cause is another normalization by PDFBox text stripping moving the origin into the lower left corner of the crop box.

Patching the PDFVisibleTextStripper methods to add the lower left crop box coordinate values again results in a decent extraction of visible text.

Overriding processPage allows us to read the lower left crop box coordinates:

float lowerLeftX = 0;
float lowerLeftY = 0;

@Override
public void processPage(PDPage page) throws IOException {
    PDRectangle pageSize = page.getCropBox();

    lowerLeftX = pageSize.getLowerLeftX();
    lowerLeftY = pageSize.getLowerLeftY();

    super.processPage(page);
}

processTextPosition and deleteCharsInPath need to take these values into account:

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    Vector start = textMatrix.transform(new Vector(0, 0));

    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
        super.processTextPosition(text);
}

[...]

void deleteCharsInPath() {
    for (List<TextPosition> list : charactersByArticle) {
        List<TextPosition> toRemove = new ArrayList<>();
        for (TextPosition text : list) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            if (linePath.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY())) {
                toRemove.add(text);
            }
        }
        if (toRemove.size() != 0) {
            System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
            list.removeAll(toRemove);
        }
    }
}

Now the extraction result is ok for the new file, too. ;)

Question:

I am currently converting a text document to PDF and rendering it to the browser, and I cannot seem to keep the font. The font is Courier, but it gets changed to something else when the document is converted to a PDF. Is there an easy way to make it keep the default font? Or at least to set the font after converting? Here is the code:

public void downloadFile(HttpServletResponse response, List<Report> reports) throws IOException{
    OutputStream outputStream = response.getOutputStream();
    PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
    PDDocument documentToPrint = new PDDocument();
    for(Report report : reports){
        PDDocument pdDocument = new TextToPDF().createPDFFromText(new InputStreamReader(
                new FileInputStream(fileDirectory + File.separator + report.getFileLocation()), "UTF8")
        );
        pdfMergerUtility.appendDocument(documentToPrint, pdDocument);
    }
    pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

    response.setContentType("application/pdf");
    response.addHeader("Content-Disposition", "inline; filename=" + "download.pdf");
    documentToPrint.save(outputStream);
    documentToPrint.close();
}

I have also tried setting it like the following before appending the document.

    PDDocumentCatalog documentCatalog = pdDocument.getDocumentCatalog();
    PDResources pdResources = documentCatalog.getPages().get(i).getResources();
    pdResources.add(PDType1Font.COURIER);
    documentCatalog.getPages().get(i++).setResources(pdResources);

But that does not seem to work either.


Answer:

Because I have the font in the text document as Courier.

No, you don't: a plain text file carries no font information; editors merely display it in Courier by default. You have to set the font explicitly, because the default is Helvetica.

Change this:

PDDocument pdDocument = new TextToPDF().createPDFFromText(new InputStreamReader(....

to this:

TextToPDF textToPDF = new TextToPDF();
// set the font before creating the document; the default is Helvetica
textToPDF.setFont(PDType1Font.COURIER);
PDDocument pdDocument = textToPDF.createPDFFromText(new InputStreamReader(....

Question:

I am using the Apache PDFBox Java library to create PDFs. However, I am facing a problem rendering multi-line text (line wrap):

//Creating PDF document object 
PDDocument doc = new PDDocument();

//Adding the blank page to the document
doc.addPage( new PDPage() );

PDPage page = doc.getPage(0);
PDImageXObject pdImage = PDImageXObject.createFromFile("C:\\Users\\abc\\Desktop\\abc.png", doc);

PDPageContentStream contentStream = new PDPageContentStream(doc, page); 
contentStream.drawImage(pdImage, 10, 720);

//Begin the Content stream 
contentStream.beginText(); 
contentStream.newLineAtOffset(50, 735);

//Setting the font to the Content stream
contentStream.setFont( PDType1Font.TIMES_ROMAN, 16 );
contentStream.showText("ABC Management System");

//Setting the leading
//contentStream.setLeading(14.5f);

//Setting the position for the line
contentStream.newLineAtOffset(25, 600);

String text1 = "This is an example of adding text to a page in the pdf document we can add as many lines";
String text2 = "as we want like this using the ShowText()  method of the ContentStream class";

//Adding text in the form of string
contentStream.showText(text1);
contentStream.newLine();
contentStream.showText(text2);

//Ending the content stream
contentStream.endText();

System.out.println("Content added");

//Closing the content stream
contentStream.close();

//Saving the document
doc.save("my_doc.pdf");

System.out.println("PDF created");  

//Closing the document  
doc.close();

The issue I am facing is that the latter part of the text (text1, text2) is not rendered in the PDF file. Only the image and the first line, ABC Management System, are displayed in the PDF.

To generate multi-line texts, I referred: PDFBox - Adding Multiple Lines.

I do not understand what setLeading does and hence commented it out and tried again but the text was still not getting rendered.


Answer:

newLineAtOffset() is relative to the current text position, not to the page origin. To position from the origin again, the easiest way is to end the current text block and start a new one. Your current code moves to y = 735 + 600 = 1335 (or slightly lower, depending on the leading), which is above the top of the page.
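
The offset arithmetic can be sketched without PDFBox. The class below is a hypothetical stand-in (TextCursorSketch is not a PDFBox type); it only mimics how a text block accumulates offsets, assuming that beginText() resets the position to the origin and that newLine() moves down by the current leading:

```java
/** Hypothetical model of a PDF text block; not part of PDFBox. */
public class TextCursorSketch {
    private float x;
    private float y;
    private float leading; // 0 until setLeading() is called

    /** Models beginText(): the line start is reset to the origin. */
    public void beginText() { x = 0; y = 0; }

    /** Models newLineAtOffset(): moves relative to the current line start. */
    public void newLineAtOffset(float tx, float ty) { x += tx; y += ty; }

    /** Models setLeading(): the distance newLine() moves down. */
    public void setLeading(float leading) { this.leading = leading; }

    /** Models newLine(): moves down by the leading. */
    public void newLine() { y -= leading; }

    public float getX() { return x; }
    public float getY() { return y; }

    public static void main(String[] args) {
        TextCursorSketch c = new TextCursorSketch();

        // What the code in the question does: two offsets in one text block.
        c.beginText();
        c.newLineAtOffset(50, 735);
        c.newLineAtOffset(25, 600);
        System.out.println(c.getX() + ", " + c.getY()); // prints "75.0, 1335.0"

        // Ending the block and starting a new one resets the position.
        c.beginText();
        c.setLeading(14.5f);
        c.newLineAtOffset(25, 600);
        c.newLine();
        System.out.println(c.getX() + ", " + c.getY()); // prints "25.0, 585.5"
    }
}
```

Applied to the question's code, this would mean calling endText() after the heading and beginText() before newLineAtOffset(25, 600). It probably also means restoring the commented-out setLeading(14.5f) call: with a leading of 0, newLine() does not move down, so text2 would be drawn on top of text1.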

Question:

I am new to the PDFBox API. I would like to apply a style to a text annotation (AirPassengers) like the one marked with the red box below.

I am creating the text annotation as shown below.

PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_FREETEXT);

This results in a simple text annotation without any style or background color. I would like to achieve the style shown in the screenshot. Does anybody have an idea how to achieve this?


Answer:

Do this:

txtMark.setColor(new PDColor(new float[] { 0, 1, 1 }, PDDeviceRGB.INSTANCE));

This sets the color you mentioned (#00FFFF). PDF color components range from 0 to 1, not from 0 to 255. Be aware that the annotation will be visible in Adobe Reader, but at this time not in PDFBox rendering or PDF.js rendering, because the appearance stream is missing (see my comment in your previous question).
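
For other colors, a hex value can be converted to the 0 to 1 range by dividing each channel by 255. The following is a minimal sketch; UnitRgbSketch and toUnitRgb are hypothetical names, not part of PDFBox:

```java
import java.util.Arrays;

/** Hypothetical helper; not part of PDFBox. */
public class UnitRgbSketch {

    /** Converts a packed 0xRRGGBB value to red, green, blue components in the range 0..1. */
    public static float[] toUnitRgb(int rgb) {
        float r = ((rgb >> 16) & 0xFF) / 255f;
        float g = ((rgb >> 8) & 0xFF) / 255f;
        float b = (rgb & 0xFF) / 255f;
        return new float[] { r, g, b };
    }

    public static void main(String[] args) {
        // #00FFFF is the color from the question
        System.out.println(Arrays.toString(toUnitRgb(0x00FFFF))); // prints "[0.0, 1.0, 1.0]"
    }
}
```

The resulting array could then be passed as the first argument of new PDColor(..., PDDeviceRGB.INSTANCE), as in the line above.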

Question:

I am trying to create a PDF file using Apache PDFBox; the content is a plain text message of 80 characters per line. When I created the PDF, I noticed that the space, _ and other characters occupy different widths on the line, so I couldn't align them the way text editors do. Can somebody help me format them?

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDType1Font;


public class SampleTest {
    public static void main(String[] args) throws Exception {
        PDDocument document = new PDDocument();     
        PDPage page = new PDPage(PDRectangle.A4);
        document.addPage(page);     
        String text =   "---------------------------\n" +
                        "------------ABC------------\n" +
                        "A                         B\n" +
                        "***************************\n" +   
                        "A  B  C  D  E  F  G  H  I  \n" +
                        "---------------------------";
        String[] textArray = text.split("\n");

        PDPageContentStream contentStream = new PDPageContentStream(document, page);        
        contentStream.beginText();      
        contentStream.setFont( PDType1Font.TIMES_ROMAN, 10 );
        contentStream.setLeading(4f);
        contentStream.newLineAtOffset(40, 750);  

        for(String line : textArray){
            contentStream.showText(line);
            contentStream.newLine();
            contentStream.newLine();    
        }
        contentStream.endText();
        contentStream.close();

        document.save("C:\\PDF-Samples\\file.pdf");
        document.close();
        System.out.println("Done");
    }
}

Answer:

To do that you need to change the font to a fixed-width one (like PDType1Font.COURIER).

Some reading: https://en.wikipedia.org/wiki/Monospaced_font


To compute the text width, you can use:

// result in unit of text space, suitable to use with moveTextPositionByAmount()
(font.getStringWidth(text) / 1000.0f) * (1.0f * fontSize) // font is a PDFont
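
With a fixed-width font the same formula reduces to simple arithmetic, because every glyph has the same advance. The sketch below does not call PDFBox; it assumes the standard Courier advance of 600 units per 1000 and an A4 page width of roughly 595.276 points, and the class and method names are hypothetical:

```java
/** Hypothetical helper; mirrors (getStringWidth / 1000) * fontSize for a monospaced font. */
public class MonospaceWidthSketch {

    /** Assumed advance width of every Courier glyph, in 1/1000 text-space units. */
    static final float COURIER_GLYPH_WIDTH = 600f;

    /** Width of a line of charCount characters at the given font size, in points. */
    public static float lineWidth(int charCount, float fontSize) {
        return (charCount * COURIER_GLYPH_WIDTH / 1000f) * fontSize;
    }

    /** True if the line fits between equal left and right margins on a page of the given width. */
    public static boolean fits(int charCount, float fontSize, float pageWidth, float margin) {
        return lineWidth(charCount, fontSize) <= pageWidth - 2 * margin;
    }

    public static void main(String[] args) {
        float a4Width = 595.276f; // approximate width of PDRectangle.A4 in points
        System.out.println(lineWidth(80, 10f));            // prints "480.0"
        System.out.println(fits(80, 10f, a4Width, 40f));   // prints "true"
        System.out.println(fits(80, 12f, a4Width, 40f));   // prints "false"
    }
}
```

Under these assumptions, an 80-character line in 10 pt Courier fits on an A4 page with the 40 pt left offset used in the question, while 12 pt would not.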

Question:

I've located a region of interest in the page by tracking TextPosition objects using PDFTextStripper as shown in the example: https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java

As shown, the TextPosition has been retrieved from fields like text.getXDirAdj(), text.getWidthDirAdj(), text.getYDirAdj(), text.getHeightDir() .

From this example I tried to keep everything else the same except setting the cropBox of the target page.

https://github.com/apache/pdfbox/blob/2.0.3/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java

OLD CROPBOX: [0.0,0.0,595.276,841.89] -> NEW CROPBOX [50.0,42.0,592.0,642.0].

So how can I use getYDirAdj and getXDirAdj to set the crop box correctly?

The original pdf file I'm processing can be downloaded from here: http://downloadcenter.samsung.com/content/UM/201504/20150407095631744/ENG-US_NMATSCJ-1.103-0330.pdf


Answer:

Cropping the page

In a comment the OP reduced his problem to

Ok. Given a java PDRectangle rect = new PDRectangle(40f, 680f, 510f, 100f) obtained from TextLocation how would a java code snippet, that sets the cropBox of a single page look like ? Or how would you do it? TextLocation based rect --> some transformation --> setCropBox(theRightBox).

To set the crop box of page twelve of the given document to the given PDRectangle, you can use code like this:

PDDocument pdDocument = PDDocument.load(resource);
// pages are zero-based, so page twelve has index 11
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));
pdDocument.save(new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.pdf"));
pdDocument.close();

(SetCropBox.java test method testSetCropBoxENG_US_NMATSCJ_1_103_0330)

Adobe Reader now shows merely this part of page twelve:

Beware, though: the page in question specifies not only a media box (mandatory) and a crop box but also a bleed box and an art box. Thus, applications that consider those boxes more interesting than the crop box might display the page differently. In particular, some applications might consider the art box (defined as "the extent of the page’s meaningful content") important.
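
To relate the constructor arguments used above to the bracket notation from the question: the four-argument PDRectangle constructor takes x, y, width and height, whereas the printed form [a,b,c,d] lists the lower-left and upper-right corners. The sketch below redoes that arithmetic in plain Java; CropBoxSketch and its methods are hypothetical helpers, and the old crop box from the question is assumed to be the outer bound:

```java
import java.util.Arrays;

/** Hypothetical helper; plain arithmetic, no PDFBox types involved. */
public class CropBoxSketch {

    /** Returns {lowerLeftX, lowerLeftY, upperRightX, upperRightY} for a box given as x, y, width, height. */
    public static float[] corners(float x, float y, float width, float height) {
        return new float[] { x, y, x + width, y + height };
    }

    /** True if the inner box lies entirely within the outer box; both are corner arrays. */
    public static boolean contains(float[] outer, float[] inner) {
        return inner[0] >= outer[0] && inner[1] >= outer[1]
            && inner[2] <= outer[2] && inner[3] <= outer[3];
    }

    public static void main(String[] args) {
        float[] oldCropBox = { 0f, 0f, 595.276f, 841.89f }; // from the question
        float[] newCropBox = corners(40f, 680f, 510f, 100f);
        System.out.println(Arrays.toString(newCropBox)); // prints "[40.0, 680.0, 550.0, 780.0]"
        System.out.println(contains(oldCropBox, newCropBox)); // prints "true"
    }
}
```

So the rectangle used above corresponds to a strip near the top of the page, 100 points high, written as [40.0,680.0,550.0,780.0] in corner notation.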

Rendering the cropped page

In a comment to this answer the OP remarked

This is good and works. It correctly saves the page in the PDF file. I've tried to do the same in JPG and failed.

I reduced the OP's code to the essentials

PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));

PDFRenderer renderer = new PDFRenderer(pdDocument);
BufferedImage img = renderer.renderImage(12 - 1, 4f);
ImageIOUtil.writeImage(img, new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.jpg").getAbsolutePath(), 300);
pdDocument.close();

(SetCropBox.java test method testSetCropBoxImgENG_US_NMATSCJ_1_103_0330)

The result:

Thus, I cannot reproduce an issue here.


Possible details to check for:

  • ImageIOUtil is not part of the main PDFBox artifact, instead it is located in pdfbox-tools; does the version of that artifact match the version of the core pdfbox artifact?
  • I run the code in an Oracle Java 8 environment; other Java environments might give rise to different results.
  • There are minor differences in our implementations. E.g. I load the PDF via an InputStream, you directly from file system, I have hardcoded the page number, you have it in some variable, ... None of these differences should cause your problem, but who knows...

Question:

I want to verify a PDF document using TestNG and PDFBox.

I would like to ask whether PDFBox can check that a PDF contains a given text, something like this:

PDFParser parser =  new PDFParser(stream);
parser.getDocument().contains("ABC")

Answer:

Try the code below:

  public void ReadPDF() throws Exception {
    URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");

    BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
    PDFParser TestPDF = new PDFParser(TestFile);
    TestPDF.parse();
    PDDocument TestDocument = TestPDF.getPDDocument();
    try {
        // extract the whole text and check that it contains the expected phrase
        String TestText = new PDFTextStripper().getText(TestDocument);
        Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
    } finally {
        // release the document and the stream even if the assertion fails
        TestDocument.close();
        TestFile.close();
    }

    }

Download the libraries: https://pdfbox.apache.org/index.html

Question:

I'm trying to extract some infos from a set of PDFs. This works so far, but one PDF gives me grievances.

I'm using PDFBox 1.8.8, with Java 7.

PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));

It just prints

File: /foo/bar/mypdf.pdf readable: true size: 1267743

Then it terminates. Usually I use the writeText method and funnel the text through a stream, but above code was used for simplification. I've tried converting the PDF with pdftotext - it works just like the others.

I get no exception, no nothing. Any ideas?

EDIT: Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5

Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer

EDIT2:

Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:

Encrypted:      yes (print:yes copy:no change:no addNotes:no algorithm:AES)

Any way to circumvent that? After all, pdftotext could get the text out.


Answer:

The document was "encrypted" (write protected), but with no user password set. This Stack Overflow answer shows how you can remove the encryption and simply read the file: remove encryption from pdf with pdfbox, like qpdf.

Question:

I have a PDF document whose font is an OpenType font (Garamond OpenType). As a result, PDFBox text extraction can also extract special characters (for example small capital letters), which caused problems when the underlying font was a simple Type1 font.

However, the text extraction now causes another type of problem. In my case, when the character sequences "fi" or "fl" occur in the text, PDFTextStripper#getText(PDDocument doc) extracts them as single characters (the ligatures 'ﬁ' and 'ﬂ') and puts a space character on their right side.

(Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper or via the PDFTextStripper#processTextPosition(TextPosition pos) method, the same characters show up as normal single characters f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ (which might have to do with the fact that the getText() method derives things like whitespace characters from distances and positional placement).

Background: The given document is a wordbook text with very dense printed text.

My question: is there anything what I can do to avoid this problem?

thanks in advance ...

EDIT:

@mkl Here is a simple code-snippet (for testing-purposes only):

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Vector;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PDFTextStripperOpenFontTest extends PDFTextStripper {

    public PDFTextStripperOpenFontTest(String encoding) throws IOException {
        super(encoding);
    }

    public static void main(String[] args) throws IOException {
        String fileName = "filename";
        PDFTextStripperOpenFontTest stripper = new PDFTextStripperOpenFontTest("UTF-8");
        stripper.setSortByPosition(false);
        PDDocument doc = PDDocument.load(new File(fileName));
        stripper.setStartPage(1);
        stripper.setEndPage(1);
        // extract text via getText(doc)
        String text1 = stripper.getText(doc);
        text1 = text1.replaceAll("\\r|\\n", "");
        System.out.println(text1);
        // extract text by concatenating the TextPosition objects in charactersByArticle
        StringBuilder sb = new StringBuilder();
        Vector<List<TextPosition>> list = stripper.getCharactersByArticle();
        for (List<TextPosition> list2 : list) {
            for (TextPosition textPosition : list2) {
                sb.append(textPosition.getCharacter());
            }
        }
        System.out.println(sb.toString());
        doc.close();
    }
}

And here is the output for the test file (the differences begin in the second half of the page):

des Verhältnisses von Individuum und Gesellschaft habe (vgl. Holzkamp-Osterkamp 1976, Kap. 5.2). Er schlug allerdings vor, Sigmund Freuds persön-lichkeitstheoretische Konzeptionen »der Abwehr-vorgänge, der Angst und des Unbewussten« mit historisch bestimmten gesellschaftlichen Verhältnis-sen zu vermitteln (255f) und sie subjektwissenschaft-lich zu reinterpretieren. »Freuds Prämissen von der  genuinen Unvereinbarkeit subjektiver Lebensan-sprüche mit gesellschaftlichen Anforderungen« sind für Holzkamp mehr als eine »falsche Universali-sierung bürgerlich-kapitalistischer Verhältnisse«, sie brächten auch »bestimmte Aspekte der subjektiven Situation der Menschen unter diesen Verhältnissen […] differenziert und schonungslos« auf den Begriff (1984, 33). Die Auffassung, dass die ich-psycholo-gische »›Soziologisierung‹« die »Schärfe und Uner-bittlichkeit« der Psychoanalyse »verkleistert« (ebd.), teilt er mit Adorno, der darin deren »Kastrierung« behauptet (1952, 25).3. Die Entwicklung einer eigenständigen KP begann mit dem Problem, allein auf der Basis der marx-schen KrpÖ verschiedene Ansätze und Befunde der »bürgerlichen« Psychologie nicht »differenziell«, d.h. nach ihrem jeweiligen Verhältnis von Erkennt-nismöglichkeiten und -grenzen beurteilen und somit nicht kritisch-psychologisch aufheben zu können (Maiers 1979). »Positive Ergebnisse über die empi-rische Subjektivität des Menschen in der bürgerlichen Gesellschaft« (Holzkamp 1978, 249) versprachen zwei Wege historisch-rekonstruktiver Forschung: wissenschaftsbezogen zu untersuchen, wie es dazu kam, dass im 19. Jh. »empirische Subjektivität« auf eine Weise problematisch wurde, die zur Entstehung der Einzelwissenschaft »Psychologie« führte (1973, 45); und gegenstandsbezogen die Grundlagen der menschlichen Subjektivität von der Entstehung des Psychischen über das Tier-Mensch-Übergangsfeld bis zur Existenz in kapitalistischen Verhältnissen zu analysieren (46f). Während der erste Weg im Umkreis der KP v.a. 
von Siegfried Jaeger und Irmingard Staeuble (1978) verfolgt wurde, konzentrierte sich der Kreis um Holzkamp zunächst auf den zweiten. Holzkamp resümiert später, dass »die Unterscheidung zwischen wissenschafts- und gegenstandsbezogener Kategorialanalyse nur ›aspekthafter‹ Natur ist« (1983, 37). Problematisch bleibt, inwieweit die Begriffe, von denen auch die Rekonstruktion des Psychischen ihren Ausgang nehmen muss, diese formieren (Fries 2011). Insofern müssen bei einer »vollständigen his-torischen Analyse« die betreffenden psychologischen Konzepte »›im Schnittpunkt‹« beider »Entwicklungs-züge« begriffen werden (Holzkamp 1973, 47).4. Der Anspruch der historischen Rekonstruktion des Psychischen ist es, die auf einen vorparadigma-tischen (Métraux 1981; Graumann 1994) Zustand der Psychologie verweisende »Beliebigkeit« (Holz-kamp 1977b) der Begriffsbildung zu überwinden und die Psychologie kategorial neu zu begründen. Termi-nologisch ist es schwierig, dass in der KP die Bezeich-nung wissenschaftlich ausgewiesener Grundbegriffe als »Kategorien« (programmatisch Holzkamp 1983, 19 u. 27) im Gegensatz zur Verwendungsweise bei Marx steht, der damit gerade Alltagsvorstellun-gen meint (vgl. W.F.Haug 2008), die wiederum bei Holzkamp als »Vorbegriffe« (1983, 48ff u. 515ff) bezeichnet werden.4.1 In einer Spezifi zierung des logisch-historischen Herangehens ging es der KP darum, das Verhältnis von Natur-, Gesellschafts- und Individualgeschichte begriffl ich so aufzuklären, dass dem entwicklungs-geschichtlich Früheren das begriffl ich Allgemeinere und dem entwicklungsgeschichtlich Späteren das begriffl ich Spezifi schere entsprechen soll. Dies als eine Grundlage dafür, sowohl Anthropomorphisie-rungen tierischen Verhaltens als auch Biologisierun-gen gesellschaftlicher Verhältnisse bzw. menschlichen Handelns und Erlebens und damit Universalisierun-gen historisch spezifi scher Ausdrucksformen des Psychischen zu vermeiden. 
Die KP schloss an Arbei-ten der »Kulturhistorischen Schule« an, bes. an  Alexej N. Leontjews 1959 veröffentlichtes »historisches Herangehen an die Untersuchungen der mensch-lichen Psyche« (1973, 262ff; vgl. Holzkamp 1983, 47). Gegen dessen Durchführung wandten Holz-kamp und Volker Schurig (1973, XLVI) jedoch ein, dass der Gedanke der Historizität des Psychischen in Leontjews »Forschungsarbeit nicht überall mit gleicher Entschiedenheit Berücksichtigung« gefun-den habe, die Gesellschaftlichkeit des Menschen zwar allgemein bestimmt, nicht aber formationsspe-zifi sch auf gesellschaftliche Widersprüche in der SU konkretisiert worden sei. – In der KP bestanden mit Blick auf die DDR, der Holzkamp eine »systembe-dingte Konvergenz zwischen allgemeinen und indi-viduellen Interessen« (1983, 382) attestierte, ähnliche Neigungen zur Widerspruchseliminierung, die von der Hoffnung auf dortige Publikationszustimmung genährt waren (vgl. Markard 2009a, 199f).4.2 Soweit sich die Kategorialanalyse auf biologi-sche Evolutionsprozesse bezieht, wird sie als »funk-tional-historisch« charakterisiert (Maiers 1999). Sie zielt auf die Rekonstruktion von Widersprüchen in Organismus-Umwelt-Konstellationen, aus denen Entwicklungen und neue Qualitäten in ihrer biolo-gischen Funktionalität begreifbar werden – bezogen auf jene Evolutionsreihe, die zum Menschen hin-führt, unter dem Gesichtspunkt der Entstehung und Kritische Psychologie 171 172
des Verhältnisses von Individuum und Gesellschaft habe (vgl. Holzkamp-Osterkamp 1976, Kap. 5.2). Er schlug allerdings vor, Sigmund Freuds persön-lichkeitstheoretische Konzeptionen »der Abwehr-vorgänge, der Angst und des Unbewussten« mit historisch bestimmten gesellschaftlichen Verhältnis-sen zu vermitteln (255f) und sie subjektwissenschaft-lich zu reinterpretieren. »Freuds Prämissen von der  genuinen Unvereinbarkeit subjektiver Lebensan-sprüche mit gesellschaftlichen Anforderungen« sind für Holzkamp mehr als eine »falsche Universali-sierung bürgerlich-kapitalistischer Verhältnisse«, sie brächten auch »bestimmte Aspekte der subjektiven Situation der Menschen unter diesen Verhältnissen […] differenziert und schonungslos« auf den Begriff (1984, 33). Die Auffassung, dass die ich-psycholo-gische »›Soziologisierung‹« die »Schärfe und Uner-bittlichkeit« der Psychoanalyse »verkleistert« (ebd.), teilt er mit Adorno, der darin deren »Kastrierung« behauptet (1952, 25).3. Die Entwicklung einer eigenständigen KP begann mit dem Problem, allein auf der Basis der marx-schen KrpÖ verschiedene Ansätze und Befunde der »bürgerlichen« Psychologie nicht »differenziell«, d.h. nach ihrem jeweiligen Verhältnis von Erkennt-nismöglichkeiten und -grenzen beurteilen und somit nicht kritisch-psychologisch aufheben zu können (Maiers 1979). »Positive Ergebnisse über die empi-rische Subjektivität des Menschen in der bürgerlichen Gesellschaft« (Holzkamp 1978, 249) versprachen zwei Wege historisch-rekonstruktiver Forschung: wissenschaftsbezogen zu untersuchen, wie es dazu kam, dass im 19. Jh. »empirische Subjektivität« auf eine Weise problematisch wurde, die zur Entstehung der Einzelwissenschaft »Psychologie« führte (1973, 45); und gegenstandsbezogen die Grundlagen der menschlichen Subjektivität von der Entstehung des Psychischen über das Tier-Mensch-Übergangsfeld bis zur Existenz in kapitalistischen Verhältnissen zu analysieren (46f). Während der erste Weg im Umkreis der KP v.a. 
von Siegfried Jaeger und Irmingard Staeuble (1978) verfolgt wurde, konzentrierte sich der Kreis um Holzkamp zunächst auf den zweiten. Holzkamp resümiert später, dass »die Unterscheidung zwischen wissenschafts- und gegenstandsbezogener Kategorialanalyse nur ›aspekthafter‹ Natur ist« (1983, 37). Problematisch bleibt, inwieweit die Begriffe, von denen auch die Rekonstruktion des Psychischen ihren Ausgang nehmen muss, diese formieren (Fries 2011). Insofern müssen bei einer »vollständigen his-torischen Analyse« die betreffenden psychologischen Konzepte »›im Schnittpunkt‹« beider »Entwicklungs-züge« begriffen werden (Holzkamp 1973, 47).4. Der Anspruch der historischen Rekonstruktion des Psychischen ist es, die auf einen vorparadigma-tischen (Métraux 1981; Graumann 1994) Zustand der Psychologie verweisende »Beliebigkeit« (Holz-kamp 1977b) der Begriffsbildung zu überwinden und die Psychologie kategorial neu zu begründen. Termi-nologisch ist es schwierig, dass in der KP die Bezeich-nung wissenschaftlich ausgewiesener Grundbegriffe als »Kategorien« (programmatisch Holzkamp 1983, 19 u. 27) im Gegensatz zur Verwendungsweise bei Marx steht, der damit gerade Alltagsvorstellun-gen meint (vgl. W.F.Haug 2008), die wiederum bei Holzkamp als »Vorbegriffe« (1983, 48ff u. 515ff) bezeichnet werden.4.1 In einer Spezifi zierung des logisch-historischen Herangehens ging es der KP darum, das Verhältnis von Natur-, Gesellschafts- und Individualgeschichte begriffl ich so aufzuklären, dass dem entwicklungs-geschichtlich Früheren das begriffl ich Allgemeinere und dem entwicklungsgeschichtlich Späteren das begriffl ich Spezifi schere entsprechen soll. Dies als eine Grundlage dafür, sowohl Anthropomorphisie-rungen tierischen Verhaltens als auch Biologisierun-gen gesellschaftlicher Verhältnisse bzw. menschlichen Handelns und Erlebens und damit Universalisierun-gen historisch spezifi scher Ausdrucksformen des Psychischen zu vermeiden. 
Die KP schloss an Arbei-ten der »Kulturhistorischen Schule« an, bes. an  Alexej N. Leontjews 1959 veröffentlichtes »historisches Herangehen an die Untersuchungen der mensch-lichen Psyche« (1973, 262ff; vgl. Holzkamp 1983, 47). Gegen dessen Durchführung wandten Holz-kamp und Volker Schurig (1973, XLVI) jedoch ein, dass der Gedanke der Historizität des Psychischen in Leontjews »Forschungsarbeit nicht überall mit gleicher Entschiedenheit Berücksichtigung« gefun-den habe, die Gesellschaftlichkeit des Menschen zwar allgemein bestimmt, nicht aber formationsspe-zifi sch auf gesellschaftliche Widersprüche in der SU konkretisiert worden sei. – In der KP bestanden mit Blick auf die DDR, der Holzkamp eine »systembe-dingte Konvergenz zwischen allgemeinen und indi-viduellen Interessen« (1983, 382) attestierte, ähnliche Neigungen zur Widerspruchseliminierung, die von der Hoffnung auf dortige Publikationszustimmung genährt waren (vgl. Markard 2009a, 199f).4.2 Soweit sich die Kategorialanalyse auf biologi-sche Evolutionsprozesse bezieht, wird sie als »funk-tional-historisch« charakterisiert (Maiers 1999). Sie zielt auf die Rekonstruktion von Widersprüchen in Organismus-Umwelt-Konstellationen, aus denen Entwicklungen und neue Qualitäten in ihrer biolo-gischen Funktionalität begreifbar werden – bezogen auf jene Evolutionsreihe, die zum Menschen hin-führt, unter dem Gesichtspunkt der Entstehung und Kritische Psychologie 171 172

You can find the PDF file used for this test under this link: https://issues.apache.org/jira/browse/PDFBOX-2548.

I added a second test page, from a former volume of the same wordbook. For this volume, a Type1 font has been used. I chose a page where the two words "begrifflich" and "spezifisch" occur (they cause problems, as you can see in the first test). As you can see/test, the described error doesn't occur when extracting the text of this second page! This strengthens my assumption that the OpenType format is the reason for the error.

Here is the second output:

ehrenamtliche Arbeit, Eigenarbeit, Elend, Exklusion,Gemeinwesen, Genossenschaft, Kommunalpolitik, Kom-munitarismus, Nachbarschaftsbewegung, neue sozialeBewegungen, Owenismus, Selbstverwaltung, Sozialarbeit,Sozialfürsorge, Sozialpolitik, Sozialstaat, Wohlfahrtsstaat,WohnungsfrageGemeinwirtschaftA: al-iqisäd at-ta’äwuni.– E: communal economy . –F: économie communautaire. –R: kooperativnoye khoziaysrvo. –S: economía comunitaria. – C: jiti jingjiDie Konzeption der ›G‹ ist nicht originär sozialistisch.Sie wurde erst nachträglich von der Arbeiterbewegungbegrifflich adaptiert und im Sinne eines nicht-kapita-listischen Wirtschaftssektors (v.a. gewerkschaftseigene,genossenschaftliche und öffentliche Unternehmen),der primär bedarfsorientiert und demokratisch kon-trolliert arbeiten sollte, in reformsozialistische Strate-gien integriert.In jeweils unterschiedlichen Formen und Strukturenhat sich seit Ende des 19. Jh. in den verschiedeneneuropäischen Ländern ein Unternehmenssektor mitnicht-kapitalistischen Eigentümern herausgebildet,in Gestalt staatlicher bzw. kommunaler und nicht-staatlicher Unternehmen. In Großbritannien entstandunter Einfluss des Munizipalsozialismus hauptsäch-lich ein kommunal-gemeinwirtschaftlicher Sektor;in Österreich versteht man unter G vorwiegend den(zentralisierten) Staatssektor. In Deutschland da-gegen hat die gewerkschaftlich organisierte Arbei-terbewegung selbst einen sog. freigemeinwirtschaft-lichen Sektor gefördert und zeitweilig aufgebaut,dessen Besonderheit eine Identifikation mit demenglischen ›social sector‹ oder dem französischen›secteur public‹ verbietet. So bezeichnet der Aus-druck im engeren Sinne eine spezifische, vornehmlichin Deutschland nach dem Ersten Weltkrieg geschaffeneRealität.1. 
Die Klassiker der deutschen G-Lehre (EugenDÜHRING, Albert SCHÄFFLE, Adolph WAGNER) gingendavon aus, dass in kapitalistischen Gesellschaften –um die Funktionsfähigkeit des marktwirtschaftlichenSystems zu gewährleisten – neben dem privatwirt-schaftlichen Sektor ein mehr oder weniger ausgedehn-ter gemeinwirtschaftlicher Sektor bestehen muss, dernicht profitorientiert ist, was Hans RITSCHL späterals »dualistische Wirtschaftsordnung« charakterisiert(1931). Die G umfasst für diese Tradition nicht nurstaatliche bzw. kommunale, sondern auch ›frei-gemeinwirtschaftliche‹ Unternehmen privater Trägerund Genossenschaften. Die klassische G-Lehre isteine harmonistische Konzeption (THIEMEYER 1980),die ein ›organisches‹ Nebeneinander von Privat- undGemeinwirtschaft anstrebt. Im System zunehmenderDominanz der Großindustrie und bei einer wach-senden und kämpferischen Arbeiterbewegung sollennicht-profitorientierte staatssozialistische Betriebefür sozialen Ausgleich sorgen und zur politischenEntspannung beitragen (SCHÄFFLE 1875).2. Bei MARX und ENGELS lassen sich nur wenige kon-zeptionelle Bezüge zur G-Lehre herstellen. Heftigpolemisiert Engels gegen DÜHRINGS auf die G-Kon-zeption zugeschnittene »Wirtschaftskommunen« (AD,MEW 20, 268ff). 
Bis zu dem Zeitpunkt, wo unterMarxisten reformistische Auffassungen an Bodengewinnen, besteht ein zentraler Gegensatz zwischenMarxismus und G-Lehre: den Klassikern des Marxis-mus kann in diesem Kontext der quasi umfassendsteG-Anspruch zugeschrieben werden (KÜHNE 1978),nämlich auf gesamtgesellschaftliche Verwirklichungeiner nicht-kapitalistischen G.Einzelne Formen der G werden von MARX undENGELS dem Inhalt nach angesprochen, etwa öffent-liche Unternehmen: so wird in den Forderungen derKommunistischen Partei in Deutschland 1848 dieVerstaatlichung des Transport-, Verkehrs- und Kom-munikationswesens verlangt, verbunden mit dempreispolitischen Konzept des ›Nulltarifs‹: die betref-fenden öffentlichen Dienstleistungen sollen »derunbemittelten Klasse zur unentgeltlichen Verfügunggestellt« werden (MEW 5, 4; vgl. 4, 373f).Die meisten der für die G-Thematik relevantenAusführungen beziehen sich auf genossenschaftlicharbeitende »Kooperativfabriken«. MARX widmet sichder Problematik von isolierten oder sektoral begrenz-ten Genossenschaftsunternehmen, die den privat-wirtschaftlichen ›Eigennutz‹ zunächst nur im Innen-verhältnis der Kooperativ-Mitglieder überwindenkönnen. Ein umfassendes und gesamtgesellschaftlichrationales System freier Kooperativarbeit hingegenlässt sich Marx zufolge nicht ohne gesellschaftlicheVeränderungen mit dem Übergang der Staatsmachtin die Hände der Produzenten selbst verwirklichen(Inauguraladresse, MEW 16, 11, 195f; Gotha, MEW19, 13-32). ENGELS spricht sich in späteren Jahrenmehrfach für genossenschaftliche Produktionsbe-triebe aus (z.B. MEW 36, 261, 426), freilich nicht imSinne einer harmonistischen G-Konzeption, bei der
ehrenamtliche Arbeit, Eigenarbeit, Elend, Exklusion,Gemeinwesen, Genossenschaft, Kommunalpolitik, Kom-munitarismus, Nachbarschaftsbewegung, neue sozialeBewegungen, Owenismus, Selbstverwaltung, Sozialarbeit,Sozialfürsorge, Sozialpolitik, Sozialstaat, Wohlfahrtsstaat,WohnungsfrageGemeinwirtschaftA: al-iqisäd at-ta’äwuni.– E: communal economy . –F: économie communautaire. –R: kooperativnoye khoziaysrvo. –S: economía comunitaria. – C: jiti jingjiDie Konzeption der ›G‹ ist nicht originär sozialistisch.Sie wurde erst nachträglich von der Arbeiterbewegungbegrifflich adaptiert und im Sinne eines nicht-kapita-listischen Wirtschaftssektors (v.a. gewerkschaftseigene,genossenschaftliche und öffentliche Unternehmen),der primär bedarfsorientiert und demokratisch kon-trolliert arbeiten sollte, in reformsozialistische Strate-gien integriert.In jeweils unterschiedlichen Formen und Strukturenhat sich seit Ende des 19. Jh. in den verschiedeneneuropäischen Ländern ein Unternehmenssektor mitnicht-kapitalistischen Eigentümern herausgebildet,in Gestalt staatlicher bzw. kommunaler und nicht-staatlicher Unternehmen. In Großbritannien entstandunter Einfluss des Munizipalsozialismus hauptsäch-lich ein kommunal-gemeinwirtschaftlicher Sektor;in Österreich versteht man unter G vorwiegend den(zentralisierten) Staatssektor. In Deutschland da-gegen hat die gewerkschaftlich organisierte Arbei-terbewegung selbst einen sog. freigemeinwirtschaft-lichen Sektor gefördert und zeitweilig aufgebaut,dessen Besonderheit eine Identifikation mit demenglischen ›social sector‹ oder dem französischen›secteur public‹ verbietet. So bezeichnet der Aus-druck im engeren Sinne eine spezifische, vornehmlichin Deutschland nach dem Ersten Weltkrieg geschaffeneRealität.1. 
Die Klassiker der deutschen G-Lehre (EugenDÜHRING, Albert SCHÄFFLE, Adolph WAGNER) gingendavon aus, dass in kapitalistischen Gesellschaften –um die Funktionsfähigkeit des marktwirtschaftlichenSystems zu gewährleisten – neben dem privatwirt-schaftlichen Sektor ein mehr oder weniger ausgedehn-ter gemeinwirtschaftlicher Sektor bestehen muss, dernicht profitorientiert ist, was Hans RITSCHL späterals »dualistische Wirtschaftsordnung« charakterisiert(1931). Die G umfasst für diese Tradition nicht nurstaatliche bzw. kommunale, sondern auch ›frei-gemeinwirtschaftliche‹ Unternehmen privater Trägerund Genossenschaften. Die klassische G-Lehre isteine harmonistische Konzeption (THIEMEYER 1980),die ein ›organisches‹ Nebeneinander von Privat- undGemeinwirtschaft anstrebt. Im System zunehmenderDominanz der Großindustrie und bei einer wach-senden und kämpferischen Arbeiterbewegung sollennicht-profitorientierte staatssozialistische Betriebefür sozialen Ausgleich sorgen und zur politischenEntspannung beitragen (SCHÄFFLE 1875).2. Bei MARX und ENGELS lassen sich nur wenige kon-zeptionelle Bezüge zur G-Lehre herstellen. Heftigpolemisiert Engels gegen DÜHRINGS auf die G-Kon-zeption zugeschnittene »Wirtschaftskommunen« (AD,MEW 20, 268ff). 
Bis zu dem Zeitpunkt, wo unterMarxisten reformistische Auffassungen an Bodengewinnen, besteht ein zentraler Gegensatz zwischenMarxismus und G-Lehre: den Klassikern des Marxis-mus kann in diesem Kontext der quasi umfassendsteG-Anspruch zugeschrieben werden (KÜHNE 1978),nämlich auf gesamtgesellschaftliche Verwirklichungeiner nicht-kapitalistischen G.Einzelne Formen der G werden von MARX undENGELS dem Inhalt nach angesprochen, etwa öffent-liche Unternehmen: so wird in den Forderungen derKommunistischen Partei in Deutschland 1848 dieVerstaatlichung des Transport-, Verkehrs- und Kom-munikationswesens verlangt, verbunden mit dempreispolitischen Konzept des ›Nulltarifs‹: die betref-fenden öffentlichen Dienstleistungen sollen »derunbemittelten Klasse zur unentgeltlichen Verfügunggestellt« werden (MEW 5, 4; vgl. 4, 373f).Die meisten der für die G-Thematik relevantenAusführungen beziehen sich auf genossenschaftlicharbeitende »Kooperativfabriken«. MARX widmet sichder Problematik von isolierten oder sektoral begrenz-ten Genossenschaftsunternehmen, die den privat-wirtschaftlichen ›Eigennutz‹ zunächst nur im Innen-verhältnis der Kooperativ-Mitglieder überwindenkönnen. Ein umfassendes und gesamtgesellschaftlichrationales System freier Kooperativarbeit hingegenlässt sich Marx zufolge nicht ohne gesellschaftlicheVeränderungen mit dem Übergang der Staatsmachtin die Hände der Produzenten selbst verwirklichen(Inauguraladresse, MEW 16, 11, 195f; Gotha, MEW19, 13-32). ENGELS spricht sich in späteren Jahrenmehrfach für genossenschaftliche Produktionsbe-triebe aus (z.B. MEW 36, 261, 426), freilich nicht imSinne einer harmonistischen G-Konzeption, bei der

... so far


Answer:

This is a bit late, as the PDFBox issue PDFBOX-2548 opened in parallel has already explained quite a bit, but here is a wrap-up:

To sum it up: The creation process of the first sample PDF used ligatures followed by actual space characters and insertion point movements. Whether this is due to programs in this process not properly supporting OpenType fonts (as the OP assumes) or some other weakness, cannot be decided based on the output PDFs alone.

To fix this issue, you should fix the document creation process. If that is not possible, you can try to enhance PDFBox to understand the weirdness created by that process but this is non-trivial and/or error-prone.

In detail:

When the character sequences "fi" or "fl" occur in the text, PDFTextStripper#getText(PDDocument doc) extracts them as single characters, 'ﬁ' and 'ﬂ', and sets a space character on their right side.

A section of the first sample document test.pdf containing samples for both is this

with Spezifizierung and begrifflich.

The first thing to do in case of text extraction troubles usually is copying&pasting from Adobe Reader as that software is quite proficient in text extraction matters. We get:

In einer Spezifi zierung des
...
begriffl ich so aufzuklären

So Adobe Reader does extract these unwanted space characters, too! This usually means that there is some issue in the PDF, not in the text extractor.

Looking at the operations drawing the text we see:

[(4.1 In einer Spezifi)305.505( )-20.3063(zierung des logisch-historischen )]TJ 

This means that after drawing "Spezifi" the text insertion point is moved back by 0.3 text space units, then a space character is drawn, then the text insertion point is moved forward by 0.02 text space units, and then the rest of the line is drawn.

(begriffl)Tj
24.1273 0 Td
1.9264 Tw
[( )168.494(ich so aufzuklären, dass dem entwicklungs-)]TJ 

Similarly here, after drawing "begriffl" the text insertion point is moved forward by 0.024 text space units, then the word spacing is changed to nearly 2 unscaled text space units (i.e. the width of a space character is increased by that amount), then a space character is drawn, then the text insertion point is moved backwards by 0.17 text space units, and then the rest of the line is drawn.

Thus, in both cases there indeed is a single ligature glyph followed by a space character, just as PDFBox returns.
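
The arithmetic behind these readings of the TJ arrays can be sketched in a few lines of Java. In a TJ array each number is expressed in thousandths of a text space unit and is subtracted from the current horizontal position, so a positive number moves the insertion point back and a negative one moves it forward. The class and method names are made up for this illustration and are not part of PDFBox; font size and horizontal scaling are deliberately left out.

```java
class TjAdjustment {
    /**
     * Horizontal shift, in text space units, caused by one numeric element
     * of a TJ array. Positive results move forward, negative results back.
     * Font size and horizontal scaling are ignored in this simplified sketch.
     */
    static double shift(double tjNumber) {
        return -tjNumber / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(shift(305.505));  // about -0.3: back, before the space is drawn
        System.out.println(shift(-20.3063)); // about +0.02: forward, after the space
    }
}
```

A large backward shift right before a space character is what makes the space land on top of the ligature glyph instead of in a visible gap.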

My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because the PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ (which might have to do with the fact that the getText() method calculates things like whitespace characters by distances / positional placements).

No, it is not a weirdness of the PDFBox or Adobe Reader text extraction routines, it is a weirdness of the PDF creation process.

Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper / via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as 'normal-single' characters f i / f l

At some point PDFBox does expand the ligatures into their constituent letters; as mentioned in PDFBOX-2548, this happens by design. Whether the choice of which methods return ligatures and which return individual letters is a good and properly documented one might be a different matter.
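
If your own code receives ligature characters and needs plain letters, the Java standard library can perform a comparable expansion. The sketch below is not a description of what PDFBox does internally; it is a stand-alone illustration using Unicode compatibility normalization (NFKC), and the class name is invented for the example.

```java
import java.text.Normalizer;

class LigatureExpander {
    /** Expands compatibility ligatures such as U+FB01 (fi) and U+FB02 (fl). */
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(expand("Spezi\uFB01zierung")); // Spezifizierung
        System.out.println(expand("begrif\uFB02ich"));    // begrifflich
    }
}
```

Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms and so on), so it may change more than just ligatures.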

I added a second test page, from a former volume of the same wordbook. For this volume, a Type1 font has been used. I chose a page where the two words "begrifflich" and "spezifisch" occur (they cause problems, as you can see in the first test). As you can see/test, the described error doesn't occur when extracting the text of this second page! This strengthens my assumption that the OpenType format is the reason for the error.

Looking at the text drawing operations in test2.pdf

[(begrifflich adaptiert und im Sinne eines n)20.2892(i)20.3857(c)20.4141(h)20.3148(t)20.3801(-)20.4198(k)20.3148(a)20.4141(p)]TJ
....
(druck im engeren Sinne eine spezifische, vornehmlich)Tj

In this document there is no such weird ligature and space character use as in the first one. Thus, no reason for PDFBox (or Adobe Reader) to see such space characters.

My question: is there anything I can do to avoid this problem?

Fix the document creation process.

If that is not possible, you might try and change your class derived from PDFTextStripper

  • either to ignore all TextPosition instances containing space characters (including the unwanted ones) and later on deduce correct space characters from gaps or
  • check all recognized spaces for overlapping letters.

The former alternative is easier to implement but somewhat error-prone, especially in densely typeset documents; the latter is more difficult to implement but less error-prone.
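
The first alternative can be sketched without PDFBox on the classpath. In the following Java example the Glyph class is a hypothetical stand-in for TextPosition (only text, x position and width), and the gap threshold is an assumption you would have to tune per document; in a real PDFTextStripper subclass the same logic would work on the TextPosition objects it receives.

```java
import java.util.Arrays;
import java.util.List;

class GapSpacing {
    /** Hypothetical stand-in for a PDFBox TextPosition: text plus horizontal extent. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float end() {
            return x + width;
        }
    }

    /**
     * Rebuilds one line: drawn space characters are dropped, and a space is
     * emitted only where the gap between two glyphs exceeds the threshold.
     * Glyphs are assumed to be sorted from left to right.
     */
    static String rebuildLine(List<Glyph> glyphs, float gapThreshold) {
        StringBuilder line = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (glyph.text.trim().isEmpty()) {
                continue; // ignore every drawn space, wanted or not
            }
            if (previous != null && glyph.x - previous.end() > gapThreshold) {
                line.append(' ');
            }
            line.append(glyph.text);
            previous = glyph;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("Spezifi", 0f, 30f),
                new Glyph(" ", 27f, 3f),          // space drawn on top of the ligature
                new Glyph("zierung", 30.2f, 30f),
                new Glyph("des", 63f, 12f));
        System.out.println(rebuildLine(glyphs, 1f)); // Spezifizierung des
    }
}
```

The weakness mentioned above shows up in the threshold: in densely typeset text, genuine word gaps can be smaller than the chosen value and get lost.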

Question:

I am currently setting the default appearance string to set the text color like this:

String defaultAppearance = "/Helv 12 Tf 0 0 1 rg";

field.setDefaultAppearance(defaultAppearance);

I can't seem to find anywhere whether there are other options for the formatting string besides rg for RGB or g for gray.

Is there a way to set the text color to an rgba color, set the text alpha, or documentation on the format of the default appearance string I could look at?


Answer:

You are looking for documentation on the format of the default appearance string. You can find that (surprise!) in the PDF specification, ISO 32000:

DA string (Required; inheritable) The default appearance string containing a sequence of valid page-content graphics or text state operators that define such properties as the field’s text size and colour.

(ISO 32000-2, Table 228 — Additional entries common to all fields containing variable text)

And thereafter in more detail,

The default appearance string (DA) contains any graphics state or text state operators needed to establish the graphics state parameters, such as text size and colour, for displaying the field’s variable text. Only operators that are allowed within text objects shall occur in this string (see "Figure 9 — Graphics objects"). At a minimum, the string shall include a Tf (text font) operator along with its two operands, font and size. The specified font value shall match a resource name in the Font entry of the default resource dictionary (referenced from the DR entry of the interactive form dictionary; see "Table 224 — Entries in the interactive form dictionary"). A zero value for size means that the font shall be auto-sized: its size shall be computed as an implementation dependent function.

The default appearance string shall contain at most one Tm (text matrix) operator. If this operator is present, the interactive PDF processor shall replace the horizontal and vertical translation components with positioning values it determines to be appropriate, based on the field value, the quadding (Q) attribute, and any layout rules it employs. If the default appearance string contains no Tm operator, the viewer shall insert one in the appearance stream (with appropriate horizontal and vertical translation components) after the default appearance string and before the text-positioning and textshowing operators for the variable text.

(ISO 32000-2 section 12.7.4.3 — Variable text)

According to Figure 9, the operator classes allowed in a text object are general graphics state, color, text state, text showing, text positioning, and marked content.

Among these the Text Showing and Marked Content operators are not graphics state or text state operators, thus the available operators are:

  • General graphics state w, J, j, M, d, ri, i, gs
  • Color CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k
  • Text state Tc, Tw, Tz, TL, Tf, Tr, Ts
  • Text positioning Td, TD, Tm, T*

(ISO 32000-2, Table 50 — Operator categories)

Obviously I cannot copy the specification of all these operators here.

Of special interest, though, is the general graphics state operator gs, which allows you to use an ExtGState resource to set transparency, as already proposed by Tilman in a comment on your question.

One word of warning, though: many PDF processors will merely expect a font setting (Tf) and a simple color setting (rg / g / k) operation and ignore everything else.
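
To make the format concrete, here is a small Java sketch that assembles such a string from its parts. The class and method names are invented for the example, and "GS0" in the usage note is a hypothetical resource name: it only has an effect if an ExtGState resource of that name actually exists in the default resources of the form, and even then a viewer may ignore it, as warned above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class DefaultAppearance {
    // Locale.ROOT keeps the decimal separator a dot, as PDF syntax requires.
    private static String num(double value) {
        DecimalFormat format =
                new DecimalFormat("0.####", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(value);
    }

    /** Builds a DA string from a font resource name, a size and an RGB fill color. */
    static String rgb(String fontName, double size, double r, double g, double b) {
        return "/" + fontName + " " + num(size) + " Tf "
                + num(r) + " " + num(g) + " " + num(b) + " rg";
    }

    /** Same, prefixed with a gs operator naming an ExtGState resource. */
    static String rgbWithGraphicsState(String extGStateName, String fontName,
                                       double size, double r, double g, double b) {
        return "/" + extGStateName + " gs " + rgb(fontName, size, r, g, b);
    }

    public static void main(String[] args) {
        System.out.println(rgb("Helv", 12, 0, 0, 1)); // /Helv 12 Tf 0 0 1 rg
    }
}
```

Calling rgbWithGraphicsState("GS0", "Helv", 12, 0, 0, 1) yields "/GS0 gs /Helv 12 Tf 0 0 1 rg", which could then be passed to field.setDefaultAppearance(...).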

Question:

PDFTextStripper has a functionality to extract text from the whole document, is there a way to extract text only after a certain value when the value is recognized, for example :

A B C D G   1 line

A B C D G   2 line

A B C D G   3 line

QUANTITY  4 line

I would like to start extracting text after it finds Quantity (a String). If anyone has dealt with PDFBox and has a suggestion, it would be much appreciated.

Or is it possible to add to the list only when it hits a line after a value that text will contain?


Answer:

The easiest solution is to capture the whole text and then create a Pattern such as "DESCRIPTION\\s*Reference\\s*QUANTITY(.*)", which captures everything on a single page after the headings mentioned above. Compile it with Pattern.DOTALL, since otherwise . does not match line breaks and (.*) stops at the end of the line.

  1. Create a function that takes the String text as a parameter, locates a single matcher.group(1), and returns a String or Optional<String>.

  2. Create a Pattern and tell it, via the regex, where you would like to start capturing from.
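
A minimal sketch of these two steps in Java follows. It assumes the page text has already been obtained (for example from PDFTextStripper#getText) and uses a simplified pattern built from a single marker word; the class and method names are illustrative only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextAfterMarker {
    /** Returns everything after the first occurrence of the marker, if it occurs at all. */
    static Optional<String> extractAfter(String text, String marker) {
        // DOTALL lets '.' match line breaks, so the group runs to the end of the text.
        Pattern pattern = Pattern.compile(Pattern.quote(marker) + "(.*)", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String pageText = "A B C D G\nA B C D G\nQUANTITY\n10 boxes";
        System.out.println(extractAfter(pageText, "QUANTITY").orElse("<marker not found>"));
    }
}
```

Returning an Optional makes the "marker not found" case explicit instead of handing back null or an empty string.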

Question:

I have more than 1000 PDF files in a folder, each one to be converted and saved in its corresponding text file. I'm a bit new to Java and I'm using PDFBox to do the conversion. I successfully got the code working for one single PDF, but I'm stuck on how to do the conversion for all the PDFs in a single folder. Can someone help me achieve that in Java?

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public final class ExtractPdf
{


public static void main( String[] args ) throws IOException
{
    String fileName = "sample.pdf"; 
    PDDocument document = null;

    try (PrintWriter out = new PrintWriter("out.txt"))
    {
        document = PDDocument.load( new File(fileName));
        PDFTextStripper stripper = new PDFTextStripper();
        String pdfText = stripper.getText(document).toString();
        System.out.println( "Text in the area:" + pdfText);
        out.println(pdfText);

    }
    finally
    {
        if( document != null )
        {
            document.close();
        }
    }
 }
}

Thanks, Free


Answer:

Basically your question is how to go through a directory…

public static void main(String[] args) throws IOException
{
    File dir = new File("....");
    File[] files = dir.listFiles(new FilenameFilter()
    {
        // use anonymous inner class 
        @Override
        public boolean accept(File dir, String name)
        {
            return name.toLowerCase().endsWith(".pdf");
        }
    });
    // null check omitted!
    for (File file : files)
    {
        int len = file.getAbsolutePath().length();
        String txtFilename = file.getAbsolutePath().substring(0, len - 4) + ".txt";
        // check whether txt file exists omitted
        try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(txtFilename), Charsets.UTF_8);
             PDDocument document = PDDocument.load(file))
        {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(document, out);
        }
    }
    // exception catch omitted. Add code here to avoid your whole job
    // dying if only one file is broken
}
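
The file name handling in the loop above can be isolated and checked without any PDF at hand. The following Java sketch uses invented class and method names and covers only the name logic (the filter test and the .pdf to .txt mapping), not the conversion itself.

```java
import java.util.Locale;

class PdfNames {
    /** Case-insensitive check, equivalent to the FilenameFilter used above. */
    static boolean isPdf(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Replaces the trailing ".pdf" (in any letter case) with ".txt". */
    static String txtNameFor(String pdfName) {
        if (!isPdf(pdfName)) {
            throw new IllegalArgumentException("not a .pdf name: " + pdfName);
        }
        return pdfName.substring(0, pdfName.length() - 4) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(txtNameFor("report.PDF")); // report.txt
    }
}
```

Guarding the substring call this way avoids silently mangling names if the method is ever reused on files that did not pass the filter.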

Question:

I am trying to go from text to PDF but have only one of the pages rotated 90 degrees. The main reason is that some of my text documents are a bit too wide and need to be in landscape to look normal. I have tried a few things, but it seems like everything rotates the text too. Is there an easy way to rotate the PDF to landscape but keep the text in the same rotation?

        OutputStream outputStream = response.getOutputStream();
        PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
        Map<String, Documents> documents = getDocuments(user, documentReports);

        try (PDDocument documentToPrint = new PDDocument()){
            for(Document doc : documentReports){
                TextToPDF textToPDF = new TextToPDF();
                textToPDF.setFont(PDType1Font.COURIER);
                textToPDF.setFontSize(8);

                Document documentReport = documents.get(doc.getId());
                try(PDDocument pdDocument = textToPDF.createPDFFromText(new InputStreamReader(new ByteArrayInputStream(documentReport.getReportText().getBytes())))) {
                    pdfMergerUtility.appendDocument(documentToPrint, pdDocument);
                }
            }
            pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
            LocalDateTime localUtcTime = Java8TimeUtil.getCurrentUtcTime();
            documentToPrint.getDocumentInformation().setTitle(localUtcTime.toString());
            response.setHeader("Content-Disposition", "inline; filename=" + localUtcTime + ".pdf");
            response.setContentType("application/pdf");
            documentToPrint.save(outputStream);
        }

Answer:

This might not work for everyone, but I figured it out for my specific requirement. TextToPDF has a method called setLandscape, which has to be called before creating the PDF from text: textToPDF.setLandscape(true);

Question:

I want to add a state to an annotation like the image below

https://imgur.com/a/ZGeQo (Sorry i need at least 10 reputation to post images)

I try with this

PDAnnotationTextMarkup a= new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
a.getCOSObject().setString(COSName.STATE, "Completed");

But that does not work.


Answer:

According to the PDF specification ISO 32000-2:

12.5.6.3 Annotation states

Beginning with PDF 1.5, annotations may have an author-specific state associated with them. The state is not specified in the annotation itself but in a separate text annotation that refers to the original annotation by means of its IRT ("in reply to") entry (see "Table 176 — Additional entries specific to a link annotation"). States shall be grouped into a number of state models, as shown in "Table 174 — Annotation states".

State changes made by a user shall be indicated in a text annotation with the following entries:

  • The T entry (see "Table 172 — Additional entries in an annotation dictionary specific to markup annotations") shall specify the user.
  • The IRT entry (see "Table 176 — Additional entries specific to a link annotation") shall refer to the original annotation.
  • State and StateModel (see "Table 175 — Additional entries specific to a text annotation") shall update the state of the original annotation for the specified user.

Additional state changes shall be made by adding text annotations in reply to the previous reply for a given user.

Table 174 — Annotation states — contains e.g. an entry in the state model "Review" for the state "Completed".

Thus, you have to set the state by adding a new text annotation in reply to the previous reply to the annotation (or, missing that, in reply to the annotation itself) with the state information.

Question:

I am trying to convert a unicode Text File to PDF using PDF box.

Task: My method takes a unicode encoded TextFile as input and output a PDF file.

Problem: The PDFs that are created have zero bytes. It is not writing anything.

I am using Apache PDFBox 2.0.6

This is my code:

public class TexttoPDF {

    public File texttoPDF(File textFile) throws Exception {

        PDDocument document = new PDDocument();
        PDPage blankPage = new PDPage();
        PDFont font = PDType1Font.TIMES_ROMAN;
        PDPageContentStream contentStream = new PDPageContentStream(document, blankPage);




        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(textFile), "UTF8"));


        String str;
        contentStream.beginText();
        contentStream.setFont( font, 12 );
        contentStream.moveTextPositionByAmount( 100, 700 );


        while ((str = in.readLine()) != null) {
            contentStream.drawString(str);


        }


        contentStream.endText();

        document.save( pdffile.getName());
        contentStream.close();
        document.close();
        in.close();



    return pdffile;

    }
}

How can this be fixed?


Answer:

Close your content stream before saving, not after saving. So change

    document.save( pdffile.getName());
    contentStream.close();

to

    contentStream.close();
    document.save( pdffile.getName());

(This is described in the FAQ)

Also add the page to your document after calling new PDPage():

document.addPage(blankPage);

Question:

I'm using PDFBox 2.1.0-SNAPSHOT in a Java 8 Spring Boot app to add a transparent text watermark on each page of an otherwise image-only PDF document. Visually, it works fine, as I'm able to see the image through the watermark on any reader and all browser-inline PDF Viewers.

However, when I print these documents from IE, an opaque white background covers up the image behind the text. The watermark text is still transparent, but the bounding box of the text is white. Again, printing from all other browsers works fine. (Gotta love IE.)

Here's the code I'm using to add the watermark to each page:

public void watermark(File pdfFile, OutputStream output) throws IOException {
    try (final InputStream sourceStream = new FileInputStream(pdfFile);
         final PDDocument document = PDDocument.load(sourceStream)) {

        for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
            PDPage currPage  = document.getPage(pageNumber);
            writeWatermarkOnPage(document, currPage);
        }
        document.save(output);
    }
}

private void writeWatermarkOnPage(PDDocument document, PDPage page) throws IOException {
    try (PDPageContentStream contentStream = new PDPageContentStream(
            document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {

        PDRectangle rect = page.getBBox();
        // Set the opacity
        PDExtendedGraphicsState extendedGraphicsState = new PDExtendedGraphicsState();
        extendedGraphicsState.setNonStrokingAlphaConstant(0.3f);
        contentStream.setGraphicsStateParameters(extendedGraphicsState);

        // Add the text
        contentStream.beginText();
        contentStream.setFont(PDType1Font.HELVETICA_BOLD, 75);
        contentStream.setNonStrokingColor(Color.GRAY);
        AffineTransform at = new AffineTransform(1, 0, 0, 1,
                                                 rect.getUpperRightX() / 4,
                                                 rect.getUpperRightY() / 4);
        Matrix matrix = new Matrix(at);
        matrix.rotate(Math.toRadians(45));
        contentStream.setTextMatrix(matrix);
        contentStream.showText("WATERMARK-TEXT");
        contentStream.endText();
    }
}

I tried using the Overlay class, but that had the same result. I tried removing the rotation and transform, but that didn't help. Only if I remove the nonStrokingAlphaConstant setting will the opaque white background go away when printed from IE's inline PDF renderer, but then the text isn't transparent anymore.

Is there something else I need to do to tell every PDF reader in every context that the background of the text should be completely transparent?

UPDATE

Here is an example PDF Document that shows this behavior. On Windows, I just drag & drop this into IE, print it, and the white background of the watermark text covers up the underlying image.

Here is another example PDF created and watermarked with the same code that actually prints just fine from IE. The watermark is transparent with no white background.

I believe the difference is that the broken documents are legal sized images, while the working document is letter size. Perhaps something related to scaling is causing the issue?


Answer:

I was able to meet my goal by using a transparent PNG as the watermark instead of adding it as text. The new watermarked files now print correctly from all browsers, including IE. Here is the code I used to add the watermark to every page of the PDF:

private static final String WATERMARK_RESOURCE_PATH = "/watermark/hcro_copy.png";

public void watermark(File pdfFile, OutputStream output) throws IOException {
    try (final InputStream sourceStream = new FileInputStream(pdfFile);
         final PDDocument document = PDDocument.load(sourceStream)
    ) {
        for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
            PDPage currPage  = document.getPage(pageNumber);
            writeWatermarkWithTransparentImageOnPage(document, currPage);
        }
        document.save(output);
    }
}

private void writeWatermarkWithTransparentImageOnPage(PDDocument document, PDPage page)
        throws IOException {
    try (PDPageContentStream contentStream = new PDPageContentStream(
            document, page, PDPageContentStream.AppendMode.APPEND, true, true);
         InputStream watermarkFileStream = getWatermarkFileStream()
    ) {
        // Load watermark image
        BufferedImage image = ImageIO.read(watermarkFileStream);
        PDImageXObject pdxImage = LosslessFactory.createFromImage(document, image);

        // Set the opacity
        PDExtendedGraphicsState extendedGraphicsState = new PDExtendedGraphicsState();
        extendedGraphicsState.setNonStrokingAlphaConstant(0.35f);
        contentStream.setGraphicsStateParameters(extendedGraphicsState);

        // Center watermark image on page
        PDRectangle rect = page.getBBox();
        int imageX = Math.floorDiv((Math.round(rect.getWidth()) - pdxImage.getWidth()), 2);
        int imageY = Math.floorDiv((Math.round(rect.getHeight()) - pdxImage.getHeight()), 2);

        contentStream.drawImage(pdxImage, imageX, imageY);
    }
}

private InputStream getWatermarkFileStream() {
    try {
        Resource resource = new ClassPathResource(WATERMARK_RESOURCE_PATH);
        return resource.getInputStream();
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }
}

I'd still be open to a text-only answer, but this works for me now.

Question:

I am currently using pdfbox 1.8 to analyze PDF documents. Below is a very stripped down example of what I am doing.

 import java.util.List;
 import java.io.IOException;
 import javax.swing.JFileChooser;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.pdmodel.PDPage;
 import org.apache.pdfbox.pdmodel.common.PDStream;

 public class Main 
 {
   private static PDDocument reader;

   public static void main(String[] args)
   {
       JFileChooser chooser = new JFileChooser();
       int result = chooser.showOpenDialog(null);
       if(result == JFileChooser.APPROVE_OPTION)
       {
           try
           {
               reader = PDDocument.load(chooser.getSelectedFile());
               for(int pagenum = 1; pagenum <= reader.getNumberOfPages(); pagenum++)
               {
                   System.out.println("===== Page:" + pagenum + " ======");
                   System.out.println(extract(pagenum));
               }

           }
           catch(Exception e) { e.printStackTrace(); }

       }
   }

   public static String extract(int pagenum) throws IOException
   {
       List allPages = reader.getDocumentCatalog().getAllPages();
       PDPage page = (PDPage) allPages.get(pagenum-1);
       PDStream contents = page.getContents();
       CustomPDFTextStripper stripper = new CustomPDFTextStripper();        
       if (contents != null) 
       {
           stripper.processStream(page, page.findResources(), page.getContents().getStream());
       }
       return stripper.getContents();
   }
 }

and

 import org.apache.pdfbox.util.PDFTextStripper;
 import java.io.IOException;
 import org.apache.pdfbox.util.TextPosition;

 public class CustomPDFTextStripper extends PDFTextStripper
 {
   private final StringBuilder builder;
   private float lastBase;
   public CustomPDFTextStripper() throws IOException
   {
       super.setSortByPosition(true);
       builder = new StringBuilder();
       lastBase = Float.MAX_VALUE;
   }

   public String getContents() { return builder.toString(); }

   @Override
   protected void processTextPosition(TextPosition textPos)
   {
       float ascent = textPos.getY();
       if(ascent > lastBase)
           builder.append("\n");
       lastBase = textPos.getY() + textPos.getHeight();
       builder.append(textPos.getCharacter());
       // I want to be able to do stuff here and
       // I need to read spaces and newline characters
   }
 }

I can't seem to find an equivalent solution in pdfbox 2.0 snapshot (I know it is unstable and has not been released yet). I tried to use something like:

 CustomPDFTextStripper stripper = new CustomPDFTextStripper();        
 StringWriter dummy = new StringWriter();
 stripper.setPageStart(""+(pagenum-1));
 stripper.setPageEnd(""+(pagenum-1));
 stripper.writeText(reader, dummy);

but it does not process spaces or give accurate textPos data in the processTextPosition method.

Any ideas on how to get all of the TextPosition data in 2.0, the same as in 1.8?

========== EDIT 26JUN2015 8:00 PM CST ===========

OK, I have had some time to look at it and found the problem. getWidthOfSpace() returns dramatically different results in 1.8 and 2.0.

In 1.8 it is around 2.49 - width of characters are around 5

In 2.0 it is around 27.5 - width of characters are around 5

Obviously 27.5 is wrong in 2.0

just run the following test and you will see

 @Override
 protected void processTextPosition(TextPosition textPos)
 {
    float spaceWidth = textPos.getWidthOfSpace();
    float width = textPos.getWidth();
    System.out.println(textPos.getCharacter() + " - Width of Space=" + spaceWidth + " - width=" + width);
    builder.append(textPos.getCharacter());
 }

(Of course getUnicode() for 2.0 instead of getCharacter())

===== EDIT 27JUN2015 8:00 PM CST ======

Here is a link to the PDF used in the test: Hello World


Answer:

There indeed is an error in the current calculation of the width of space. PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector) currently (it's a SNAPSHOT, the situation may differ this evening) calculates the width like this:

float horizontalScalingText = getGraphicsState().getTextState().getHorizontalScaling()/100f;
[...]
// the space width has to be transformed into display units
float spaceWidthDisplay = spaceWidthText * fontSizeText * horizontalScalingText *
        textRenderingMatrix.getScalingFactorX()  * ctm.getScalingFactorX();

(PDFTextStreamEngine.java in revision 1688116)

but the textRenderingMatrix has been calculated in PDFStreamEngine.showText(byte[]) using:

float horizontalScaling = textState.getHorizontalScaling() / 100f;
[...]
Matrix parameters = new Matrix(
        fontSize * horizontalScaling, 0, // 0
        0, fontSize,                     // 0
        0, textState.getRise());         // 1
[...]
Matrix textRenderingMatrix = parameters.multiply(textMatrix).multiply(ctm);

(PDFStreamEngine.java in revision 1688116)

Thus, both the font size and the horizontal scaling are multiplied twice into the space width. Furthermore, the current transformation matrix is both fully multiplied into textRenderingMatrix and partially used as ctm.getScalingFactorX(); this can result in most interesting combined results.

Most likely it should suffice to remove these values as explicit factors from the spaceWidthDisplay calculation in PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector).
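
The effect of the double multiplication can be reproduced with plain numbers. The Java sketch below is a scalar model only: every matrix is reduced to its x scaling factor (no rotation or skew), the names are invented, and simplifiedWidth is merely one reading of the fix suggested above, not the actual PDFBox code.

```java
class SpaceWidthModel {
    /** x scale of the text rendering matrix in this model: parameters * text matrix * CTM. */
    static float renderingScaleX(float fontSize, float hScale,
                                 float textMatrixScaleX, float ctmScaleX) {
        return fontSize * hScale * textMatrixScaleX * ctmScaleX;
    }

    /** Mirrors the snapshot formula: font size, horizontal scaling and CTM enter twice. */
    static float snapshotWidth(float spaceWidthText, float fontSize, float hScale,
                               float textMatrixScaleX, float ctmScaleX) {
        return spaceWidthText * fontSize * hScale
                * renderingScaleX(fontSize, hScale, textMatrixScaleX, ctmScaleX)
                * ctmScaleX;
    }

    /** Drops the explicit factors and relies on the text rendering matrix alone. */
    static float simplifiedWidth(float spaceWidthText, float fontSize, float hScale,
                                 float textMatrixScaleX, float ctmScaleX) {
        return spaceWidthText
                * renderingScaleX(fontSize, hScale, textMatrixScaleX, ctmScaleX);
    }

    public static void main(String[] args) {
        // Assumed example values: space width 0.25 text units, 10 pt font, no other scaling.
        System.out.println(snapshotWidth(0.25f, 10f, 1f, 1f, 1f));   // 25.0
        System.out.println(simplifiedWidth(0.25f, 10f, 1f, 1f, 1f)); // 2.5
    }
}
```

With a font size of about 10, the snapshot result is roughly ten times too large, which is the order of magnitude of the 27.5 versus 2.49 discrepancy reported in the question.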


In version 1.8.9 the text space width is calculated like this in PDFStreamEngine.processEncodedText(byte[]):

float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
                        * textMatrix.getXScale() * ctm.getXScale();

This can give rise to funny results, too, for interesting current transformation and text matrices, but the factors of interest above were not multiplied twice into the result.

Question:

I am working on a simple full text inverted index trying to build an index of words that I extract from PDF files. I am using PDFBox library to achieve this.

However, I would like to know how to define what counts as a word to index. The way my indexing works, every run of characters delimited by spaces is a word token. For example,

This string, is a code.

In this case: the index table would contain

This
string,
is
a
code.

The flaw here is that a token like string, comes with a comma, whereas I think string alone would be sufficient, because nobody searches for string, or code.

Back to my question: is there a specific rule I could use to define my word tokens in a way that prevents this kind of issue?

Code:

File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();

for (File file : listOfFiles) {
   if (file.isFile()) {
      HashSet<String> uniqueWords = new HashSet<>();
      String path = "D:\\PDF1\\" + file.getName();
      try (PDDocument document = PDDocument.load(new File(path))) {    
          if (!document.isEncrypted()) {    
             PDFTextStripper tStripper = new PDFTextStripper();
             String pdfFileInText = tStripper.getText(document);
             String lines[] = pdfFileInText.split("\\r?\\n");
             for(String line : lines) {
                String[] words = line.split(" ");    
                for (String word : words) {
                    uniqueWords.add(word);   
                }

             }                            
          }
       } catch (IOException e) {
         System.err.println("Exception while trying to read pdf document - " + e);
       }
   }
}

Answer:

If you want to remove common punctuation, you could do:

for(String word : words) {
    uniqueWords.add(word.replaceAll("[.,!?]", ""));
}

This removes all periods, commas, exclamation marks, and question marks.


If you also want to get rid of quotes you can do:

uniqueWords.add(word.replaceAll("[.,?!\"]", ""));
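
As an alternative to listing punctuation marks one by one, the text can be split on every run of characters that is neither a letter nor a digit. The following is only a minimal sketch of that idea using the standard library; the class name WordTokenizer, the method name tokenize, and the choice to lowercase tokens are assumptions made for illustration and are not part of the question or of PDFBox.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class WordTokenizer {

    // Splits the text on runs of characters that are neither Unicode letters
    // nor decimal digits, and collects the distinct tokens.
    // Lowercasing is an assumption: it makes "This" and "this" one index entry.
    static Set<String> tokenize(String text) {
        Set<String> words = new TreeSet<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            // split can yield an empty first token if the text starts with a separator
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [a, code, is, string, this]
        System.out.println(tokenize("This string, is a code."));
    }
}
```

In the loop from the question, the string returned by tStripper.getText(document) could be passed to such a method instead of splitting each line on single spaces. Note that this rule also splits on apostrophes and hyphens, so don't becomes don and t; whether that is acceptable depends on what the index is meant to match.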

Question:

File example is here

I think I need some help from mkl again: in the attached file there are some hidden () characters which PDFTextStripper extracts. So far I don't see what makes them invisible (for example, see the column 6 values, which are all in brackets):

Publix Liquors 1,600 2.5 1/1/2014 12/31/2018 ($ 24,000.00) ($ 4,032.00) ($ 28,032.00) BayCare Health Systems 3,200 5 7/30/2004 7/31/2018 ($ 51,200.00) ($ 9,648.00) ($ 60,848.00) No rent change until Option Period 8/11/2018

..............

Could you please at least point out why they are hidden in this case? Thanks in advance!


Answer:

In this case you deal with actual transparency.

The hidden brackets are created by blocks of instructions like this

q
/Alpha3 gs
0 0 0 rg
BT
0 Tr
/Font0 14.299999 Tf
1.0 0 0 -1.0 537.66486 195.42578 Tm
0 0 Td
<037F>
Tj
ET
Q 

The extended graphics state Alpha3 in the resources is declared as

14 0 obj
<<
  /CA 0
  /ca 0
>>
endobj 

Thus, /Alpha3 gs sets both the stroke and the fill opacity to 0, i.e. anything drawn using fill or stroke is completely transparent.

In the PDFBox PDFTextStripper these values can be retrieved in processTextPosition from the current graphics state (getGraphicsState) as the properties getAlphaConstant (stroke opacity, /CA) and getNonStrokeAlphaConstant (fill opacity, /ca).
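
To make the check concrete, here is a small sketch of the visibility decision on its own, written against plain values so that it runs without PDFBox on the classpath. The class TextVisibility and the method isInvisible are hypothetical names introduced here. The commented-out override shows how the values might be obtained in a PDFTextStripper subclass under PDFBox 2.0.x; it is an untested assumption about the wiring, not code from the answer.

```java
class TextVisibility {

    // Text leaves no visible mark if neither of its painting operations is visible:
    // a fill with fill opacity 0 (/ca 0) and a stroke with stroke opacity 0 (/CA 0)
    // both paint nothing. "fills" and "strokes" describe the text rendering mode
    // (0 Tr in the stream above means fill only).
    static boolean isInvisible(double strokeAlpha, double fillAlpha,
                               boolean fills, boolean strokes) {
        boolean fillVisible = fills && fillAlpha > 0;
        boolean strokeVisible = strokes && strokeAlpha > 0;
        return !fillVisible && !strokeVisible;
    }

    // Assumed wiring in a PDFTextStripper subclass (PDFBox 2.0.x), shown as a
    // comment because it needs the PDFBox library:
    //
    // @Override
    // protected void processTextPosition(TextPosition text) {
    //     PDGraphicsState gs = getGraphicsState();
    //     RenderingMode mode = gs.getTextState().getRenderingMode();
    //     if (!TextVisibility.isInvisible(gs.getAlphaConstant(),
    //             gs.getNonStrokeAlphaConstant(), mode.isFill(), mode.isStroke())) {
    //         super.processTextPosition(text);
    //     }
    // }

    public static void main(String[] args) {
        // The brackets in the question: /CA 0, /ca 0, rendering mode 0 (fill only)
        System.out.println(isInvisible(0.0, 0.0, true, false)); // prints true
    }
}
```

Skipping the call to super.processTextPosition drops the glyph from the extracted text. This only covers transparency set through the extended graphics state; text hidden by other means (clipping, covering shapes, white-on-white color) would need separate checks.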

Question:

The command used to extract the text is java -jar pdfbox-app-2.0.7.jar ExtractText -console DiffSzSpaceIssue.pdf. Its output is:

This%is%one%
This%is%two%
This%is%three%
This%is%four%

I checked the PDF with PDFDebugger and see the following entry for the troublesome "%":

Code  Glyph Name  Unicode Character  Glyph
37    1           %                  None

Can you please explain how to properly extract text in such cases, where there are Unicode values but the glyphs are not present? I'm expecting the output below, as that "%" character is never rendered in the PDF.

This is one
This is two
This is three
This is four

Input pdf file is here.


Answer:

Apparently the Unicode mapping in some PDFs can be wrong; in such cases one needs to remove the Unicode mapping and retry the extraction. Here the wrong mapping is the one shown in the question: % -> None (Unicode -> Glyph). See https://stackoverflow.com/a/45922162/6935152

Question:

I'm trying to convert a PDF document to text and display it in a text area when the OK button is pressed. The UI was created in NetBeans 8.1. There are no errors, but I'm not getting the output. The code is attached below.

 private void okActionPerformed(java.awt.event.ActionEvent evt) {                                   

    try{ 
       String s = null;
       StringBuilder sb;
       File file = new File("D.pdf");
       PDDocument pdDoc = PDDocument.load(file);
       PDFTextStripper pdfStripper = new PDFTextStripper();
       String parsedText = pdfStripper.getText(pdDoc);
       textArea1.setText(parsedText);
      }catch (Exception e) {
        System.out.println(e);
     }
}    

This is the error that I get when I click the button:

run: java.lang.UnsupportedOperationException: Not supported yet.


Answer:

this is the error that i get when i click the button

Exception in thread "AWT-EventQueue-0"
java.lang.UnsupportedOperationException: Not supported yet.
    at textarea1.append(textarea1.java:22)
    at clickdb.okActionPerformed(clickdb.java:97)

This indicates that the problem is in your textarea1 class, not in your PDFBox usage: the exception is thrown from textarea1.append, a method that apparently still has an auto-generated "Not supported yet" body. You might want to inspect that class or post it for further analysis.

It is quite surprising, though, that the stack trace indicates that you call append, not setText as in the code in your question. If that stack trace is from a test run with slightly different code, please update your information and include both the current code and a current stack trace.