Hot questions for Using PDFBox in itext

Question:

I have tagged a pdf using pdfbox.

How I tagged it: Instead of extracting the text and tagging it, I am adding MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P <</MCID 0>> BDC ... EMC) and then adding that marked content to the structure tree in the document catalog.

What is working: Almost everything works as in a completely tagged PDF. It also passes the PAC3 accessibility checker.

//Adding tags
tokens.add(++ind, type_check(t_ype, page));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++;
tokens.add(++ind, currentMarkedContentDictionary);
tokens.add(++ind, Operator.getOperator("BDC"));

// Adding marked content to root structure
structureElement.appendKid(markedContent);

currentSection.appendKid(structureElement);             

What is not working: After tagging, one feature is missing from the tag structure. There is an option called "Find Tag from Selection". It is not working. When I select some text and press "Find Tag from Selection", it jumps to the last tag in the root structure. Please find the PDF at the link below.

https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing

Parent tree:

https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing

extra doc with tagging and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing

Please help me to solve this problem.

New problem: I observed the following.

While JAWS is reading my tagged document, I press Ctrl+Shift+5 on a Windows machine (in Adobe Acrobat DC). This opens a dialog with a drop-down offering "Read based on tagged structure" or "Top left to bottom right", and below it two radio buttons: "Read current page" and "Read all pages".

I selected "Read based on tagged structure" and "Read current page". Now JAWS does not read the tag structure. But if I use the same document with "Read entire document", it is read perfectly.

Link to doc:

https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing

Any help?


Answer:

A nesting issue

How I tagged it: Instead of extracting the text and tagging it, I am adding MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P <</MCID 0>> BDC ... EMC)

You're doing this incorrectly. See for example the start of the page content stream in your document:

BT
0 i
/C0_0 18 Tf
41.91 740.175 Td
/H2 <</MCID  0  >> BDC
( \) F M M P  8 P S M E) Tj
ET
/TouchUp_TextEdit MP
BT
/C0_1 14 Tf
EMC 

Focusing on the beginning and end of text objects and marked content, we see that you have BT ... BDC ... ET ... BT ... EMC

According to the specification, though:

When the marked-content operators BMC, BDC, and EMC are combined with the text object operators BT and ET (see 9.4, "Text Objects"), each pair of matching operators (BMC…EMC, BDC…EMC, or BT…ET) shall be properly (separately) nested. Therefore, the sequences

BMC             BT
  BT              BMC
    …    and         …
  ET              EMC
EMC             ET

are valid, but

BMC             BT
  BT              BMC
    …    and         …
  EMC             ET
ET              EMC

are not valid.

(ISO 32000-1 section 14.6 "Marked Content")

This issue was fixed in the second shared PDF, res1.pdf.
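
The nesting rule can be checked mechanically before a content stream is written back. The following is only a minimal sketch that works on a plain list of operator names; the class and method names are made up for illustration, and it does not parse real PDF tokens or operands.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class MarkedContentNesting {

    /** Returns true if BT/ET and BMC|BDC/EMC pairs in the operator sequence are properly nested. */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET":
                    // ET must close a text object, not a marked-content sequence
                    if (open.isEmpty() || !open.pop().equals("BT")) return false;
                    break;
                case "EMC":
                    // EMC must close a marked-content sequence, not a text object
                    if (open.isEmpty() || open.pop().equals("BT")) return false;
                    break;
                default:
                    break; // any other operator does not affect nesting
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        // valid: BT BDC ... EMC ET
        System.out.println(isProperlyNested(List.of("BT", "BDC", "Tj", "EMC", "ET")));
        // invalid, as in the stream shown above: BT BDC ... ET BT ... EMC
        System.out.println(isProperlyNested(List.of("BT", "BDC", "Tj", "ET", "BT", "EMC")));
    }
}
```

With PDFBox, the operator names could be taken from the parsed token list, but that wiring is left out here.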

Missing ParentTree and StructParents

The problem your question focuses on is

There is an option called "Find Tag from Selection". It is not working.

Finding a tag from a selection essentially means that you have the MCID of some content stream instruction and you search the structure tree for the structure element referencing that marked content ID.

How PDF processors are expected to do this is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2):

Because a stream cannot contain object references, there is no way for content items that are marked-content sequences to refer directly back to their parent structure elements (the ones to which they belong as content items). Instead, a different mechanism, the structural parent tree, shall be provided for this purpose. For consistency, content items that are entire PDF objects, such as XObjects, shall also use the parent tree to refer to their parent structure elements.

The parent tree is a number tree, accessed from the ParentTree entry in a document’s structure tree root. The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item.

Your PDF does not have that ParentTree at all, and your page does not contain a StructParents entry to look up in a parent tree. Thus, the prescribed way to get from marked content to the structure tree is not available.

A ParentTree was added in the third shared PDF, new.pdf.
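
To make the lookup path concrete, here is a small model of it in plain Java. It is a sketch under simplifying assumptions: the parent tree is represented as a map from StructParents keys to lists, structure elements are represented by their type names only, and the class and method names are invented for this illustration. It is not PDFBox API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ParentTreeLookup {

    /**
     * Resolves the structure element for a marked-content sequence.
     *
     * @param parentTree    number tree: StructParents key -> array of structure elements
     * @param structParents the page's StructParents value, or null if the page has none
     * @param mcid          marked-content identifier found in the content stream
     */
    static Optional<String> findStructureElement(Map<Integer, List<String>> parentTree,
                                                 Integer structParents, int mcid) {
        if (parentTree == null || structParents == null) {
            return Optional.empty(); // no ParentTree or no StructParents: lookup impossible
        }
        List<String> elements = parentTree.get(structParents);
        if (elements == null || mcid < 0 || mcid >= elements.size()) {
            return Optional.empty();
        }
        // the MCID is used as a zero-based index into the page's array
        return Optional.ofNullable(elements.get(mcid));
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> parentTree = Map.of(0, List.of("H2", "P", "P"));
        System.out.println(findStructureElement(parentTree, 0, 1));    // page with StructParents 0
        System.out.println(findStructureElement(parentTree, null, 1)); // page without StructParents
    }
}
```

The second call shows the situation of the first shared PDF: without a StructParents entry there is nothing a viewer can resolve.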

Incorrect ParentTree entries

While in new.pdf you have a ParentTree, its contents are clearly incorrect:

The ParentTree is a number tree, i.e. integers are mapped to something here, so there obviously must not be multiple entries for the same integer key.

Furthermore, looking inside one of those values:

one sees that you claim that the following StructElem is the value for all marked content IDs:

Inspecting this StructElem further, one sees that it represents the final paragraph on the final page.

Thus, your observation

Now instead of "selection not found" it is highlighting the last <P> tag in parent tree, irrespective of what we selected.

is what one can expect, if one can expect any reasonable behavior at all from a ParentTree structure broken so badly.

Actually there was not only this new.pdf but also res.pdf and tagged without altext.pdf with ParentTrees, but all these ParentTrees were broken like the tree of new.pdf.

You might want to start inspecting the structures you create when analyzing an unwanted behavior.

Another issue with parent tree entries

The previously described issue in the parent trees has meanwhile been resolved: different pages now have different struct parents, and the parent tree arrays now reference the struct elements for distinct MCIDs.

For some documents a different error occurs now, though, e.g. "res29_08_19.pdf". Here the parent tree starts like this:

In particular the first entry in the array is for MCID 3, the second for MCID 4, ...

This is invalid according to the specification:

The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.

(ISO 32000-1 section 14.7.4.4 "Finding Structure Elements from Content Items")

Thus, the first entry must be for MCID 0, the second for MCID 1, ...

You objected in a comment

No I used 0 and 1 Mcid's for Artifacts.

But as a corollary of the above: Do not give MCIDs to marked content sequences you don't have a structure element for! MCIDs are for going back and forth between the structure hierarchy and the content streams. If you mark a piece of content without having a structure element for it, don't give it an MCID.
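
As an illustration of the indexing rule, the following sketch builds the per-page parent tree array from a map of MCIDs to structure elements and refuses input whose MCIDs do not run from 0 upwards without gaps. That strictness is a choice made for this example to mirror the advice above. The names are hypothetical and structure elements are again represented by plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParentTreeArrayBuilder {

    /**
     * Builds the parent tree array for one page: index i holds the structure
     * element that owns MCID i. Throws if the MCIDs are not exactly 0..n-1.
     */
    static List<String> build(Map<Integer, String> structElemByMcid) {
        List<String> array = new ArrayList<>();
        // TreeMap iterates the MCIDs in ascending order
        for (Map.Entry<Integer, String> e : new TreeMap<>(structElemByMcid).entrySet()) {
            if (e.getKey() != array.size()) {
                throw new IllegalArgumentException(
                        "Expected MCID " + array.size() + " but found MCID " + e.getKey());
            }
            array.add(e.getValue());
        }
        return array;
    }

    public static void main(String[] args) {
        System.out.println(build(Map.of(0, "H2", 1, "P")));
        try {
            build(Map.of(3, "P", 4, "P")); // MCIDs 0..2 were spent on artifacts
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```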

Yet another issue with parent tree entries

You again report problems with your newest file mathpdf.pdf. And indeed, there are issues; Adobe Acrobat Preflight reports a five-page list of inconsistent parent tree mappings like this:

In contrast to the previous issues, the cause does not become clear by looking at the parent tree alone; one also has to look at the structure hierarchy.

Doing so, though, one peculiarity immediately hits the eye: In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.

For example let's look at the MCID 0 on the first page. In the structure hierarchy you have:

In the parent tree you have:

You should have simply referenced object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array for page one instead of that in-between object 62 which claims to have that object 238 as parent and MCID 0 as kid.

The reported inconsistency may be due to the fact that the node referenced from the parent tree (in object 62) claims to be a P paragraph with a parent node (in object 238) which is a Span. That is not allowed: a paragraph may contain a span, but it cannot be contained in one.

Question:

I want to add a word in a sentence of a PDF content.

For example:

This is a sample content.

I want to insert a word in that content like this output.

This is a nice sample content.

This is sample code for iText that I found on the internet. It assumes that the content already exists and that we want to modify it by adding text to a sentence.

try {
        //Create PdfReader instance.
        PdfReader pdfReader =
                new PdfReader(SRC);

        //Create PdfStamper instance.
        PdfStamper pdfStamper = new PdfStamper(pdfReader,
                new FileOutputStream(DEST));

        //Create BaseFont instance.
        BaseFont baseFont = BaseFont.createFont(
                BaseFont.TIMES_ROMAN,
                BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

        //Get the number of pages in pdf.
        int pages = pdfReader.getNumberOfPages();
        System.out.println(pdfStamper.getOverContent(1));
        //Iterate the pdf through pages.
        for(int i=1; i<=pages; i++) {
            //Contain the pdf data.
            PdfContentByte pageContentByte =
                    pdfStamper.getOverContent(i);
            pageContentByte.setFlatness(89);

            pageContentByte.beginText();
            //Set text font and size.
            pageContentByte.setFontAndSize(baseFont, 14);

            pageContentByte.setTextMatrix(50, 720);

            //Write text
            pageContentByte.setWordSpacing(12);
            pageContentByte.showText("hello world");
            pageContentByte.endText();
        }

        //Close the pdfStamper.
        pdfStamper.close();

        System.out.println("PDF modified successfully.");
    } catch (Exception e) {
        e.printStackTrace();
    }

I tried iText and PDFBox, but neither of them worked.

I can get the objects in the PDF document using the PDFStreamParser of PDFBox:

PDFOperator{Td}, COSArray{[COSString{Name }, COSFloat{163.994}, COSString{____________________________________________________}, COSFloat{-8.03223}, COSString{________________________________________________________}]}, PDFOperator{TJ}, COSInt{19}, PDFOperator{TL}, PDFOperator{T*}, COSArray{[COSString{T}, COSInt{36}, COSString{itle}, COSFloat{0.997925}, COSString{ }, COSFloat{-94.9982}, COSString{_____________________________________________________________________________________________________________}]}, PDFOperator{TJ}, PDFOperator{T*}, COSArray{[

How can I implement code that inserts text?


Answer:

Not.

PDF is not a WYSIWYG format. Internally, it's more like a file containing code. It has instructions for moving a cursor around and for drawing text and graphics at the cursor position.

Then there's the fact that most instructions get packaged into "objects". All objects are listed in a cross-reference table that records the byte offset of each of them.

So inserting anything in a pdf-document will cause problems on 2 levels.

  1. You would mess up the byte-offset of everything in the document
  2. You would need to unscramble all the existing rendering operations to make sense of the document (to derive structure like lines of text, paragraph, etc) so that you can properly re-flow the content after you've inserted something.

Hence my short answer. You can't. And that immediately explains why none of the pdf toolkits you've tried can do it. It's simply an insanely hard task.
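
The first of the two problems can be illustrated with a toy example. The sketch below is not a PDF parser: it treats a short ASCII string as the "file", records where each "N 0 obj" line starts (roughly what a cross-reference table stores), and shows that inserting a single word moves every later object. All names are made up for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetShiftDemo {

    /** Records the offset of every "N 0 obj" line, similar to what a cross-reference table holds. */
    static Map<Integer, Integer> indexObjects(String file) {
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        int pos = 0;
        for (String line : file.split("\n", -1)) {
            if (line.endsWith(" 0 obj")) {
                int objectNumber = Integer.parseInt(line.substring(0, line.indexOf(' ')));
                offsets.put(objectNumber, pos);
            }
            pos += line.length() + 1; // +1 for the newline; ASCII only, so chars equal bytes
        }
        return offsets;
    }

    public static void main(String[] args) {
        String original = "1 0 obj\n(This is a sample content.)\nendobj\n2 0 obj\n(next)\nendobj\n";
        String edited = original.replace("a sample", "a nice sample");
        System.out.println(indexObjects(original)); // offsets as stored in the file
        System.out.println(indexObjects(edited));   // object 2 has moved, the stored offset is stale
    }
}
```

A real library rewrites the cross-reference table when saving, so this part is solvable; the second problem, re-flowing the content, is the one that makes the task so hard.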

Question:


Answer:

The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.

Allow me to predict your response.

"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"

Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software cannot give you the page number you see as a human. A machine doesn't know where to look!

If you know something about PDF, I can predict your next reply.

"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"

Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, or a paragraph, or a list, ... But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content. They are marked as artifacts, along with headers, footers and other items on a page that are not considered real content. There is no way to distinguish page numbers from those other artifacts.

"Then what are these page labels about?" you ask.

Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.

This is the long answer. The short answer is simple: You are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).

Question:

I'm looking for a Java PDF merger solution that lets me stream the merged PDF to the client while I am still fetching the individual PDF parts from a REST API. Pseudocode would look something like this:

public void doGet(HttpServletRequest req, HttpServletResponse res) throws Exception {

    ServletOutputStream sOut = res.getOutputStream();

    MergeDocument merger = MergeDocument.merge(sOut);

    for (int i = 0; i < 1000; i++) {

        byte[] contentPDF = restClient.get("http://mywebsite.com/files/mypdf"+i+".pdf");
        merger.append(contentPDF);
        sOut.flush(); // sending merged PDF bytes now
    }

    sOut.close();
}

My point is not to waste heap memory by holding all PDFs in memory before starting to send the result to the user. In other words, as soon as I get the bytes of one PDF from the REST API, I want to stream them to the user.

Hope someone can help me :)


Answer:

Using itextpdf

package com.example.demo.controller;

import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/pdf")
public class PdfMerger {

  @GetMapping
  public void merge(HttpServletResponse response) {
    Document document = new Document(PageSize.LETTER);

    response.setContentType("application/pdf");
    response.setHeader("Content-disposition", "attachment; filename=\"merged.pdf\"");

    OutputStream outputStream = null;
    try {
      outputStream = response.getOutputStream();
      PdfCopy copy = new PdfCopy(document, outputStream);

      document.open();

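      for (InputStream file : getPdfs()) {
        PdfReader reader = new PdfReader(file);
        copy.addDocument(reader); // writes directly to the output stream
        copy.freeReader(reader);  // let PdfCopy release what it holds for this reader
        reader.close();
      }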

      outputStream.flush();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      if (document.isOpen()) {
        document.close();
      }
      try {
        if (outputStream != null) {
          outputStream.close();
        }
      } catch (IOException ioe) {
        ioe.printStackTrace();
      }
    }
  }

  private List<InputStream> getPdfs() {
    List<InputStream> list = new ArrayList<>();

    for (int i = 0; i < 10; i++){
      list.add(PdfMerger.class.getResourceAsStream("/pdf/1.pdf"));
      list.add(PdfMerger.class.getResourceAsStream("/pdf/2.pdf"));
    }

    return list;
  }

}

Question:

I want to create my custom dynamic XFA pdf form using iText or PDFBox.

For example, purchase order form given here, Populate dynamic XFA pdf form itext

I want to generate such PDF form using java library. How can I achieve this?


Answer:

No, the creation of XFA documents is not supported using iText or PDFBox. There are plans to start a project that would allow you to create XDP templates at iText Group, but there is no ETA as to when it will be ready.

Question:

We have been using the iText based PdfVeryDenseMergeTool we found in this SO question How To Remove Whitespace on Merge to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and individual PDFs also get broken out across pages when possible.

We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool that merges PDFs like this:

Individual PDFs:

Dense Merged PDF:

We are looking for something like this (this is already done in the iText-based PdfVeryDenseMergeTool, but we want to do it using PDFBox 2):

In our attempt to do the porting, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer that extends iText PDF Render Listener and does something every time a text, image, or arc is drawn in a PDF. And all the rendering info is then used to split an individual PDF across multiple pages. We tried looking for a similar PDF Render Listener in PDFBox 2 but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.

If someone can suggest an approach to move forward, we'd greatly appreciate their help.

Thanks a lot!

EDIT 7 Feb 2020

At present, we are extending PDFGraphicsStreamEngine from PDFBox to make a custom rendering engine that tracks coordinates of images, text lines, and arcs when they are drawn. That custom engine will be the port of the PageVerticalAnalyzer. After that, we are hoping to be able to port PdfVeryDenseMergeTool to PDFBox.

EDIT 8 Feb 2020

Here is a very simple port of PageVerticalAnalyzer that handles images and text. I'm a PDFBox newbie, so my logic to handle images is probably wonky. Here's the basic approach:

Text: for every glyph printed, get the bottomY and make topY = bottomY + charHeight, mark those top/bottom points.

Image: for every call to drawImage(), it looks like there are two ways to figure out where it was drawn. First is using the coords from the last call to appendRectangle() and second is using the last calls to moveTo(), multiple lineTo(), and closePath(). I give the latter one priority. If I can't find any path (I found it in one PDF, in another, before drawImage(), I only found appendRectangle()), I use the former. If none of them exist, I have no clue what to do. Here's how I'm assuming PDFBox marks image coords using moveTo()/lineTo()/closePath():

Here is my current implementation:

import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;


public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine
{
    /**
     * This is a port of iText based PageVerticalAnalyzer found here
     * https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/merge/PageVerticalAnalyzer.java
     *
     * @param page PDF Page
     */
    protected PageVerticalAnalyzer(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        File file = new File("q2.pdf");

        try (PDDocument doc = PDDocument.load(file))
        {
            PDPage page = doc.getPage(0);
            PageVerticalAnalyzer engine = new PageVerticalAnalyzer(page);
            engine.run();

            System.out.println(engine.verticalFlips);
        }
    }

    /**
     * Runs the engine on the current page.
     *
     * @throws IOException If there is an IO error while drawing the page.
     */
    public void run() throws IOException
    {
        processPage(getPage());

        for (PDAnnotation annotation : getPage().getAnnotations())
        {
            showAnnotation(annotation);
        }
    }

    // All path related stuff

    @Override
    public void clip(int windingRule) throws IOException
    {
        System.out.println("clip");
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        System.out.printf("moveTo %.2f %.2f%n", x, y);
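        // the bottom is unknown until closePath(); (Float) null would throw a
        // NullPointerException when unboxed into the float array
        lastPathBottomTop = new float[] {Float.NaN, y};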
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        System.out.printf("lineTo %.2f %.2f%n", x, y);
        lastLineTo = new float[] {x, y};
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        System.out.printf("curveTo %.2f %.2f, %.2f %.2f, %.2f %.2f%n", x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        // if you want to build paths, you'll need to keep track of this like PageDrawer does
        return new Point2D.Float(0, 0);
    }

    @Override
    public void closePath() throws IOException
    {
        System.out.println("closePath");
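        // guard against a closePath() that was not preceded by moveTo()/lineTo()
        if (lastPathBottomTop != null && lastLineTo != null) {
            lastPathBottomTop[0] = lastLineTo[1];
        }
        lastLineTo = null;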
    }

    @Override
    public void endPath() throws IOException
    {
        System.out.println("endPath");
    }

    @Override
    public void strokePath() throws IOException
    {
        System.out.println("strokePath");
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        System.out.println("fillPath");
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        System.out.println("fillAndStrokePath");
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException
    {
        System.out.println("shadingFill " + shadingName.toString());
    }

    // Rectangle related stuff

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.printf("appendRectangle %.2f %.2f, %.2f %.2f, %.2f %.2f, %.2f %.2f%n",
                p0.getX(), p0.getY(), p1.getX(), p1.getY(),
                p2.getX(), p2.getY(), p3.getX(), p3.getY());

        lastRectBottomTop = new float[] {(float) p0.getY(), (float) p3.getY()};
    }

    // Image drawing

    @Override
    public void drawImage(PDImage pdImage) throws IOException
    {
        System.out.println("drawImage");
        if (lastPathBottomTop != null) {
            addVerticalUseSection(lastPathBottomTop[0], lastPathBottomTop[1]);  
        } else if (lastRectBottomTop != null ){
            addVerticalUseSection(lastRectBottomTop[0], lastRectBottomTop[1]);
        } else {
            throw new Error("Drawing image without last reference!");
        }

        lastPathBottomTop = null;
        lastRectBottomTop = null;

    }

    // All text related stuff

    @Override
    public void showTextString(byte[] string) throws IOException
    {
        System.out.print("showTextString \"");
        super.showTextString(string);
        System.out.println("\"");
    }

    @Override
    public void showTextStrings(COSArray array) throws IOException
    {
        System.out.print("showTextStrings \"");
        super.showTextStrings(array);
        System.out.println("\"");
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                             Vector displacement) throws IOException
    {
        // print the actual character that is being rendered 
        System.out.print(unicode);

        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);

        // the text rendering matrix carries the scaling and the position of the glyph
        //System.out.println(textRenderingMatrix.toString());

        // y of the bottom of the char
        // getValue(0, 7) only worked because PDFBox 2 keeps the 3x3 matrix in a
        // flat array of nine floats; getTranslateY() reads the same element
        float yBottom = textRenderingMatrix.getTranslateY();

        // height of the char, approximated by the first matrix element
        // (the value previously read with getValue(0, 0))
        float yTop = yBottom + textRenderingMatrix.getScaleX();

        addVerticalUseSection(yBottom, yTop);
    }

    // Keeping track of bottom/top point pairs
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
    private float[] lastRectBottomTop;
    private float[] lastPathBottomTop;
    private float[] lastLineTo;

}
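
The bookkeeping in addVerticalUseSection does not depend on PDFBox, so it can be tried in isolation. The following sketch copies the method unchanged into a small stand-alone class (the class name is made up) and feeds it three sections, the third of which bridges the gap between the first two.

```java
import java.util.ArrayList;
import java.util.List;

class VerticalFlipsDemo {
    // sorted y coordinates at which the page flips between "unused" and "used"
    final List<Float> verticalFlips = new ArrayList<>();

    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i = 0, j = 0;
        for (; i < verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j = i; j < verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        // an even index means the coordinate lies outside every used interval
        boolean fromOutsideInterval = i % 2 == 0;
        boolean toOutsideInterval = j % 2 == 0;

        // drop the flips swallowed by the new section, then add its own ends
        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    public static void main(String[] args) {
        VerticalFlipsDemo demo = new VerticalFlipsDemo();
        demo.addVerticalUseSection(10, 20);
        demo.addVerticalUseSection(30, 40);
        demo.addVerticalUseSection(15, 35); // overlaps both sections and merges them
        System.out.println(demo.verticalFlips); // prints [10.0, 40.0]
    }
}
```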

I am looking for answers to the following questions:

  • How can I improve this implementation?
  • How to handle other things like curves that I have not handled?

Answer:

This answer suffers from the same issues as the original iText version does.

A port of the PageVerticalAnalyzer

One can port the PageVerticalAnalyzer as follows from iText to PDFBox:

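import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.font.PDVectorFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine {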
    protected PageVerticalAnalyzer(PDPage page) {
        super(page);
    }

    public List<Float> getVerticalFlips() {
        return verticalFlips;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            addVerticalUseSection(rect.getMinY(), rect.getMaxY());
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        Section section = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D.Float point = ctm.transformPoint(x, y);
                if (section == null)
                    section = new Section(point.y);
                else
                    section.extendTo(point.y);
            }
        }
        addVerticalUseSection(section.from, section.to);
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        subPath = null;
        Section section = new Section(p0.getY());
        section.extendTo(p1.getY()).extendTo(p2.getY()).extendTo(p3.getY());
        path.add(section); // register the rectangle so that stroke and fill operations count it
        currentPoint = p0;
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        subPath = new Section(y);
        path.add(subPath);
        currentPoint = new Point2D.Float(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        if (subPath == null) {
            subPath = new Section(y);
            path.add(subPath);
        } else
            subPath.extendTo(y);
        currentPoint = new Point2D.Float(x, y);
    }

    /**
     * Beware! This is incorrect! The control points may be outside
     * the vertically used range 
     */
    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        if (subPath == null) {
            subPath = new Section(y1);
            path.add(subPath);
        } else
            subPath.extendTo(y1);
        subPath.extendTo(y2).extendTo(y3);
        currentPoint = new Point2D.Float(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
        subPath = null;
    }

    @Override
    public void strokePath() throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // TODO Auto-generated method stub
    }

    Point2D currentPoint = null;

    List<Section> path = new ArrayList<Section>();
    Section subPath = null;

    static class Section {
        Section(double value) {
            this((float)value);
        }

        Section(float value) {
            from = value;
            to = value;
        }

        Section extendTo(double value) {
            return extendTo((float)value);
        }

        Section extendTo(float value) {
            if (value < from)
                from = value;
            else if (value > to)
                to = value;
            return this;
        }

        private float from;
        private float to;
    }

    void addVerticalUseSection(double from, double to) {
        addVerticalUseSection((float)from, (float)to);
    }

    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

(PageVerticalAnalyzer.java)

The implementation actually is similar to that of the BoundingBoxFinder from this answer. Just like there I borrowed from the PDFBox example DrawPrintTextLocations to determine text outlines.

Furthermore, the curveTo processing has the same issue as the original iText5 PageVerticalAnalyzer from this answer: the control points are treated as if they were on the actual curve, but they usually are not and can lie far outside the vertical range the curve actually uses. Instead of the path processing implemented here, one could use the corresponding AWT classes, but that may not be possible on Android etc.

And just like there this class ignores annotations, but the iText5 dense merger also ignored annotations. And this class also ignores the clip path...
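To illustrate the curveTo caveat: the vertical range a cubic Bézier segment really covers can be computed from the zeros of the derivative of its y coordinate. The following is only a sketch using plain Java; the class CubicVerticalRange and its method names are made up for this illustration and are not part of PDFBox or of the analyzer above.

```java
/** Exact vertical extent of a cubic Bezier segment (illustrative helper, hypothetical name). */
final class CubicVerticalRange {

    /** Evaluates the y coordinate of the curve at parameter t in [0, 1]. */
    static double yAt(double y0, double y1, double y2, double y3, double t) {
        double u = 1 - t;
        return u * u * u * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t * t * t * y3;
    }

    /** Returns {minY, maxY} actually reached by the curve, not by its control points. */
    static double[] range(double y0, double y1, double y2, double y3) {
        double min = Math.min(y0, y3);
        double max = Math.max(y0, y3);

        // y'(t) / 3 = a*t^2 + b*t + c, built from the differences of the control values
        double d0 = y1 - y0, d1 = y2 - y1, d2 = y3 - y2;
        double a = d0 - 2 * d1 + d2;
        double b = 2 * (d1 - d0);
        double c = d0;

        double[] roots;
        if (Math.abs(a) < 1e-12) {
            // derivative is (at most) linear
            roots = Math.abs(b) < 1e-12 ? new double[0] : new double[] { -c / b };
        } else {
            double disc = b * b - 4 * a * c;
            if (disc < 0) {
                roots = new double[0];
            } else {
                double s = Math.sqrt(disc);
                roots = new double[] { (-b + s) / (2 * a), (-b - s) / (2 * a) };
            }
        }

        // only extrema strictly inside the segment can extend the range of the end points
        for (double t : roots) {
            if (t > 0 && t < 1) {
                double y = yAt(y0, y1, y2, y3, t);
                min = Math.min(min, y);
                max = Math.max(max, y);
            }
        }
        return new double[] { min, max };
    }
}
```

Inside curveTo one could then, assuming currentPoint is set, take y0 from currentPoint.getY() and extend the sub-path to the two returned values instead of to y1 and y2.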

A port of the PdfVeryDenseMergeTool
public class PdfVeryDenseMergeTool {
    public PdfVeryDenseMergeTool(PDRectangle size, float top, float bottom, float gap)
    {
        this.pageSize = size;
        this.topMargin = top;
        this.bottomMargin = bottom;
        this.gap = gap;
    }

    public void merge(OutputStream outputStream, Iterable<PDDocument> inputs) throws IOException
    {
        try
        {
            openDocument();
            for (PDDocument input: inputs)
            {
                merge(input);
            }
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.save(outputStream);
        }
        finally
        {
            closeDocument();
        }

    }

    void openDocument() throws IOException
    {
        document = new PDDocument();
        newPage();
    }

    void closeDocument() throws IOException
    {
        try
        {
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.close();
        }
        finally
        {
            this.document = null;
            this.yPosition = 0;
        }
    }

    void newPage() throws IOException
    {
        if (currentContents != null) {
            currentContents.close();
            currentContents = null;
        }
        currentPage = new PDPage(pageSize);
        document.addPage(currentPage);
        yPosition = pageSize.getUpperRightY() - topMargin;
        currentContents = new PDPageContentStream(document, currentPage);
    }

    void merge(PDDocument input) throws IOException
    {
        for (PDPage page : input.getPages())
        {
            merge(input, page);
        }
    }

    void merge(PDDocument sourceDoc, PDPage page) throws IOException
    {
        PDRectangle pageSizeToImport = page.getCropBox();

        PageVerticalAnalyzer analyzer = new PageVerticalAnalyzer(page);
        analyzer.processPage(page);
        List<Float> verticalFlips = analyzer.getVerticalFlips();
        if (verticalFlips.size() < 2)
            return;

        LayerUtility layerUtility = new LayerUtility(document);
        PDFormXObject form = layerUtility.importPageAsForm(sourceDoc, page);

        int startFlip = verticalFlips.size() - 1;
        boolean first = true;
        while (startFlip > 0)
        {
            if (!first)
                newPage();

            float freeSpace = yPosition - pageSize.getLowerLeftY() - bottomMargin;
            int endFlip = startFlip + 1;
            while ((endFlip > 1) && (verticalFlips.get(startFlip) - verticalFlips.get(endFlip - 2) < freeSpace))
                endFlip -=2;
            if (endFlip < startFlip)
            {
                float height = verticalFlips.get(startFlip) - verticalFlips.get(endFlip);

                currentContents.saveGraphicsState();
                currentContents.addRect(0, yPosition - height, pageSizeToImport.getWidth(), height);
                currentContents.clip();
                Matrix matrix = Matrix.getTranslateInstance(0, (float)(yPosition - (verticalFlips.get(startFlip) - pageSizeToImport.getLowerLeftY())));
                currentContents.transform(matrix);
                currentContents.drawForm(form);
                currentContents.restoreGraphicsState();

                yPosition -= height + gap;
                startFlip = endFlip - 1;
            }
            else if (!first) 
                throw new IllegalArgumentException(String.format("Page %s content sections too large.", page));
            first = false;
        }
    }

    PDDocument document = null;
    PDPage currentPage = null;
    PDPageContentStream currentContents = null;
    float yPosition = 0; 

    final PDRectangle pageSize;
    final float topMargin;
    final float bottomMargin;
    final float gap;
}

(PdfVeryDenseMergeTool.java)

This essentially is a simple port of the iText 5 PdfVeryDenseMergeTool, nothing special about it.

Usage of the PdfVeryDenseMergeTool

One simply creates a PdfVeryDenseMergeTool instance with format information and then starts the merge using PDDocument instances as sources:

PDDocument document1 = ...;
...
PDDocument documentN = ...;

PdfVeryDenseMergeTool tool = new PdfVeryDenseMergeTool(PDRectangle.A4, 30, 30, 10);
tool.merge(new FileOutputStream(RESULT_FILE), Arrays.asList(document1, ..., documentN));

(DenseMerging test testVeryDenseMerging)

Question:

When I extract an image using pdfbox I am getting incorrect dpi of the image for some PDFs. When I extract an image using Photoshop or Acrobat Reader Pro I can see that the dpi of the image is 200 using windows photo viewer, but when I extract the image using pdfbox the dpi is 72.

For extracting the image I am using the code from this question: Not able to extract images from PDFA1-a format document

When I check the logs I see an unusual entry: 2015-01-23-main--DEBUG-org.apache.pdfbox.util.TIFFUtil:

     <?xml version="1.0" encoding="UTF-8"?><javax_imageio_jpeg_image_1.0>
      <JPEGvariety>
    <app0JFIF majorVersion="1" minorVersion="2" resUnits="0" Xdensity="1" Ydensity="1" thumbWidth="0" thumbHeight="0"/>
  </JPEGvariety>
  <markerSequence>
    <dqt>
      <dqtable elementPrecision="0" qtableId="0"/>
      <dqtable elementPrecision="0" qtableId="1"/>
    </dqt>
    <dht>
      <dhtable class="0" htableId="0"/>
      <dhtable class="0" htableId="1"/>
      <dhtable class="1" htableId="0"/>
      <dhtable class="1" htableId="1"/>
    </dht>
    <sof process="0" samplePrecision="8" numLines="0" samplesPerLine="0" numFrameComponents="3">
      <componentSpec componentId="1" HsamplingFactor="2" VsamplingFactor="2" QtableSelector="0"/>
      <componentSpec componentId="2" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
      <componentSpec componentId="3" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
    </sof>
    <sos numScanComponents="3" startSpectralSelection="0" endSpectralSelection="63" approxHigh="0" approxLow="0">
      <scanComponentSpec componentSelector="1" dcHuffTable="0" acHuffTable="0"/>
      <scanComponentSpec componentSelector="2" dcHuffTable="1" acHuffTable="1"/>
      <scanComponentSpec componentSelector="3" dcHuffTable="1" acHuffTable="1"/>
    </sos>
  </markerSequence>
</javax_imageio_jpeg_image_1.0>

I tried to google but I can't seem to find out what pdfbox means by this log. What does this mean?

You can download a sample pdf with this problem from this link: http://myslams.com/test/1.pdf

I have even tried itext, but it extracts the image with 96 dpi.

Am I doing something wrong? Or do pdfbox and itext have this limitation?


Answer:

After some digging I found your 1.pdf. Thus,...

PDFBox

In comments to this recent answer @Tilman and you were discussing this older answer in which @Tilman pointed towards the PrintImageLocations PDFBox example. I ran it for your file and got:

Processing page: 0
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 888px
size = 613.44, 319.68
size = 8.52in, 4.44in
size = 216.408mm, 112.776mm

Processing page: 1
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 2
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 3
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 1464px
size = 613.44, 527.04
size = 8.52in, 7.3199997in
size = 216.408mm, 185.928mm

On all pages this amounts to 200 dpi both in x and y directions (1704px / 8.52in = 888px / 4.44in = 2800px / 14.0in = 1464px / 7.32in = 200 dpi).

So PDFBox gives you the dpi values you are after.
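The arithmetic behind those numbers, as a small sketch: the resolution is the pixel count divided by the displayed size in inches, and the displayed size in inches is the size in user space units divided by 72. This assumes the default user unit of 1/72 inch; the class name ImageDpi is made up for this illustration.

```java
/** Resolution of an image as placed on a page (illustrative helper, hypothetical name). */
final class ImageDpi {

    /** Default PDF user space has 72 units per inch (assumes no UserUnit override). */
    static final double UNITS_PER_INCH = 72.0;

    /** pixels along one axis divided by the displayed length of that axis in inches */
    static double dpi(int pixels, double userUnits) {
        return pixels / (userUnits / UNITS_PER_INCH);
    }

    public static void main(String[] args) {
        // values of the first page as reported by PrintImageLocations above
        System.out.println("x: " + dpi(1704, 613.44) + " dpi");
        System.out.println("y: " + dpi(888, 319.68) + " dpi");
    }
}
```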

(@Tilman: The current 2.0.0-SNAPSHOT version of that sample returns utter nonsense; you might want to fix this.)

iText

A simplified iText version of that PDFBox example would be this:

public void printImageLocations(InputStream stream) throws IOException
{
    PdfReader reader = new PdfReader(stream);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageRenderListener listener = new ImageRenderListener();

    for (int page = 1; page <= reader.getNumberOfPages(); page++)
    {
        System.out.printf("\nPage %s:\n", page);
        parser.processContent(page, listener);
    }
}

static class ImageRenderListener implements RenderListener
{
    public void beginTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }
    public void endTextBlock() { }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfDictionary imageDict = renderInfo.getImage().getDictionary();

            float widthPx = imageDict.getAsNumber(PdfName.WIDTH).floatValue(); 
            float heightPx = imageDict.getAsNumber(PdfName.HEIGHT).floatValue();
            float widthUu = renderInfo.getImageCTM().get(Matrix.I11);
            float heightUu = renderInfo.getImageCTM().get(Matrix.I22);

            System.out.printf("Image %.0fpx*%.0fpx, %.0fuu*%.0fuu, %.2fin*%.2fin\n", widthPx, heightPx, widthUu, heightUu, widthUu/72, heightUu/72);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

(Beware: I assumed unrotated and unskewed images.)

The results for your file:

Page 1:
Image 1704px*888px, 613uu*320uu, 8,52in*4,44in

Page 2:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 3:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 4:
Image 1704px*1464px, 613uu*527uu, 8,52in*7,32in

Thus, also 200dpi all along. So iText, too, gives you the dpi values you are after.

Your code

Obviously the code you referenced had no chance to report a dpi value sensible in the context of the PDF because it only extracts the images as found in the resources but ignores how the respective image resource is used on the page.

An image resource can be stretched, rotated, skewed, ... any way the author likes when using it in the page content.

BTW, a dpi value only makes sense if the author did not skew the image and rotated it only by a multiple of 90°.
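If rotation has to be taken into account, the displayed edge lengths can be derived from the whole linear part of the image matrix rather than from two entries only. The sketch below uses plain Java and takes the first four numbers a, b, c, d of a PDF matrix [a b c d e f] as input (the iText code above reads only a and d, via Matrix.I11 and Matrix.I22). The class PlacedImageDpi and its methods are hypothetical names, and the default user unit of 1/72 inch is assumed.

```java
/** Resolution of an image placed with an arbitrary matrix (illustrative helper, hypothetical name). */
final class PlacedImageDpi {

    /**
     * The image occupies the unit square in image space; the matrix maps its bottom edge
     * to the vector (a, b) and its left edge to the vector (c, d) in user space.
     * Returns {horizontal dpi, vertical dpi} relative to the image's own axes.
     */
    static double[] dpi(int widthPx, int heightPx, double a, double b, double c, double d) {
        double widthUu = Math.hypot(a, b);   // displayed length of the bottom edge
        double heightUu = Math.hypot(c, d);  // displayed length of the left edge
        return new double[] { widthPx * 72.0 / widthUu, heightPx * 72.0 / heightUu };
    }

    /** The edges stay perpendicular only if their dot product vanishes. */
    static boolean isSkewed(double a, double b, double c, double d) {
        double dot = a * c + b * d;
        return Math.abs(dot) > 1e-6 * Math.hypot(a, b) * Math.hypot(c, d);
    }
}
```

For a skewed image the two values still describe the sampling density along the image's edges, but they no longer correspond to a horizontal and a vertical resolution on the page, so isSkewed should be checked first.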

Question:

How can we extract text content from a PDF file? We are using pdfbox to extract the text, but the output also includes the header and footer, which we do not need. I am using the following Java code.

PDFTextStripper stripper = null;
try {
    stripper = new PDFTextStripper();
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
stripper.setStartPage(pageCount);
stripper.setEndPage(pageCount);
try {
    String pageText = stripper.getText(document);
    System.out.println(pageText);
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

Answer:

You have tagged this as an itext/itextpdf question, yet you are using PdfBox. That's confusing.

You also claim that your PDF file has headers and footers. This would imply that your PDF is a Tagged PDF and that the header and the footer are marked as artifacts. If that is the case, then you should take advantage of the Tagged nature of the PDF and extract the content as is done in the ParseTaggedPdf example:

TaggedPdfReaderTool readertool = new TaggedPdfReaderTool();
PdfReader reader = new PdfReader(StructuredContent.RESULT);
readertool.convertToXml(reader, new FileOutputStream(RESULT));
reader.close();

If this doesn't result in anything, you clearly don't have a Tagged PDF in which case there are no headers and footers in your document from a technical point of view. You may see headers and footers with your human eyes, but that doesn't mean that a machine sees these headers and footers. To a machine, it's just text like any other text in the page.

The ExtractPageContentArea example shows how we can define a rectangle that excludes the header and the footer when parsing for the content.

PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();

In this case, we have examined the document manually and noticed that the actual text is always added inside the rectangle new Rectangle(70, 80, 490, 580). The header is added above Y coordinate 580 and the footer below Y coordinate 80. By using the RegionTextRenderFilter, we extract only the content that overlaps the rectangle we have defined.
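Since the question uses PDFBox: PDFBox offers PDFTextStripperByArea, which takes its regions as java.awt.geom.Rectangle2D. To my understanding that class measures y from the top of the page, whereas the iText rectangle above is given as lower-left and upper-right corners with y measured from the bottom. The following sketch only shows that conversion with plain Java; the class RegionConversion is a made-up name, and the page height of 842 in the usage note is an assumption (A4), not a value from the document discussed here.

```java
import java.awt.geom.Rectangle2D;

/** Converts a PDF-style rectangle to a top-left-origin rectangle (illustrative helper, hypothetical name). */
final class RegionConversion {

    /**
     * @param llx, lly   lower left corner, y measured from the page bottom
     * @param urx, ury   upper right corner, y measured from the page bottom
     * @param pageHeight height of the page in the same units
     * @return rectangle as (x, y, width, height) with y measured from the page top
     */
    static Rectangle2D.Double toTopLeftOrigin(double llx, double lly, double urx, double ury, double pageHeight) {
        return new Rectangle2D.Double(llx, pageHeight - ury, urx - llx, ury - lly);
    }
}
```

For example, RegionConversion.toTopLeftOrigin(70, 80, 490, 580, 842) yields a region starting 262 units below the top edge, 420 units wide and 500 units high. Whether the result matches the text area should be verified against the actual page, in particular if the page has a crop box that does not start at the origin.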

Question:

So, in this application we're using iText to fill out PDF forms and PDFBox to load the filled-out PDF and convert it to an image in our system.

The problem is when the image is converted. All the information is there, but the checkboxes are... weird? Instead of the styled checkbox "check mark" that is set on the PDF, the checkboxes get a weird "empty box" inside of them.

How it is supposed to be:

PDFBox version is 2.0.11, iText version is 5.5.13.

Here is a little snippet of the code where the conversion happens:

PDDocument pdf = PDDocument.load(byteArrayInputStream);
PDFRenderer renderer = new PDFRenderer(pdf);
BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
PDPage page = null;
BufferedImage image = null;
for (int i = 0; i < images.length; i++) {
        try {
            image = renderer.renderImageWithDPI(i, 300,org.apache.pdfbox.rendering.ImageType.RGB);
            ...

I'm kind of sensing a "loss of quality" too after the conversion. Before, we were using PDFBox 1.8 and the conversion quality was low and it was losing some font formatting and style. Since the upgrade it got better, but is still bugged.

Where the filling happens:

PdfReader reader = new PdfReader(filePath);

ByteArrayOutputStream lStr = new ByteArrayOutputStream();
PdfStamper stamper = new PdfStamper(reader, lStr);
AcroFields acroFields = stamper.getAcroFields();

for (Entry<String, Item> map : acroFields.getFields().entrySet()) {
    String key = map.getKey();

    if (!fields.has(key))
        continue;

    if (fields.isNull(key))
        continue;

    acroFields.setField(key, fields.getString(key), true);
}
stamper.setFormFlattening(true);

stamper.close();
reader.close();

...

Do you guys know what this is?

Thanks!


Answer:

Got it working thanks to Tilman Hausherr's suggestion. The problem was indeed that fonts were missing on the server running the application (Zapf Dingbats and/or MS Gothic).

Installing the missing fonts in a directory "./fonts" or "/usr/share/fonts" (Linux) / "/Windows/Fonts" (Windows) did the trick!
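To check quickly whether a server has such font files at all, one can scan the font directories for matching file names. This is only a rough diagnostic sketch in plain Java: the class FontFileCheck is a made-up name, it does not use PDFBox's own font lookup, and matching by file name is a heuristic because font files are not always named after the font they contain.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Looks for font files by name in a list of directories (illustrative helper, hypothetical name). */
final class FontFileCheck {

    /** True if the file name contains the fragment (ignoring case) and has a common font extension. */
    static boolean looksLikeFontFile(String fileName, String nameFragment) {
        String name = fileName.toLowerCase(Locale.ROOT);
        boolean fontExtension = name.endsWith(".ttf") || name.endsWith(".ttc")
                || name.endsWith(".otf") || name.endsWith(".pfb");
        return fontExtension && name.contains(nameFragment.toLowerCase(Locale.ROOT));
    }

    /** Walks the given directories recursively; directories that do not exist are skipped. */
    static List<Path> findFontFiles(List<Path> directories, String nameFragment) {
        List<Path> hits = new ArrayList<>();
        for (Path dir : directories) {
            if (!Files.isDirectory(dir)) {
                continue; // directory not present on this machine
            }
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile)
                     .filter(p -> looksLikeFontFile(p.getFileName().toString(), nameFragment))
                     .forEach(hits::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // directories as named in the answer above
        List<Path> dirs = Arrays.asList(Paths.get("/usr/share/fonts"), Paths.get("/Windows/Fonts"));
        System.out.println(findFontFiles(dirs, "gothic"));
    }
}
```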

Question:

I have a PDF form. How can I check with PDFBox v2.x.x whether the form was changed after the signature was added? I think the equivalent in iText 4.2.1 is the signatureCoversWholeDocument method. The only thing I could find was how to check the signature itself:

        if (signerInformation.verify(new JcaSimpleSignerInfoVerifierBuilder().build(certificateHolder))) {

Answer:

No perfect answer, but a strategy to make a decision:

try (PDDocument document = PDDocument.load(new File(infile), password))
{
    for (PDSignature sig : document.getSignatureDictionaries())
    {
        int[] byteRange = sig.getByteRange();
        System.out.println("byteRange: " + Arrays.toString(byteRange));
        System.out.println("Range max: " + (byteRange[byteRange.length-2] + byteRange[byteRange.length-1]));
        // multiply content length with 2 (because it is in hex in the PDF) and add 2 for < and >
        System.out.println("Content len: " + (sig.getCOSObject().getString(COSName.CONTENTS).length() * 2 + 2));
        System.out.println("File len: " + new File(infile).length());
(...)

Now test this with this file. You'll get this output:

byteRange: [0, 192, 10094, 162062]
Range max: 172156
Content len: 9902
File len: 172156

The signed data starts at 0 with length 192, then comes the signature, then the rest of the data at 10094 with length 162062. You'll notice that 10094 + 162062 == 172156 and 192 + 9902 == 10094.

Of course if there are several signatures it won't look that perfect.
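The comparison of the printed numbers can be put into a small check. This is a sketch in plain Java working on the values printed above, so it has no PDFBox dependency; the class ByteRangeCheck is a made-up name. Passing this check is a necessary condition only: it does not prove that the gap really holds the Contents value of that very signature.

```java
/** Checks whether a signature byte range spans the whole file (illustrative helper, hypothetical name). */
final class ByteRangeCheck {

    /**
     * @param byteRange      the four ByteRange values: start1, length1, start2, length2
     * @param contentsLength length of the hex-encoded Contents value including the enclosing < and >
     * @param fileLength     length of the file in bytes
     */
    static boolean coversWholeFile(int[] byteRange, long contentsLength, long fileLength) {
        if (byteRange == null || byteRange.length != 4) {
            return false;
        }
        long start1 = byteRange[0], length1 = byteRange[1];
        long start2 = byteRange[2], length2 = byteRange[3];
        return start1 == 0                                       // signed data starts at the file start
                && start1 + length1 + contentsLength == start2   // the gap is exactly the Contents value
                && start2 + length2 == fileLength;               // signed data ends at the file end
    }
}
```

With the output above, coversWholeFile(new int[] {0, 192, 10094, 162062}, 9902, 172156) is true. If data had been appended after signing, the file would be longer than 172156 bytes and the check would fail for that signature.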

Question:

I am investigating Java PDF libraries.

I have tried

org.apache.pdfbox

File file = new File("file.pdf");
PDDocument document = PDDocument.load(file);

// Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();

// Retrieving text from PDF document
String text = pdfStripper.getText(document);
System.out.println(text);

// Closing the document
document.close();

com.itextpdf.text.pdf

public static final String SRC = "file.pdf";
public static final String DEST = "streams";

public static void main(final String[] args) throws IOException {
    File file = new File(DEST);
    new BruteForce().parse(SRC, DEST);
}

public void parse(final String src, final String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);

        if ((obj != null) && obj.isStream()) {
            PRStream stream = (PRStream) obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            } catch (UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            FileOutputStream fos = new FileOutputStream(String.format(dest, i));
            fos.write(b);
            fos.flush();
            fos.close();
        } else {
            final PdfDictionary pdfDictionary = (PdfDictionary) obj;

            System.out.println("\t>>>>> " + pdfDictionary + "\t\t" + pdfDictionary.getKeys());

            final Set<PdfName> pdfNames = pdfDictionary.getKeys();

            for (final PdfName pdfName : pdfNames) {
                final PdfObject pdfObject = pdfDictionary.get(pdfName);
                final int type = pdfObject.type();
                switch (type) {
                case PdfObject.NULL:
                    System.out.println("\t NULL " + pdfObject);
                    break;
                case PdfObject.BOOLEAN:
                    System.out.println("\t BOOLEAN " + pdfObject);
                    break;
                case PdfObject.NUMBER:
                    System.out.println("\t NUMBER " + pdfObject);
                    break;
                case PdfObject.STRING:
                    System.out.println("\t STRING " + pdfObject);
                    break;
                case PdfObject.NAME:
                    System.out.println("\t NAME " + pdfObject);
                    break;
                case PdfObject.ARRAY:
                    System.out.println("\t ARRAY " + pdfObject);
                    break;
                case PdfObject.DICTIONARY:
                    System.out.println("\t DICTIONARY " + ((PdfDictionary)pdfObject).getKeys());
                    break;
                case PdfObject.STREAM:
                    System.out.println("\t STREAM " + pdfObject);
                    break;
                case PdfObject.INDIRECT:
                    System.out.println("\t INDIRECT " +pdfObject.getIndRef());
                    break;
                default:

                }
                System.out.println("\t\t--- " + pdfObject.type());
            }
        }
    }
}

com.snowtide.pdf

String pdfFilePath = "file.pdf";

    Document pdf = PDF.open(pdfFilePath);

    final List<Annotation> annotations = pdf.getAllAnnotations();
    for (final Annotation annotation : annotations) {
        System.out.println(annotation.pageNumber());
    }

    System.out.println(pdf.getAttributeMap());
    System.out.println(pdf.getAttributeKeys());
    System.out.println("=============================");

    StringBuilder text = new StringBuilder(1024);
    pdf.pipe(new OutputTarget(text));
    pdf.close();
    System.out.println(text);

I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.

Can PDF documents contain "unreachable" content?

Is there no way to extract ALL content from a PDF file?

UPDATE

Thinking the "watermark" was an image, I tried this code:

File fileW = new File("file.pdf");
PDDocument document = PDDocument.load(fileW);
PDPageTree list = document.getPages();
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {

        System.out.println("????? ::>>>" + c);

        PDXObject o = pdResources.getXObject(c);
        if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
            File file = new File("Temp/" + System.nanoTime() + ".png");
            ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file);
        } else {

        }
    }
}

The PDF does contain images of the authors, however the "watermark" is not reached with this approach.


Answer:

The page content streams of the example document provided by the OP have the following structure from page 2 onward:

  1. A textual header line "www.electrophoresis-journal.com Page X Electrophoresis":

    BT
    /F1 9.12 Tf
    1 0 0 1 72.024 798.46 Tm
    /GS7 gs
    0 g
    0 G
    [(w)11(w)11(w)11(.)-12(e)-2(l)15(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)-2(s)21(i)-10(s)] TJ
    ET
    [...]
    BT
    1 0 0 1 441.53 798.46 Tm
    [(E)6(l)-10(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)23(s)-5(i)15(s)] TJ
    ET
    BT
    1 0 0 1 497.47 798.46 Tm
    [( )] TJ
    ET
    BT
    1 0 0 1 72.024 787.9 Tm
    [( )] TJ
    ET
    

    This text can easily be extracted using normal iText or PDFBox text extraction.

  2. A textual multi-line footer "Received: ... All rights reserved."

    BT
    1 0 0 1 72.024 109.7 Tm
    [(R)9(e)-2(c)23(e)-2(i)-10(v)26(e)-2(d:)41( )] TJ
    ET
    [...]
    BT
    1 0 0 1 72.024 47.76 Tm
    [(T)6(hi)-10(s)21( )-12(a)23(r)-8(t)15(i)-10(c)23(l)-10(e)23( )13(i)-10(s)21( )-12(pr)-8(o)26(t)15(e)-2(c)23(t)-10(e)-2(d)26( )-12(by)53( )-12(c)-2(o)26(p)-25(y)53(r)-8(i)-10(g)26(ht)-10(.)-12( )-12(A)38(l)-10(l)15( )13(r)-8(i)-10(g)26(ht)15(s)-5( )13(r)-8(e)23(s)-5(e)-2(r)-8(v)26(e)-2(d)26(.)] TJ
    ET
    BT
    1 0 0 1 278.52 47.76 Tm
    [( )] TJ
    ET
    BT
    1 0 0 1 72.024 37.2 Tm
    [( )] TJ
    ET
    

    This text also can easily be extracted using normal iText or PDFBox text extraction.

  3. A set of PDF path creation and filling operations using a custom graphics state forming the transparent "Accepted Article" writing on the left of the page:

    /GS8 gs
    0 g
    39.605 266.51 m
    39.605 261.29 39.605 256.06 39.605 250.84 c
    42.197 249.94 44.776 248.99 47.367 248.09 c
    49.296 247.41 50.704 247.08 51.649 247.08 c
    52.413 247.08 53.058 247.38 53.609 247.97 c
    54.191 248.54 54.548 249.82 54.729 251.77 c
    55.18 251.77 55.624 251.77 56.075 251.77 c
    56.075 247.51 56.075 243.26 56.075 239.02 c
    55.624 239.02 55.18 239.02 54.729 239.02 c
    54.36 240.72 53.903 241.8 53.314 242.3 c
    52.144 243.3 49.809 244.47 46.247 245.67 c
    32.719 250.33 19.286 255.25 5.7645 259.91 c
    5.7645 260.26 5.7645 260.61 5.7645 260.95 c
    19.43 265.57 33.014 270.43 46.679 275.05 c
    49.984 276.16 52.075 277.24 53.064 278.15 c
    54.053 279.06 54.623 280.36 54.729 282 c
    55.18 282 55.624 282 56.075 282 c
    56.075 276.68 56.075 271.35 56.075 266.03 c
    55.624 266.03 55.18 266.03 54.729 266.03 c
    54.623 267.64 54.303 268.75 53.753 269.31 c
    53.202 269.88 52.519 270.15 51.718 270.15 c
    50.666 270.15 48.97 269.75 46.679 268.95 c
    44.319 268.15 41.971 267.31 39.605 266.51 c
    h
    36.92 265.67 m
    30.284 263.43 23.686 261.05 17.045 258.81 c
    23.686 256.5 30.284 254.07 36.92 251.77 c
    36.92 256.4 36.92 261.04 36.92 265.67 c
    h
    f*
    [...]
    35.361 630.34 m
    40.294 630.31 44.156 631.32 46.967 633.29 c
    49.784 635.27 51.18 637.63 51.18 640.31 c
    51.18 642.1 50.573 643.67 49.364 645 c
    48.156 646.3 46.141 647.43 43.236 648.31 c
    43.48 648.62 43.712 648.93 43.962 649.24 c
    47.261 648.83 50.253 647.57 52.989 645.6 c
    55.731 643.62 57.089 641.06 57.089 638.05 c
    57.089 634.76 55.549 631.92 52.413 629.63 c
    49.302 627.3 45.158 626.1 39.899 626.1 c
    34.203 626.1 29.802 627.33 26.585 629.71 c
    23.405 632.07 21.834 635.12 21.834 638.73 c
    21.834 641.8 23.048 644.34 25.496 646.28 c
    27.981 648.22 31.267 649.24 35.361 649.24 c
    35.361 642.94 35.361 636.64 35.361 630.34 c
    h
    33.258 630.34 m
    33.258 634.56 33.258 638.78 33.258 643 c
    31.117 642.91 29.633 642.7 28.763 642.37 c
    27.417 641.87 26.341 641.14 25.571 640.16 c
    24.801 639.19 24.406 638.13 24.406 637.06 c
    24.406 635.42 25.158 633.91 26.729 632.64 c
    28.306 631.34 30.466 630.55 33.258 630.34 c
    h
    f*
    

    (The instructions I quoted draw the initial 'A' and the final 'e'.)

    This writing cannot be extracted using normal iText or PDFBox text extraction as it is neither drawn using text instructions nor marked with an ActualText entry. (The latter could be recognized using customized iText or PDFBox text extraction.)

    But you can extract this writing as the sequence of path creation and drawing commands it consists of using an implementation of the iText ExtRenderListener interface or a subclass of the PDFBox PDFGraphicsStreamEngine.

  4. The actual text content of the article, opaque, using text drawing instructions, e.g.

    BT
    /F2 10.08 Tf
    1 0 0 1 72.024 760.78 Tm
    /GS7 gs
    0 g
    [(H)-7(I)8(G)16(H)-7( )-106(TH)-6(R)32(O)-7(U)8(G)16(H)-7(P)16(U)8(T )-106(M)-7(U)8(LTI)] TJ
    ET
    BT
    1 0 0 1 212.98 760.78 Tm
    [(-)] TJ
    ET
    BT
    1 0 0 1 216.1 760.78 Tm
    [(O)-7(R)8(G)-7(A)8(N)32( )-130(M)15(ETA)32(BO)-6(LO)16(M)-7(I)8(C)8(S)8( )-130(I)8(N)8( )-106(TH)-6(E)24( )-130(A)8(P)16(P)16(/)-7(P)16(S)8(1 )-106(M)-7(O)-7(U)8(S)8(E)24( )-130(M)15(O)-7(D)8(EL)24( )-106(O)-7(F)16( )] TJ
    ET
    

    This text also can easily be extracted using normal iText or PDFBox text extraction.

Concerning the OP's questions, therefore,

I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.

Can PDF documents contain "unreachable" content?

That content is not "unreachable", it merely is not text drawn using text drawing instructions but instead text drawn like an arbitrary shape.

Is there no way to extract ALL content from a PDF file?

You can extract that content, merely not as text but instead as a collection of path creation and drawing instructions. Whenever you suspect such instructions to actually draw letter shapes, you can try to determine the text by rendering these paths as a bitmap and applying OCR.
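As a rough illustration of the "render these paths as a bitmap" step, the following plain Java sketch turns a few path instructions like the ones quoted above into a java.awt.geom.Path2D and fills it into a BufferedImage that could then be handed to an OCR engine. It is not a content stream parser: it only understands m, l, c and h with numeric operands, ignores the current transformation matrix and graphics state, and expects well-formed input (a path must start with m). The class PathSnippetRenderer and its methods are made-up names, and the OCR step itself is not shown.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.awt.image.BufferedImage;
import java.util.ArrayDeque;
import java.util.Deque;

/** Renders a snippet of PDF path instructions to a bitmap (illustrative helper, hypothetical name). */
final class PathSnippetRenderer {

    /** Builds a path from whitespace-separated m / l / c / h instructions; other operators are skipped. */
    static Path2D.Double parse(String instructions) {
        Path2D.Double path = new Path2D.Double(Path2D.WIND_EVEN_ODD); // f* fills with the even-odd rule
        Deque<Double> operands = new ArrayDeque<>();
        for (String token : instructions.trim().split("\\s+")) {
            try {
                operands.addLast(Double.parseDouble(token));
                continue;
            } catch (NumberFormatException e) {
                // not a number, so the token is treated as an operator
            }
            double[] v = new double[operands.size()];
            for (int i = 0; i < v.length; i++) {
                v[i] = operands.removeFirst();
            }
            switch (token) {
                case "m": if (v.length == 2) path.moveTo(v[0], v[1]); break;
                case "l": if (v.length == 2) path.lineTo(v[0], v[1]); break;
                case "c": if (v.length == 6) path.curveTo(v[0], v[1], v[2], v[3], v[4], v[5]); break;
                case "h": path.closePath(); break;
                default: break; // painting operators such as f* are ignored, the path is filled in render
            }
        }
        return path;
    }

    /** Fills the path in black on white, flipping the y axis because PDF y coordinates grow upwards. */
    static BufferedImage render(Path2D path, double scale, int margin) {
        Rectangle2D b = path.getBounds2D();
        int w = (int) Math.ceil(b.getWidth() * scale) + 2 * margin;
        int h = (int) Math.ceil(b.getHeight() * scale) + 2 * margin;
        BufferedImage image = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, w, h);
            AffineTransform at = new AffineTransform();
            at.translate(margin, h - margin);   // applied last: move into the image
            at.scale(scale, -scale);            // applied second: scale and flip y
            at.translate(-b.getX(), -b.getY()); // applied first: move the path to the origin
            g.setColor(Color.BLACK);
            g.fill(at.createTransformedShape(path));
        } finally {
            g.dispose();
        }
        return image;
    }

    /** Counts pixels darker than mid gray, e.g. to see whether anything was drawn. */
    static int countDarkPixels(BufferedImage image) {
        int n = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if ((image.getRGB(x, y) & 0xFF) < 128) {
                    n++;
                }
            }
        }
        return n;
    }
}
```

Fed with the second sub-path quoted above (the one starting with 36.92 265.67 m), parse yields the inner triangle of the 'A'; the result of render can be written to a file with javax.imageio.ImageIO for inspection. For real documents a subclass of the PDFBox PDFGraphicsStreamEngine or an iText ExtRenderListener, as mentioned above, is the appropriate source of the path data.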

Question:


Answer:

I found the solution.

According to the PDF specification, 'a text annotation represents a "sticky note" attached to a point in the PDF document.' Thus, neither the class PDAnnotationTextMarkup nor the subtype SUB_TYPE_POLYGON appears to match your requirements. Instead, you should use the PDAnnotationText class. As an aside, PDAnnotationTextMarkup is documented (JavaDocs) to be the abstract class that represents a text markup annotation. While it is not actually declared abstract, that characterization should make clear that it probably does not work without further ado.

So I used the code below, and it worked like magic for me:

PDRectangle position = new PDRectangle();
position.setUpperRightX(textPosition.getX());
position.setUpperRightY(ph - textPosition.getY());

position.setLowerLeftX(textPosition.getX()-4);
position.setLowerLeftY(ph - textPosition.getY());
PDGamma colourBlue = new PDGamma();
colourBlue.setB(1);

PDAnnotationText text = new PDAnnotationText();
text.setContents(commentNameWithComments.get(word));
text.setRectangle(position);
text.setOpen(true);
text.setConstantOpacity(50f);

assert annotations != null;
annotations.add(text);
page1.setAnnotations(annotations);
replaceText(word);

It might be useful for future devs :-)

Question:

Can I modify the font of an existing PDF? I have some Type 3 (custom) fonts without any font descriptor. I would like to replace those with some meaningful font. How can I do this using iText or PDFBox?


Answer:

A Type3 font is also known as a user defined font. Characters such as a, b, c,... are mapped to glyphs that are defined by a person. For instance a corresponds with the Apple logo, b corresponds with a glyph shaped as a banana, c looks like a coconut.

  • The shape of the apple, banana, coconut,... is stored in the PDF using PDF syntax. A PDF viewer such as Adobe Reader can draw the apple, banana, coconut,... by executing the font program that in this case consists of PDF path-constructing and path-painting operators and operands.
  • A PDF viewer doesn't have the intelligence to recognize these shapes as being the representation of an apple, banana, coconut,... A PDF viewer only knows that the user mapped these glyphs to characters such as a, b, c,... See also my answer to the question Why can't I extract text added using a Type3 font correctly?

Sometimes, people will match characters and glyphs, the way I did when I created a Type3 font for the iText logo: read Creating the iText logo with a Type3 font. However, you shouldn't count on that. Any character can be mapped to any glyph.

Now that you know this, you should understand that you're trying to achieve something that is impossible. Suppose that you have a PDF with a Type3 font where the a character corresponds with an apple, the b character with a banana and the c character with a coconut, it won't be possible to automatically replace such a user-defined font with a custom encoding with another "normal" font that doesn't know how to draw apples, bananas and coconuts, and that uses a completely different encoding.

Question:

Assume my user went to a scanner in their office. The scanner is capable of generating a PDF of the scanned document. This is essentially the type of file that I have.

What I want to do is extract the text from this PDF. This is not a "first generation" pdf in the sense that the text is not embedded into the pdf. The text is embedded in the image that is in the PDF.

Is there functionality in iText or PDFBox that allows for this data to be retrieved? I am trying to avoid doing OCR on the image if possible. I was hoping there was something built into iText or PDFBox that does this.

Note that I am not talking about extracting "normal" text from a PDF as is outlined here: How to get raw text from pdf file using java


Answer:

Ok, after some looking around, there doesn't seem to be a way to do this specifically with iText or PDFBox, but it looks like PDFBox does have a plugin for third-party software that can accomplish what you need. If that is of interest, links are here and here, sourced from here (from @TilmanHausherr).

Question:

We have a requirement from the courts to submit PDFs with embedded data in XML format, such that the XML is in an XFA form in the PDF.

We generate our PDF from Jasper Reports, then generate the XML that we want to add. The PDF generated by Jasper Reports does not have an XFA form to import the XML into.

How do I get the PDF to have an XFA Form created in it so that I can then add the xml of the data to it?


Answer:

Thanks @BrunoLowagie Unfortunately, the format/rules of what we can use is out of our hands. The PDF is defined by the US Courts. But a co-worker was able to get me a copy of the PDF from the US Courts system that is an "interactive" form that is XFA, so I was able to get the data into the form with all the great work that you guys have done with iText.

Using the following code. All gotten from your great book.

AcroFields form = stamper.getAcroFields();
XfaForm xfa = form.getXfa();
xfa.fillXfaForm(new ByteArrayInputStream(xmlSchemaStream.toByteArray()));

For anyone reading this in the future: if you want to add XFA to your PDF via Java code, the ONLY current solution is Adobe LiveCycle. The app itself is ONLY available on Windows, and you have to install it to get the jar files to use. Oh wait, it gets even better: you include the client jar files in your application, which means, yes, you guessed it, that you have to install a server application and run it in JBoss, WebLogic or WebSphere.

So in the end it will cost way too much money to set up, manage and maintain for something as stupid as adding an XFA form to your PDF.

Question:

I am reading the text from a page of a pdf document using iText. There are two exactly same lines in PDF, but the output after parsing is different for both lines. What could be the reason for the iText lib to spit out the text differently? Length of both lines (strings) are same.

iText methods used:

String text = PdfTextExtractor.getTextFromPage(reader, 1);

When I inspect 'text' element, the output is as below. However, these three lines seem to be exactly identical in the pdf.

XXXXXX XXXXX XXXXX : XXXXX :
#*2 1
XXXXXX XXXXX XXXXX : #*3 XXXXX : 2
XXXXXX XXXXX XXXXX : XXXXX :
#15 1

EDIT: Extra Question: When I used PDFBox, the parsed output is very different. Why is there a difference in the output text when using iText vs PDFBox?


Answer:

While in the screenshot the rows look like they are at a constant level each,

they actually are not. The 'XXX...:' and 'TOTAL :' parts are at y coordinates 469.45, 457.95, and 446.45 while the '#..', '1', and '2' parts are at y coordinates 468.65, 457.15, and 445.65.

To consider horizontal text to be on the same line, iText text extraction using the default text extraction strategy (LocationTextExtractionStrategy) requires the y coordinates to be the same after casting to int. (Actually this is somewhat simplified, for the whole picture look at LocationTextExtractionStrategy.TextChunkLocationDefaultImp)

In the case at hand this only is the case for the middle row, (int) 457.95 = 457 = (int) 457.15. Thus, default text extraction results in:

XXXXXX XXXXX XXXXX : TOTAL :
#*2 1
XXXXXX XXXXX XXXXX : #*3 TOTAL : 2
XXXXXX XXXXX XXXXX: TOTAL :
#15 1
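
The integer-cast comparison can be checked by hand with the coordinates quoted above. The following is a minimal sketch in plain Java that involves no iText at all; the class and method names are made up for illustration, and it only models the simplified rule stated above, not the full logic of the strategy class:

```java
// Illustration only: models the simplified "same line if (int) y matches" rule.
// This is not iText code; IntCastLineCheck and sameLine are hypothetical names.
class IntCastLineCheck {

    // Two chunks count as being on the same line when their y coordinates
    // truncate to the same int value.
    static boolean sameLine(double yLabel, double yValue) {
        return (int) yLabel == (int) yValue;
    }

    public static void main(String[] args) {
        double[] labelY = {469.45, 457.95, 446.45}; // 'XXX...:' and 'TOTAL :' parts
        double[] valueY = {468.65, 457.15, 445.65}; // '#..', '1', and '2' parts
        for (int i = 0; i < labelY.length; i++) {
            System.out.println("row " + (i + 1) + ": same line? "
                    + sameLine(labelY[i], valueY[i]));
        }
    }
}
```

Only the middle row passes the check, which matches the extraction result shown above.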

In such situations you need a text extraction strategy which recognizes lines differently. If you e.g. use the HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 (depending on your iText version, the former one for up to iText 5.5.8, the latter one for newer iText 5.5.x versions) from this answer, you'll get:

XXXXXX XXXXX XXXXX : #*2 TOTAL : 1
XXXXXX XXXXX XXXXX : #*3 TOTAL : 2
XXXXXX XXXXX XXXXX: #15 TOTAL : 1

(Tested using TextExtraction.java test method testTest_pdf())


By the way, this does not mean that one should switch to HorizontalTextExtractionStrategy2 by default. This method has its disadvantages, too; in particular, it looks at the whole page width (or at least the whole page section width if extracting by filter) to find lines. Thus, if your page e.g. has two columns of text next to each other and lines are at the same approximate height only per column, this strategy will likely return utter garbage.

Addendum

The OP asked in a comment

Can you give me a brief explanation of what the HorizontalTextExtractionStrategy is doing?

While scanning the page, this strategy merely collects the text chunks from the text drawing instructions with their bounding box coordinates.

When asked for the resulting text, it in a first pass projects all these bounding boxes onto the y axis of the page coordinate system.

In the second pass it interprets each connected component of the image of this projection as the range of y coordinate of a single line: It iterates over these connected components top-to-bottom; for each component it takes all chunks projected into it, sorts them by their x coordinate, adds spaces where appropriate, and merges them to a text line.

Finally it returns the concatenation of these lines (with line feeds in-between).
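
The passes described above can be sketched in plain Java. This is an assumption-laden illustration, not the strategy class itself: Chunk and its fields are hypothetical stand-ins for the collected text chunks, and a single space replaces the "spaces where appropriate" logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the projection idea; ProjectionLineSketch and Chunk are hypothetical.
class ProjectionLineSketch {

    static final class Chunk {
        final String text;
        final double x;     // left edge of the bounding box
        final double yMin;  // bottom of the bounding box
        final double yMax;  // top of the bounding box

        Chunk(String text, double x, double yMin, double yMax) {
            this.text = text;
            this.x = x;
            this.yMin = yMin;
            this.yMax = yMax;
        }
    }

    static String extract(List<Chunk> chunks) {
        // Order the y projections top-to-bottom by their upper end.
        List<Chunk> byTop = new ArrayList<>(chunks);
        byTop.sort(Comparator.comparingDouble((Chunk c) -> c.yMax).reversed());

        // Merge overlapping y intervals; each connected component is one line.
        List<List<Chunk>> lines = new ArrayList<>();
        double currentBottom = Double.NaN;
        for (Chunk c : byTop) {
            if (lines.isEmpty() || c.yMax < currentBottom) {
                lines.add(new ArrayList<>());  // gap in the projection: new line
                currentBottom = c.yMin;
            } else {
                currentBottom = Math.min(currentBottom, c.yMin);
            }
            lines.get(lines.size() - 1).add(c);
        }

        // Sort each line by x and join the lines with line feeds.
        StringBuilder result = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            if (result.length() > 0) result.append('\n');
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) result.append(' ');
                result.append(line.get(i).text);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Bounding boxes are invented: baseline - 2 to baseline + 8.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("1", 200, 466.65, 476.65),
                new Chunk("XXX :", 10, 467.45, 477.45),
                new Chunk("#*3", 60, 455.15, 465.15),
                new Chunk("YYY :", 10, 455.95, 465.95));
        System.out.println(extract(chunks));
    }
}
```

Because whole bounding boxes are compared instead of truncated baseline coordinates, chunks whose baselines differ by a fraction of a unit still end up on the same line.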

LocationTextExtractionStrategy says "This renderer keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation. Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance, but different parallel distance is treated as being on the same line." That does not make a lot of sense to me.

Essentially it also is a two-pass strategy, in a first pass collecting all text chunks with coordinates and in a second one arranging them as lines. This strategy, though, takes the orientation of the baseline of the chunks into account and first sorts by the angle of baseline.

Among the chunks with the same baseline angle it considers chunks to belong to the same text line if their (bounded) baselines are on the same (unbounded) line.

The chunks considered to belong to the same text line then are sorted in the direction of the writing orientation and spaces are inserted where appropriate.

The comparisons made by this strategy all are based on int values and so allow for a tiny bit of variance.
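
As a rough model of this ordering, the sketch below sorts chunks by baseline angle, then by the truncated distance of the baseline from the origin, then by position along the writing direction. It is a simplification written for this explanation, not the iText source: Chunk, its fields, and the scaling of the angle are assumptions, baselines are assumed to have non-zero length, and a single space stands in for the real spacing logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model only; BaselineOrderSketch and Chunk are hypothetical.
class BaselineOrderSketch {

    static final class Chunk {
        final String text;
        final double startX, startY, endX, endY; // baseline start and end points

        Chunk(String text, double startX, double startY, double endX, double endY) {
            this.text = text;
            this.startX = startX;
            this.startY = startY;
            this.endX = endX;
            this.endY = endY;
        }

        // Baseline angle, scaled and truncated to an int.
        int orientation() {
            return (int) (Math.atan2(endY - startY, endX - startX) * 1000);
        }

        // Signed distance of the (unbounded) baseline from the origin, truncated.
        int perpendicular() {
            double len = Math.hypot(endX - startX, endY - startY);
            double ux = (endX - startX) / len;
            double uy = (endY - startY) / len;
            return (int) (startX * uy - startY * ux);
        }

        // Position of the chunk start along the writing direction.
        double parallel() {
            double len = Math.hypot(endX - startX, endY - startY);
            return (startX * (endX - startX) + startY * (endY - startY)) / len;
        }
    }

    static List<String> lines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt(Chunk::orientation)
                .thenComparingInt(Chunk::perpendicular)
                .thenComparingDouble(Chunk::parallel));

        List<String> result = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            boolean sameLine = previous != null
                    && previous.orientation() == c.orientation()
                    && previous.perpendicular() == c.perpendicular();
            if (previous != null && !sameLine) {
                result.add(line.toString());
                line.setLength(0);
            }
            if (sameLine) line.append(' ');
            line.append(c.text);
            previous = c;
        }
        if (previous != null) result.add(line.toString());
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("#*3", 60, 457.15, 80, 457.15),
                new Chunk("A :", 10, 469.45, 50, 469.45),
                new Chunk("#*2", 60, 468.65, 80, 468.65),
                new Chunk("B :", 10, 457.95, 50, 457.95));
        for (String l : lines(chunks)) System.out.println(l);
    }
}
```

With the y coordinates from the answer, the first row is split into two lines while the second row is merged, which is the behaviour observed with the default strategy.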

Question:

I have a multi-module Maven project with a request generation process. This process contains some Vaadin upload components through which we upload documents that must be PNG, JPG, PDF, or BMP files only. At the end of the process I merge all the documents into one PDF and then download it with a file downloader.

The function i am calling on a button click event is:

/**
 * This function is responsible for getting
 * all documents from request and merge
 * them in a single pdf file for
 * download purposes
 * @throws Exception
 */
protected void downloadMergedDocument() throws Exception {

    // Calling create pdf function for merged pdf
    createPDF();

    // Setting the merged file as a resource for file downloader
    Resource myResource = new FileResource(new File(mergedReportPath + request.getWebProtocol() + ".pdf"));
    FileDownloader fileDownloader = new FileDownloader(myResource);

    // Extending the download button for download
    fileDownloader.extend(downloadButton);

}

/**
 * This function is responsible for providing 
 * the PDF related to a particular request that 
 * contains all the documents merged inside it 
 * @throws Exception
 */
private void createPDF() throws Exception {
    try{
        // Getting the current request
        request = evaluationRequestUI.getRequest();

        // Fetching all documents of the request            
        Collection<DocumentBean> docCollection = request.getDocuments();

        // Initializing Document of using itext library
        Document doc = new Document();

        // Setting PdfWriter for getting the merged images file
        PdfWriter.getInstance(doc, new FileOutputStream(mergedReportPath+ "/mergedImages_" + request.getWebProtocol()+ ".pdf"));

        // Opening document
        doc.open();

        /**
         * Here iterating on document collection for the images type   
         * document for merging them into one pdf    
         */                                        
        for (DocumentBean documentBean : docCollection) {
            byte[] documents = documentBean.getByteArray();

            if(documentBean.getFilename().toLowerCase().contains("png") ||
                    documentBean.getFilename().toLowerCase().contains("jpeg") ||
                    documentBean.getFilename().toLowerCase().contains("jpg") ||
                    documentBean.getFilename().toLowerCase().contains("bmp")){

                Image img = Image.getInstance(documents);

                doc.setPageSize(img);
                doc.newPage();
                img.setAbsolutePosition(0, 0);
                doc.add(img);
            }
        }

        // Closing the document
        doc.close();

        /**
         * Here we get all the images type documents merged into 
         * one pdf, now moving to pdfbox for searching the pdf related 
         * document types in the request and merging the above resultant      
         * pdf and the pdf document in the request into one pdf
         */

        PDFMergerUtility utility = new PDFMergerUtility();

        // Adding the above resultant pdf as a source 
        utility.addSource(new File(mergedReportPath+ "/mergedImages_" + request.getWebProtocol()+ ".pdf"));

        // Iterating for the pdf document types in the collection
        for (DocumentBean documentBean : docCollection) {
            byte[] documents = documentBean.getByteArray();

            if(documentBean.getFilename().toLowerCase().contains("pdf")){
                utility.addSource(new ByteArrayInputStream(documents));
            }
        }

        // Here setting the final pdf name
        utility.setDestinationFileName(mergedReportPath +request.getWebProtocol()+ ".pdf");

        // Here final merging and then result
        utility.mergeDocuments();

    }catch(Exception e){
        m_logger.error("CATCH", e);
        throw e;
    }
}  

Note: mergedReportPath is a path defined for PDF files to be stored in and then retrieved from for download purposes.

Now, I have two problems with that:

  1. When I run this process for the first request, it gives me the PDFs in the destination folder but does not download them.
  2. When I run the process again for a second request, it gets stuck at utility.mergeDocuments(); that is, if the PDF is already present in the destination folder, it gets stuck. I don't know where the problem is. Please help.

Answer:

In the 2.0 version of PDFBox, you can set an output stream with setDestinationStream(). Thus, you just call

response.setContentType("application/pdf");
OutputStream os = response.getOutputStream();
utility.setDestinationStream(os);
utility.mergeDocuments();
os.flush();
os.close();

You can't set the response size this way; if you have to, use ByteArrayOutputStream like in Bruno's answer or this one.
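
If the response size is needed, the buffering idea can be sketched with plain JDK streams. In this sketch PdfProducer is a hypothetical stand-in for the merge step (anything that writes the finished PDF to a given OutputStream), and the servlet response is left out entirely:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Sketch: buffer the generated bytes first so that their length is known
// before anything is sent. BufferedMergeSketch and PdfProducer are hypothetical.
class BufferedMergeSketch {

    interface PdfProducer {
        void writeTo(OutputStream out) throws IOException;
    }

    // Runs the producer against an in-memory buffer and returns the bytes.
    static byte[] buffer(PdfProducer producer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            producer.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder bytes instead of a real merged document.
        byte[] pdf = buffer(out -> out.write(new byte[] {'%', 'P', 'D', 'F'}));
        // At this point the content length (pdf.length) can be set on the
        // response before the array is written to the response stream.
        System.out.println("bytes to send: " + pdf.length);
    }
}
```

The trade-off is memory: the whole merged document is held in a byte array, which is fine for modest uploads but worth reconsidering for very large files.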

Question:

I had embedded a byte array into a pdf file (Java). Now I am trying to extract that same array. The array was embedded as a "MOVIE" file.

I couldn't find any clue on how to do that...

Any ideas?

Thanks!

EDIT

I used this code to embed the byte array:

public static void pack(byte[] file) throws IOException, DocumentException{

    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);

    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));

    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}

Answer:

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:

public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";

public static void main(String[] args) throws IOException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new ExtractStreams().parse(SRC, DEST);
}

public void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream)obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            }
            catch(UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            FileOutputStream fos = new FileOutputStream(String.format(dest, i));
            fos.write(b);
            fos.flush();
            fos.close();
        }
    }
}

Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:

  • When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you will get the uncompressed PDF syntax.
  • Not all filters are supported in iText. Take for instance /DCTDecode which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream) which is also the method you need to get your AVI-bytes from your PDF.

This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library

You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.

Question:

I have many PDFs (Version: 4) from 2007 which obviously have forms, but the AcroForm object in pdfbox and iText 5 is either empty or null.

Why do I believe that the PDFs contains forms? Because in the metadata I see references to XFD-files

For data privacy reason, I cannot provide the PDF files.

AcroForm/AcroFields
iText
AcroFields acroFields = reader.getAcroFields();
if (acroFields.getFields().size() == 0) {
  System.err.println("No acroFields");
  return;
}

output: No acroFields

pdfbox
PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm == null) {
  System.err.println("No AcroForm");
  return;
}

if (acroForm.hasXFA()) {
  System.out.println("doc has XFA");
  acroForm.getXFA();
  return;
}

output: No AcroForm

Metadata
PDF-Version: 4
CreationDate = D:20071019093057-04'00'
Producer = Acrobat Distiller 7.0 (Windows)
Author = name
Title = filename.xfd
Creator = PScript5.dll Version 5.2
ModDate = D:20071019093057-04'00'

XMP output

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Acrobat Distiller 7.0 (Windows)</pdf:Producer>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:CreatorTool>PScript5.dll Version 5.2</xap:CreatorTool>
         <xap:ModifyDate>2007-10-19T09:30:57-04:00</xap:ModifyDate>
         <xap:CreateDate>2007-10-19T09:30:57-04:00</xap:CreateDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">filename.xfd</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>name</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:6161773b-92f4-4954-a368-eed868c10438</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:7737a837-0df8-4daa-9683-3547663fccaa</xapMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

Answer:

Why do I believe that the PDFs contains forms? Because in the metadata I see references to XFD-files

That merely means that the pdfs have been generated from some xfd file but they can simply contain a flat copy of the current contents of the xfd.

acroForm == null indicates that there is no valid central form structure in the pdf. There might at most be some lost form field widgets associated with some pages.

Question:

Is it possible to change the orientation of an existing document from 'portrait' to 'landscape' or vice versa?

I've tried copying a page (in portrait mode) to a newly created page (in landscape mode) using iText but it didn't work, the page orientation of the copied page was used.

Here's the code I used:

PdfReader originalFileReader = new PdfReader(src);

Document landscapeDoc = new Document(PageSize.A4.rotate());

PdfCopy copy = new PdfCopy(landscapeDoc, new FileOutputStream("/home/user/landscape.pdf"));

landscapeDoc.open();

for (int i = 1; i <= originalFileReader.getNumberOfPages(); i++) {
     copy.addPage(copy.getImportedPage(originalFileReader, i));
}

landscapeDoc.close();

Answer:

Getting a page from the original file and adding it to the copy doesn't re-layout the page. If you get a landscape page at all, it will simply contain a copy of the original page clipped to the height of the landscape page.

Looking at the iText site it appears that the 2 closest use-cases to what you want are extracting data fields (marked up using a template) from a PDF to an XML structure (pdf2Data) and adding content (watermarks, images, annotations, etc.) to an existing PDF. (Lots of examples here.)

There's nothing there about intelligently pulling content and formatting from a PDF and re-laying it out in a different PDF. (Which would be an extremely hard problem anyway.)

Question:

I have a requirement to embed an image into a PDF. The PDF size and image size are the same (I have everything working up to this point). The difficult part is that the end user has some text that they need to enter above some areas of the image; these areas will be predefined, but the text that goes in those positions is not. I know it's possible to create fillable fields using tools like iText. I have searched for days on how to use iText to accomplish setting up fillable field and positioning text of any kind to an absolute position, but frustratingly have made zero progress. So I could really use someone's expertise on this subject. Thanks


Answer:

Your question isn't entirely clear, and the answer is different if you make different assumptions.

Assumption 1: Suppose that you have a PDF that consists of an image that fills the complete page. You now want to add text fields at positions that you know in advance.

In this case, you'd use PdfStamper and the addAnnotation() method as is done in the answer to the StackOverflow question How can I add a new AcroForm field to a PDF?

PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// create a field for which you define the coordinates using a Rectangle
stamper.addAnnotation(field, 1);
stamper.close();

Here we add field to page 1 using the addAnnotation() method.

Now for the question: how to create that field object. That's easy. See for instance the ReadOnlyField example:

Rectangle rect = new Rectangle(36, 720, 144, 806);
TextField tf = new TextField(stamper.getWriter(), rect, "text");
tf.setOptions(TextField.MULTILINE);
PdfFormField field = tf.getTextField();

Note that I use the coordinates of the lower-left corner (36, 720) and the upper-right corner (144, 806) to create a Rectangle object. I create a TextField using the stamper's PdfWriter instance and that rect, and I give that field the name text. Assuming that you want the text that is entered to be wrapped, I made the text field a MULTILINE field. I then obtain a PdfFormField instance from the TextField object.

Assumption 2: you are creating a PDF document from scratch in which you create a page to which you add an image with the same size of the page. Now you just want to add form fields to add text. There are many examples on how to define and add a text field on the official iText web site: MultiLineField, TextFields, GenericFields, CreateFormInTable, and many more.

You'll also find a good example in the question How to add a hidden text field?. The example in the question shows how to add a visible text field; the answer shows how to hide it.

In this example, x and y are the coordinates of the upper-left corner (the rectangle spans from y - h up to y), whereas w and h are the width and the height of the field:

TextField field = new TextField(writer, new Rectangle(x, y - h, x + w, y), name);
field.BackgroundColor = new BaseColor(bgcolor[0], bgcolor[1], bgcolor[2]);
field.BorderColor = new BaseColor(
    bordercolor[0], bordercolor[1], bordercolor[2]);
field.BorderWidth = border;
field.BorderStyle = PdfBorderDictionary.STYLE_SOLID;
field.Text = text;
writer.AddAnnotation(field.GetTextField());

This is an iTextSharp example (written in C#), but it's very easy to port it to Java.
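
The Rectangle in this snippet is built from lower-left and upper-right coordinates, so the four arguments follow from the position, width and height by simple arithmetic. A small sketch in plain Java (the class and helper names are made up, and no iText types are used); it treats y as the top edge of the field, which is what the expression y - h implies:

```java
// Sketch: derive (llx, lly, urx, ury) from a position plus width and height,
// mirroring new Rectangle(x, y - h, x + w, y). FieldRectSketch is hypothetical.
class FieldRectSketch {

    // Returns {llx, lly, urx, ury}; (x, y) is taken as the top-left corner.
    static float[] fromTopLeft(float x, float y, float w, float h) {
        return new float[] {x, y - h, x + w, y};
    }

    public static void main(String[] args) {
        // A field 108 wide and 86 high whose top-left corner is at (36, 806).
        float[] r = fromTopLeft(36f, 806f, 108f, 86f);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

These inputs give the corners (36, 720) and (144, 806), the same rectangle as in the ReadOnlyField example earlier in this answer.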

Finally: maybe you already knew all of this. Maybe you were just wondering what all these coordinates are about. The answer to this question can also be found on the official iText web site:

Almost all of the links in my answer refer to examples and answers that were written in answer to previous questions on StackOverflow. Please refrain from saying things like I have searched for days on how to use iText to accomplish setting up fillable field and positioning text of any kind to an absolute position because it is hard to believe for people who know that all the answers can be found on the official iText web site. Your boss might wonder which sites you were searching for all those days.

Question:

I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images are not extracted by PDFBox.

For extracting the image I am using following code : Not able to extract images from PDFA1-a format document

You can download a sample pdf with this problem from this link : http://myslams.com/test/2.pdf

Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?


Answer:

As the OP has not yet replaced his stale sample PDF link by a working one, the question can only be answered in general terms.

The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.

Thus, the code may store too many images because image resources of a page may not necessarily be used on the page in question:

  1. On one hand it may not be used at all in the file or at least nowhere visible, merely a left-over from some prior PDF editing session.
  2. On the other hand multiple pages may have a shared resources dictionary containing all images on all these pages; in this case the OP's code exports many duplicates.

And the code may store too few images because there are other places where images may be put:

  1. Image data may be directly included in the page content stream, aka inline images.
  2. Constructs with their own resources (form xobjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
  3. Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
  4. XFA forms may provide their own images, too.

As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.

EDIT

According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". Especially pointing to example code appropriate for the PDFBox version 1.8.8 used by the OP seems to have been important.

Thus, any kind of wrong output may also occur as a result of software orchestration issues.

Question:

I'm working on a small Scala/Java prototype where I have several PDF templates i.e. they have text and image placeholders and the placeholders should be replaced with some content. Some sections are also multiple i.e. the actual number of occurrences or repetitions depends on the input. Then finally I need to generate and append an extra PDF page.

I'm aware that these use-cases can be covered using iText. My question is whether I can use an alternative solution for this (and how to do it). I'd prefer to avoid commercial solutions for the time being.

UPDATE: I'd like my PDF templates to be created by professional designers. They will know where the placeholders will be but should have full control on the design aspects. This requirement discards solutions based solely on XML inputs or others where the PDF is created fully programmatically.


Answer:

Jasper - It's software designed for creating dynamic reports connected with database inputs, but I think it can be utilized in the way you want. It has a graphical designer (either iReport or JaspersoftStudio based on your preferences) and supports passing multiple variables with content or even images. Long static formatted texts may be a problem, but you will have to judge that for yourself.

JODReports + JODConverter - those two tools will allow your designers to work in pure ODT (OpenOffice format) files, putting in dynamic data as variables that you will be able to substitute to your heart's desire from your Java code and print the output in PDF format. More than that, thanks to the Java UNO API you can seize full control over the way your template looks and behaves if your inserted texts are really complex (though admittedly it's not intuitive to use).

Question:

I now have a PDF file that is rendered in PDFBox into a single image per page

// load pdf and save image
try (PDDocument document = PDDocument.load("some file")) {
  PDFRenderer render = new PDFRenderer(document);
  BufferedImage scaledImage = render.renderImageWithDPI(pageIndex, 326);
  // save image
}

The image saved in this step is previewed in the browser. The user can drag and drop an image onto this preview, and I then map the drop coordinates to the real PDF, but the result is always off. Here is how I map them:

  1. In the browser, get the preview's width and height and the x, y of the dropped image's upper left corner within the preview.
  2. The backend fetches the PDF page's actual width and height and scales the preview coordinates by the ratio between the two, which gives the x, y of the dropped image's upper left corner on the PDF page.
  3. Because the origin of the PDF coordinate system is the lower left corner of the page, the final formulas for x and y are:

    • x: float targetX = (previewX * 1.0F / previewWidth) * pdfPageWidth;
    • y: float targetY = pdfPageHeight - (previewY * 1.0F / previewHeight) * pdfPageHeight - dragImageHeight;
  4. I draw the image on the PDF page at the x, y computed above, but it lands in the wrong place, and the error is obvious. What can I do?

Reference document

iText

Edit: I also tried iText:

```
Rectangle cropBox = reader.getCropBox(firstPageIndex);

float widthRatio = renderRandomX * 1.0F / renderWidth;
float heightRatio = renderRandomY * 1.0F / renderHeight;

float offsetLLX = cropBox.getWidth() * widthRatio;
float offsetLLY = cropBox.getHeight() - cropBox.getHeight() * heightRatio;

Rectangle drawSignRect = new Rectangle(cropBox.getLeft() + cropBox.getWidth() * widthRatio,
    cropBox.getBottom() + offsetLLY,
    cropBox.getLeft() + offsetLLX + signImage.getWidth(),
    cropBox.getBottom() + offsetLLY + signImage.getHeight());

```


Answer:

After struggling with this for almost a week, I finally solved the problem. The algorithm itself is fine, but the third-party system zooms the target image. Once the position is calculated with that scaling taken into account, it is accurate.
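A minimal sketch of the corrected mapping, assuming an unrotated page whose lower left corner is at (0, 0). The class, method, and parameter names are hypothetical; `imageZoom` stands for whatever factor the third-party system applies to the dropped image, and `imageHeight` is the unzoomed image height in PDF units.

```java
final class PreviewToPdf {

    /**
     * Maps the upper left corner of an image dropped on the preview to the
     * lower left corner of that image in PDF coordinates.
     * Returns { x, y } in PDF units.
     */
    static float[] toPdfLowerLeft(float previewX, float previewY,
                                  float previewWidth, float previewHeight,
                                  float pageWidth, float pageHeight,
                                  float imageHeight, float imageZoom) {
        // Ratio between PDF units and preview pixels on each axis.
        float scaleX = pageWidth / previewWidth;
        float scaleY = pageHeight / previewHeight;

        // Height the image really occupies on the page once the
        // third-party zoom has been applied (assumed behaviour).
        float drawnHeight = imageHeight * imageZoom;

        float x = previewX * scaleX;
        // Flip the y axis (PDF origin is bottom left) and move down by the
        // drawn height, because PDF positions the image by its lower left corner.
        float y = pageHeight - previewY * scaleY - drawnHeight;
        return new float[] { x, y };
    }
}
```

With `imageZoom` set to 1 this reduces to the formulas in the question; any other value shifts `y` by the difference in drawn height, which is the kind of visible offset described above.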

Question:

I'm testing Java libraries for editing an existing PDF, but I can't load my existing PDF. I get the same result with iText and PDFBox: I can load the file and the data seems to be there (the generated PDF has a nonzero size in KB), but the PDF that is created is empty (nothing is displayed).

I'm doing this on an App Engine server. With both libraries I can create a PDF and display it in my browser through a servlet or web service.

I'm totally lost. I've tried tons of code, but always get the same result!

iText with importedPage :

    Document document = new Document();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfWriter docWriter = PdfWriter.getInstance(document, baos);
    document.open();

    // Load existing PDF
    PdfReader reader = new PdfReader("WEB-INF/pdf.pdf");
    document.newPage();
    PdfImportedPage page = docWriter.getImportedPage(reader, 1);
    PdfPTable table = new PdfPTable(2);
    table.addCell(Image.getInstance(page));
    document.add(table);

    document.close();
    docWriter.close();

pdfbox :

     PDDocument document = new PDDocument();
     PDDocument sourceDocument = PDDocument.load("WEB-INF/pdf.pdf");
     PDPage templatePdfPage = (PDPage)sourceDocument.getDocumentCatalog().getAllPages().get(0);
     document.addPage(templatePdfPage);
     document.save(output);

Answer:

First, get the path using the ServletContext, then use PDFBox to read the PDF file and save it in the /WEB-INF/savedpdffiles/ folder.

Note: Create the folder savedpdffiles under WEB-INF folder.

See The JRE Class White List: a Java App Engine application's access to the classes in the Java standard library (the Java Runtime Environment, or JRE) is limited to the classes on that list.

Read and save PDF file in Google AppEngine.

Code:

    PrintWriter printWriter = response.getWriter();
    try {
        ServletContext context = request.getSession().getServletContext();
        String pdffiles = context.getRealPath("/WEB-INF/");

        File readPath = new File(pdffiles);
        if (readPath.exists()) {
            String pdfFile = "04-Request-Headers.pdf"; // read this file to save in savedpdffiles folder
            File savedPath = new File(readPath.getAbsolutePath() +"/savedpdffiles/"); // create savedpdffiles folder under WEB-INF folder

            File readFullPath = new File(readPath.getAbsolutePath() + File.separatorChar + pdfFile);
            if (readFullPath.isFile()) {
                if(!savedPath.exists()) {
                    savedPath.mkdirs(); // create the savedpdffiles directory if it does not exist
                    printWriter.println( savedPath.getName() +" File created in -> "+ savedPath.getAbsolutePath());
                }

                PDDocument document = new PDDocument();
                PDDocument sourceDocument = PDDocument.load(readFullPath.getAbsolutePath()); // read the pdf file by PDDocument
                PDPage templatePdfPage = (PDPage) sourceDocument.getDocumentCatalog().getAllPages().get(0); // only first page is read out of 13 pages and save the first page.
                document.addPage(templatePdfPage);
                document.save(savedPath + "/" + pdfFile);
                document.close();
                sourceDocument.close();
                printWriter.print(pdfFile + " File saved to this location-> "+ savedPath.getAbsolutePath() + File.separatorChar + pdfFile);
            } else {
                printWriter.println(readFullPath.getName() + " File not exits in -> "+ readFullPath.getAbsolutePath());
            }
        } else {
            printWriter.println("Path not exists -> "+ readPath.getAbsolutePath());
        }
    } catch (Exception e) {
        printWriter.print("Type of Error occured while saving the PDF file -> "+ e.getMessage());
        e.printStackTrace();
    }

On App Engine, you will get the following error:

Caused by:
          java.lang.NoClassDefFoundError: java.awt.Color is a restricted class. Please see the Google  App Engine developer's guide for more details.
    Powered by Jetty://

Question:

I have several methods to manipulate my PDF files, such as converting their pages to .jpg images in order to compress them. Now I have a PDF file that doesn't have an XObject, i.e. I cannot turn it into a JPG to compress it. So I decided to take the entire PDF file and try to compress it some other way. I tried iText's PdfStamper and PDFBox's addCompression (deprecated), but neither has worked so far. Here is the code:

    public static byte[] compressPdf(final byte[] imageBytes) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {

            final PdfReader reader = new PdfReader(imageBytes);
            final PdfStamper stamper = new PdfStamper(reader, out, PdfWriter.VERSION_1_7);

            stamper.getWriter().setFullCompression();
            stamper.getWriter().setCompressionLevel(9);

            int total = reader.getNumberOfPages() + 1;
            for (int i = 1; i < total; i++) {
                reader.setPageContent(i, reader.getPageContent(i));
            }

            stamper.close();
            reader.close();

            return out.toByteArray();
        } catch (Exception e) {
            e.printStackTrace();
        }

        return null;
    }

Note that neither stamper.getWriter().setFullCompression() nor stamper.getWriter().setCompressionLevel(9) has any effect.


Answer:

The PDF document you are displaying is merely a wrapper round an image.

Allow me to elaborate.

Normally, a PDF consists of instructions for a viewer. Something like:

  • go to coordinates 50, 50
  • set the font to Helvetica, size 12
  • draw the glyph for character 'H'
  • etc

These instructions are gathered into objects. And similarly, the resources they use (like images, fonts, etc) are also grouped into objects.

Each object gets assigned a number. Those are the numbers in the XREF.

When iText attempts to apply compression, it looks for stream objects (streams of instructions, fonts, etc.) and attempts to compress those.

Your PDF contains only 1 image.

iText will not compress your image (since that may result in loss of quality).
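To make the difference concrete, here is a small sketch using only the JDK's `Deflater`, not iText itself. It deflates a repetitive, made-up content stream and a block of pseudo-random bytes. The random bytes are an assumption standing in for image data that is already stored in compressed form; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

final class FlateDemo {

    /** Returns the number of bytes produced by deflating the input at level 9. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** A made-up page content stream: the same text-drawing operators repeated. */
    static byte[] sampleContentStream(int repetitions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repetitions; i++) {
            sb.append("BT /F1 12 Tf 50 50 Td (H) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Pseudo-random bytes, standing in for already-compressed image data. */
    static byte[] sampleCompressedPayload(int length) {
        byte[] data = new byte[length];
        new Random(1).nextBytes(data);
        return data;
    }
}
```

The instruction stream shrinks to a small fraction of its size, while the random payload does not shrink at all. A lossless pass over a file that consists almost entirely of one image therefore has very little to work with.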

What you can do:

  • do not use scanned documents, use 'real' PDF documents (your end-users will be grateful)
  • extract the image from your PDF (using iText), compress the image (using an image processing library), re-insert the image into the resources.
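The middle step of the second option (compressing the image) could look roughly like the sketch below, which uses the JDK's `javax.imageio` as the image processing library. Extracting the image from the PDF and re-inserting it are not shown, and the class and method names are hypothetical. JPEG is lossy, so the quality value is a trade-off you would have to tune.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

final class JpegRecompressor {

    /**
     * Encodes an RGB image (no alpha channel) as JPEG.
     * quality ranges from 0.0 (smallest file) to 1.0 (best quality).
     */
    static byte[] toJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream output = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(output);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        // The image stream is closed (and flushed) before the bytes are copied.
        return bytes.toByteArray();
    }
}
```

Calling `toJpeg(image, 0.5f)` on the extracted image returns bytes that can then be put back into the PDF with the library of your choice.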

Question:

I'm trying to convert .doc to .pdf, but I got this exception and I don't know how to fix it.

java.io.IOException: Missing root object specification in trailer
at   org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2042)

This is where the exception is thrown:

PDDocument pdfDocument = PDDocument.load(convertDocToPdf(documentInputStream));

Here is my conversion method:

private byte[] convertDocToPdf(InputStream documentInputStream) throws Exception {
    Document document = null;
    WordExtractor we = null;
    ByteArrayOutputStream out = null;
    byte[] documentByteArray = null;
    try {
        document = new Document();
        POIFSFileSystem fs = new POIFSFileSystem(documentInputStream);

        HWPFDocument doc = new HWPFDocument(fs);
        we = new WordExtractor(doc);
        out = new ByteArrayOutputStream();
        PdfWriter writer = PdfWriter.getInstance(document, out);

        Range range = doc.getRange();
        document.open();
        writer.setPageEmpty(true);
        document.newPage();
        writer.setPageEmpty(true);

        String[] paragraphs = we.getParagraphText();
        for (int i = 0; i < paragraphs.length; i++) {
            org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            document.add(new Paragraph(paragraphs[i]));
        }
        documentByteArray = out.toByteArray();
    } catch (Exception ex) {
        ex.printStackTrace(System.out);
        throw new Exception(STATE.FAILED_CONVERSION.name());
    } finally {
        document.close();
        try {
            we.close();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return documentByteArray;
}

Answer:

You use iText classes and do

documentByteArray = out.toByteArray();

before you finish the document

document.close();

Thus, the documentByteArray only contains an incomplete PDF which PDFBox complains about.
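The same pitfall can be reproduced with the JDK alone. The sketch below is an analogy, not iText code: a `GZIPOutputStream` is swapped in for the iText `Document`, because both only finish their output when `close()` is called. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CloseBeforeRead {

    /** Returns { bytes copied before close(), bytes copied after close() }. */
    static byte[][] snapshots(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(out);
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
            byte[] tooEarly = out.toByteArray(); // copied before the stream is finished
            gzip.close();                        // writes the remaining data and the trailer
            byte[] complete = out.toByteArray(); // copied after close(): the full stream
            return new byte[][] { tooEarly, complete };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the bytes form a complete gzip stream that can be read to the end. */
    static boolean isReadable(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[1024];
            while (in.read(buffer) != -1) {
                // drain the stream
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

In the conversion method from the question, the equivalent fix is to call document.close() inside the try block, before out.toByteArray().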

Question:

I need to read the xref table in a PDF and fill all free entries (marked with 'f' at the end) with a text string read from a file. This is an example of an xref table in a PDF:

    xref
    0 256
    0000000029 65535 f
    0000000017 00000 n
    0000000125 00000 n
    0000000216 00000 n
    0000000030 65535 f
    0000000031 65535 f
    0000000032 65535 f

and I want to replace with string [A443DD719B11118D12D99E5EA18E5EA9934] then it'll become:

    0000000029 65535 f
    0000000017 00000 n
    0000000125 00000 n
    0000000216 00000 n
    0000000030 A443D f
    0000000031 D719B f
    0000000032 11118 f
    . . .

I'm working with iText or PDFBox in Java, but I can't find a way to read or access the xref table and replace its entries with text from a file. Please help.


Answer:

Every indirect object in a PDF has a unique identifier as defined in ISO 32000-1:

The object identifier shall consist of two parts:

  • A positive integer object number. Indirect objects may be numbered sequentially within a PDF file, but this is not required; object numbers may be assigned in any arbitrary order.
  • A non-negative integer generation number. In a newly created file, all indirect objects shall have generation numbers of 0. Nonzero generation numbers may be introduced when the file is later updated.

This identifier is used in the cross-reference table, and there's a maximum value for generation numbers in an ordinary cross-reference table:

The maximum generation number is 65,535

You want to change the generation number into something that isn't a number. And even if you treated strings such as D719B as (hexadecimal) numbers, you'd still be exceeding the maximum generation number.
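As an illustration, here is a small validator sketch for classic cross-reference table entries. It assumes the usual entry layout of a 10-digit number, a 5-digit generation number, and the keyword n or f; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XrefEntry {

    /** Maximum generation number, as quoted above. */
    static final int MAX_GENERATION = 65535;

    // 10-digit number, 5-digit generation number, then 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY = Pattern.compile("(\\d{10}) (\\d{5}) ([nf])");

    /** True if the line is a well-formed entry with a legal generation number. */
    static boolean isValid(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        return m.matches() && Integer.parseInt(m.group(2)) <= MAX_GENERATION;
    }

    /** True if the text, read as a hexadecimal number, would still fit. */
    static boolean fitsAsHex(String text) {
        return Integer.parseInt(text, 16) <= MAX_GENERATION;
    }
}
```

Here `isValid("0000000030 65535 f")` is true, `isValid("0000000030 A443D f")` is false, and `fitsAsHex("D719B")` is false as well.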

In other words: you ask PDF specialists to create PDFs that do not comply with the ISO standard. Every PDF expert with some self-respect will refuse to answer that question and ask you to reconsider.

In the comments, you claim that you want to introduce some invisible watermark into a PDF file. Why do you want to abuse the concept of generation numbers to do this? Why don't you just add an extra (custom) entry to the catalog?