Hot questions for Using PDFBox in apache

Question:

We are planning to migrate our PDF generation utilities from iText to PDFBox (due to licensing issues with iText). With some effort, I was able to write and position text, draw lines, etc. But creating tables with text embedded in the cells is a challenge. I went through the documentation and examples and searched Google and Stack Overflow, but couldn't find anything. Does PDFBox provide native support for creating tables with embedded text? My last resort would be to use this project: https://github.com/eduardohl/Paginated-PDFBox-Table-Sample


Answer:

Since I also needed table drawing functionality for a side project, I implemented a small "table drawer" library myself, which I uploaded to GitHub.

In order to produce such a table – for instance – ...

... you would need this code. In the same file you find the code for that table as well:

The current "feature list" includes:

  • set font and font size on table level as well as on cell level
  • define single cells with bottom-, top-, left- and right-border width separately
  • define the background color on row or cell level
  • define padding (top, bottom, left, right) on cell level
  • define border color (on table, row or cell level)
  • specify text alignment (vertical and horizontal)
  • cell spanning and row spanning
  • text wrapping and line spacing

It should also not be too hard to add missing features, such as separate border colors for the top, bottom, left and right borders, if needed.

Question:

I'm trying to use the Apache PDFBox library to create a PDF document programmatically. The class PDPageContentStream contains methods to write text, draw lines, bezier curves, rectangles. But I can't find a way to draw a simple filled circle. Is there a way to draw it using this library? If not, can you please suggest a free Java library that provides flexible API to create PDF documents programmatically? Thanks in advance.


Answer:

OK, thanks everyone for responses. I like the solution with bezier curves. This approach works for me:

private void drawCircle(PDPageContentStream contentStream, int cx, int cy, int r, int red, int green, int blue) throws IOException {
    final float k = 0.552284749831f;
    contentStream.setNonStrokingColor(red, green, blue);
    contentStream.moveTo(cx - r, cy);
    contentStream.curveTo(cx - r, cy + k * r, cx - k * r, cy + r, cx, cy + r);
    contentStream.curveTo(cx + k * r, cy + r, cx + r, cy + k * r, cx + r, cy);
    contentStream.curveTo(cx + r, cy - k * r, cx + k * r, cy - r, cx, cy - r);
    contentStream.curveTo(cx - k * r, cy - r, cx - r, cy - k * r, cx - r, cy);
    contentStream.fill();
}
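The constant k in the method above is the usual control-point offset for approximating a quarter circle with a single cubic Bézier segment, k = 4/3 · (√2 − 1). As a minimal sketch (plain Java, no PDFBox; the class and method names are made up for illustration), the following derives the value and checks that the midpoint of such a segment lies on the circle:

```java
// Illustrative helper, not part of PDFBox.
final class CircleKappa {

    /** Control-point offset for a quarter-circle cubic Bezier: 4/3 * (sqrt(2) - 1). */
    static double kappa() {
        return 4.0 / 3.0 * (Math.sqrt(2.0) - 1.0);
    }

    /**
     * Distance from the origin of the curve point at t = 0.5 for the quarter arc
     * from (r, 0) to (0, r) with control points (r, k*r) and (k*r, r).
     */
    static double midpointRadius(double r) {
        double k = kappa();
        // Cubic Bezier at t = 0.5: (P0 + 3*P1 + 3*P2 + P3) / 8
        double x = (r + 3 * r + 3 * k * r + 0) / 8.0;
        double y = (0 + 3 * k * r + 3 * r + r) / 8.0;
        return Math.hypot(x, y);
    }

    public static void main(String[] args) {
        System.out.println("k = " + kappa());
        System.out.println("midpoint radius for r = 50: " + midpointRadius(50));
    }
}
```

Between the end points and the midpoint the curve deviates slightly from a true circle, which is normally invisible at page scale.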

Question:


Answer:

This is possible using a Splitter.

Here is sample code that splits a document at every page:

PDDocument document = PDDocument.load(myPDF);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document);

You can control the number of pages in each split PDF using setSplitAtPage(split).
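To get a feel for what a given split value produces, here is a small sketch in plain Java (no PDFBox involved; SplitPlan and partSizes are hypothetical names). It assumes the splitter starts a new document every `split` pages and puts the remainder into the last part, which is worth confirming against the Splitter Javadoc for your version:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: predicts the page counts of the parts, it does not split anything.
final class SplitPlan {

    /** Page counts of the parts when a document of totalPages is cut every splitAtPage pages. */
    static List<Integer> partSizes(int totalPages, int splitAtPage) {
        if (totalPages < 0 || splitAtPage < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and splitAtPage >= 1");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int start = 0; start < totalPages; start += splitAtPage) {
            // The last part holds whatever pages are left over.
            sizes.add(Math.min(splitAtPage, totalPages - start));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(partSizes(7, 3));
    }
}
```

Under that assumption, a 7-page document with a split value of 3 would yield parts of 3, 3 and 1 pages.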

Question:

I have simple Java code that uses the Tika library to get the metadata of a PDF file, and it lists the metadata below.

Tika code:

Metadata metadata = new Metadata();
tika.parse(file, metadata);
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + " : " + metadata.get(name));
}

Output:

date : 1996-11-19T09:00:46Z
pdf:PDFVersion : 1.1
access_permission:modify_annotations : true
access_permission:can_print_degraded : true
dcterms:created : 1996-10-22T07:44:27Z
Last-Modified : 1996-11-19T09:00:46Z
dcterms:modified : 1996-11-19T09:00:46Z
dc:format : application/pdf; version=1.1
title : Test
Last-Save-Date : 1996-11-19T09:00:46Z
access_permission:fill_in_form : true
meta:save-date : 1996-11-19T09:00:46Z
pdf:encrypted : false
dc:title : Test
modified : 1996-11-19T09:00:46Z
Content-Type : application/pdf
meta:creation-date : 1996-10-22T07:44:27Z
created : Tue Oct 22 00:44:27 PDT 1996
access_permission:extract_for_accessibility : true
access_permission:assemble_document : true
xmpTPg:NPages : 64
Creation-Date : 1996-10-22T07:44:27Z
access_permission:extract_content : true
access_permission:can_print : true
producer : Acrobat Distiller 2.1 for Power Macintosh
access_permission:can_modify : true

I am using the code below, which uses PDFBox to get the metadata, but I don't want to specify each metadata key; rather, I would like to get all the available metadata keys and iterate over them.

What is the best way to generically access all the metadata key/value pairs when using the PDFBox library?

public static void main(String args[]) {
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File("test/test.pdf");
        try {

            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            // System.out.println(parsedText);

            PDDocumentCatalog cat = pdDoc.getDocumentCatalog();
            PDMetadata metadata = cat.getMetadata();

            if (metadata != null) {
                System.out.println(metadata.getInputStreamAsString());
            }

            printMetadata(pdDoc);

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }


public static void printMetadata(PDDocument document) throws IOException {
        PDDocumentInformation info = document.getDocumentInformation();
        PDDocumentCatalog cat = document.getDocumentCatalog();
        PDMetadata metadata = cat.getMetadata();

        System.out.println("Page Count=" + document.getNumberOfPages());
        System.out.println("Title=" + info.getTitle());
        System.out.println("Author=" + info.getAuthor());
        System.out.println("Subject=" + info.getSubject());
        System.out.println("Keywords=" + info.getKeywords());
        System.out.println("Creator=" + info.getCreator());
        System.out.println("Producer=" + info.getProducer());
        System.out.println("Creation Date=" + formatDate(info.getCreationDate()));
        System.out.println("Modification Date=" + formatDate(info.getModificationDate()));
        System.out.println("Trapped=" + info.getTrapped());   
        if (metadata != null) {
            System.out.println("Metadata=" + metadata.getStream());
        }
    }

Output:

Page Count=64
Title=test
Author=null
Subject=null
Keywords=null
Creator=null
Producer=Acrobat Distiller 2.1 for Power Macintosh
Creation Date=10/22/96 12:44 AM
Modification Date=11/19/96 1:00 AM
Trapped=null

Answer:

Sorry, there is no easy way to iterate through all metadata values. You could go meta (sorry) and use reflection on the PDDocumentInformation object and iterate through the getters, but then you'd also have to handle the different return types. At that point, you may as well just hardcode what you've done above.

And, that's just for the PDDocumentInformation object.
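For what the reflection route might look like, here is a rough sketch in plain Java. The Info class is a hypothetical stand-in for a metadata bean such as PDDocumentInformation, and GetterDump is an invented name; real return types such as Calendar would need their own formatting, which is exactly the extra handling mentioned above.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: walks the public no-argument getters of any object.
final class GetterDump {

    /** Hypothetical stand-in for a metadata bean; not a PDFBox class. */
    public static final class Info {
        public String getTitle() { return "Test"; }
        public String getAuthor() { return null; }
        public int getPageHint() { return 64; }
    }

    /** Maps each getter name (minus the "get" prefix) to the string form of its value. */
    static Map<String, String> dump(Object bean) {
        Map<String, String> out = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            boolean getter = m.getName().startsWith("get")
                    && m.getName().length() > 3
                    && m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && m.getDeclaringClass() != Object.class
                    && !Modifier.isStatic(m.getModifiers());
            if (!getter) {
                continue;
            }
            try {
                Object value = m.invoke(bean);
                // Different return types are flattened to strings here; null becomes "null".
                out.put(m.getName().substring(3), String.valueOf(value));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not call " + m.getName(), e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        dump(new Info()).forEach((name, value) -> System.out.println(name + " : " + value));
    }
}
```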

Navigating through the XMP, where the really fun metadata can live, is even more interesting because it can contain different schemas (DublinCore, XMPMM and many more, see e.g. Jempbox), and even custom metadata.

Over on Tika, we're trying to make more and more of the XMP metadata available (just added XMPMM, and will soon add Photoshop)...if you have any requests, please let us know.

Finally, if you do start working with XMP and PDFBox, I'd recommend sticking with Jempbox for a while (see this).

Question:

I have a code to attach a file to a PDF file.

PDDocument doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);

// read attachment file
File file = new File("/Users/TMac/Projects/Web/dir/index.html");
FileInputStream inputStream = new FileInputStream(file);

PDEmbeddedFile pdEmbeddedFile = new PDEmbeddedFile(doc, inputStream );
pdEmbeddedFile.setSubtype( "application/octet-stream" );

PDComplexFileSpecification fs = new PDComplexFileSpecification();
fs.setEmbeddedFile( pdEmbeddedFile );
fs.setFile("index.html");

int offsetX = 20;
int offsetY = 600;

PDAnnotationFileAttachment txtLink = new PDAnnotationFileAttachment();
txtLink.setFile(fs);

// Set the rectangle containing the link
PDRectangle position = new PDRectangle();
position.setLowerLeftX(offsetX);
position.setLowerLeftY(offsetY);
position.setUpperRightX(offsetX + 20);
position.setUpperRightY(offsetY + 20);
txtLink.setRectangle(position);

page.getAnnotations().add(txtLink);

doc.save("/Users/TMac/Projects/PDF/outputFiles/testHTML.pdf");
doc.close();

The problem is that the attachment icon looks like this:

I need to replace this icon with a custom image. I have found some examples related to text links. When I click on the image, it should open the file. The attachment code above works fine. How can I add a custom image as a thumbnail?


Answer:

You need to create an appearance stream:

PDAnnotationFileAttachment txtLink = new PDAnnotationFileAttachment();
txtLink.setFile(fs);
// Set the rectangle containing the link
int offsetX = 20;
int offsetY = 600;
PDRectangle position = new PDRectangle();
position.setLowerLeftX(offsetX);
position.setLowerLeftY(offsetY);
position.setUpperRightX(offsetX + 20);
position.setUpperRightY(offsetY + 20);
txtLink.setRectangle(position);

PDAppearanceDictionary appearanceDictionary = new PDAppearanceDictionary();
PDAppearanceStream appearanceStream = new PDAppearanceStream(doc);
appearanceStream.setResources(new PDResources());
PDRectangle bbox = new PDRectangle(txtLink.getRectangle().getWidth(), txtLink.getRectangle().getHeight());
appearanceStream.setBBox(bbox);
try (PDPageContentStream cs = new PDPageContentStream(doc, appearanceStream))
{
    PDImageXObject image = PDImageXObject.createFromFile("image.jpg", doc);
    cs.drawImage(image, 0, 0);
}
appearanceDictionary.setNormalAppearance(appearanceStream);
txtLink.setAppearance(appearanceDictionary);

page.getAnnotations().add(txtLink);

Question:

PDFBOX-4450 Details on Issue

Not sure if anyone has encountered this issue, but I am getting an OutOfMemoryError when validating PDFs. Posting here for visibility; if anyone could help, that would be awesome.

If anyone has any ideas, please share. At this point I can't really move forward.

Stuff I've tried

  • Followed the suggestions in the wiki (PDFBox FAQ) without success
  • Increased the max heap size from 2 GB to 4 GB
  • Removed the JVM arg -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
  • Tried using JDK 1.7
  • Used a scratch file (from the wiki)
  • Disabled the cache for PDImageXObject (from the wiki)

My Environment

  • Linux 64 bit (arch linux)
  • Java 8
  • PDFBox/Preflight ver. 2.0.13
  • jbig imageio ver. 3.0.2

Java info

java -version

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

JVM Args used

java -Xmx2048m -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider

Example pdf

Pdf from PDFBOX-4450

Console Output

Jan 30, 2019 10:25:58 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font ArialMT for base font Symbol
Jan 30, 2019 10:25:58 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font ArialMT for base font ZapfDingbats
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1587)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1587)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1587)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1587)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1587)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.getDictionaryString(COSDictionary.java:1559)
at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1531)
at org.apache.pdfbox.preflight.xobject.XObjFormValidator.checkGroup(XObjFormValidator.java:138)
at org.apache.pdfbox.preflight.xobject.XObjFormValidator.validate(XObjFormValidator.java:73)
at org.apache.pdfbox.preflight.process.reflect.GraphicObjectPageValidationProcess.validate(GraphicObjectPageValidationProcess.java:74)
at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:84)
at org.apache.pdfbox.preflight.utils.ContextHelper.validateElement(ContextHelper.java:57)
at org.apache.pdfbox.preflight.process.reflect.ResourcesValidationProcess.validateXObjects(ResourcesValidationProcess.java:224)
at org.apache.pdfbox.preflight.process.reflect.ResourcesValidationProcess.validate(ResourcesValidationProcess.java:81)
at org.apache.pdfbox.preflight.utils.ContextHelper.callValidation(ContextHelper.java:84)

Sample code

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.ValidationResult.ValidationError;
import org.apache.pdfbox.preflight.parser.PreflightParser;

public class Validator {
  private File file = null;
  private List<ValidationError> errorList = new ArrayList<ValidationError>();

  public Validator(File file) {
    this.file = file;
  }

  public List<ValidationError> getErrors(){
    return errorList;
  }

  public boolean validate() throws Exception{
    PreflightParser parser = null;
    PreflightDocument document = null;
    ValidationResult result = null;
    try {
      parser = new PreflightParser(file);
      parser.parse();
      document = parser.getPreflightDocument();
      document.validate();
      result = document.getResult();
      errorList = result.getErrorsList();
    }
    catch(Exception e) {
      throw e;
    }
    finally {
      if(document != null) {
        try {
          document.close();
        }catch(Exception ignored) {}
      }
      parser = null;
      document = null;
      result = null;
    }
    return !errorList.isEmpty();
  }
}

Answer:

When I add these options:

-XX:+HeapDumpOnOutOfMemoryError -Xmx3550m -Xms3550m -Xmn2g 

It failed again. I then used VisualVM to analyze the heap dump and found something interesting.

Most of the char[] content is:

I found this code in

//org.apache.pdfbox.preflight.process.reflect.SinglePageValidationProcess#validateGroupTransparency
    protected void validateGroupTransparency(PreflightContext context, PDPage page) throws ValidationException
    {
        COSBase baseGroup = page.getCOSObject().getItem(XOBJECT_DICTIONARY_KEY_GROUP);
        COSDictionary groupDictionary = COSUtils.getAsDictionary(baseGroup, context.getDocument().getDocument());
        if (groupDictionary != null)
        {
            String sVal = groupDictionary.getNameAsString(COSName.S);
            if (XOBJECT_DICTIONARY_VALUE_S_TRANSPARENCY.equals(sVal))
            {
                context.addValidationError(new ValidationError(ERROR_GRAPHIC_TRANSPARENCY_GROUP,
                        "Group has a transparency S entry or the S entry is null"));
            }
        }
    }

It creates a ValidationError object, and the constructor is:

public ValidationError(String errorCode, String details, Throwable cause)
        {
            this(errorCode);
            if (details != null)
            {
                StringBuilder sb = new StringBuilder(this.details.length() + details.length() + 2);
                sb.append(this.details).append(", ").append(details);
                this.details = sb.toString();
            }
            this.cause = cause;
            t = new Exception();
        }

As you can see, every time an error is found, a ValidationError is created and a new StringBuilder builds its details string.

So there are three ways to solve the problem:

  1. Increase your heap size. 4 GB is not enough; try 16 GB or more.
  2. Don't use the PDFBox library.
  3. Change the PDFBox source code:
    public ValidationError(String errorCode, String details, Throwable cause)
    {
        this(errorCode);
        if (details != null)
        {
            String key = errorCode + details;
            if (commonDetailMap.containsKey(key)) {
                this.details = commonDetailMap.get(key);
            } else {
                StringBuilder sb = new StringBuilder(this.details.length() + details.length() + 2);
                sb.append(this.details).append(", ").append(details);
                this.details = sb.toString();
                commonDetailMap.put(key, this.details);
            }

        }
        this.cause = cause;
        t = new Exception();
    }

I think using a Map to avoid creating too many StringBuilders would work, but the Map could grow too large if there are many distinct combinations of error code and details.

So another way to change the source code is:

    public ValidationError(String errorCode, String details, Throwable cause)
    {
        this(errorCode);
        if (details != null)
        {
            StringBuilder sb = new StringBuilder(this.details.length() + details.length() + 2);
            sb.append(this.details).append(", ").append(details);
            // invoke intern
            this.details = sb.toString().intern();
        }
        this.cause = cause;
        t = new Exception();
    }

The Javadoc for intern() says:

Returns a canonical representation for the string object.

I think that using intern() is better.
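To illustrate what interning does and does not buy, here is a small plain-Java sketch (InternDemo and the "E1" code are invented for the example and are not PDFBox names). Two strings built separately are distinct objects until they are interned. Interning therefore only saves memory when many errors carry identical detail text; it does nothing for a single very large string.

```java
// Illustrative only: shows that intern() collapses equal strings into one instance.
final class InternDemo {

    /** Builds "code, details" the same way the constructor above does. */
    static String build(String code, String details) {
        StringBuilder sb = new StringBuilder(code.length() + details.length() + 2);
        sb.append(code).append(", ").append(details);
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = build("E1", "some detail");
        String b = build("E1", "some detail");
        System.out.println("equal content:        " + a.equals(b));
        System.out.println("same object:          " + (a == b));
        System.out.println("same after intern():  " + (a.intern() == b.intern()));
    }
}
```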

Question:

I want to keep track of the y-coordinate when generating pdf. This is how I am currently doing it.

    PDRectangle mediabox = page.getMediaBox();
    float margin = 15;
    float y = mediabox.getUpperRightY() - margin;
    float fontSize = 10f;
    PDType1Font font = PDType1Font.HELVETICA;

    contentStream.showText("Hello");
    y = y - fontSize;           //decrease y-coordinate
    contentStream.newLine();    //go to new line
    contentStream.showText("World!");
    y = y - fontSize;           //decrease y-coordinate

What is the height of new line so that I can precisely keep track of the y-coordinate?

I need something like this.

    contentStream.showText("Hello");
    y = y - fontSize;           //decrease y-coordinate
    contentStream.newLine();    //go to new line
    y = y - newLineSize;        // <---- need the height of the new line
    contentStream.showText("World!");
    y = y - fontSize;           //decrease y-coordinate

Thank you.


Answer:

The operator written by newLine() starts a new line by taking the start of the current line and subtracting the leading from its y coordinate. The leading is a value you set yourself with setLeading, so that is the amount to subtract from y for each newLine() call.
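In other words, the bookkeeping only has to mirror the leading. A minimal sketch in plain Java (LineTracker is a hypothetical helper, not a PDFBox class), assuming you pass the same value to setLeading on the content stream and call the tracker's newLine() each time you call the content stream's newLine():

```java
// Illustrative only: tracks the baseline y the way the leading moves it.
final class LineTracker {
    private final float leading;
    private float y;

    LineTracker(float startY, float leading) {
        this.y = startY;
        this.leading = leading;
    }

    /** Mirrors a newLine() call: the baseline moves down by exactly the leading. */
    float newLine() {
        y -= leading;
        return y;
    }

    /** Current baseline; showing text on a line does not change it. */
    float y() {
        return y;
    }

    public static void main(String[] args) {
        LineTracker tracker = new LineTracker(777f, 12f); // e.g. page top 792 minus margin 15
        tracker.newLine();
        tracker.newLine();
        System.out.println("y after two new lines: " + tracker.y());
    }
}
```

Note that the extra "y = y - fontSize" after each showText call in the question would double count: showing text moves the position horizontally, not vertically.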

Question:

I can see only 4 fonts with variants in PDType1Font. Is there any way I can use other / custom fonts?

PDType1Font fonts

    public static final PDType1Font TIMES_ROMAN = new PDType1Font("Times-Roman");
    public static final PDType1Font TIMES_BOLD = new PDType1Font("Times-Bold");
    public static final PDType1Font TIMES_ITALIC = new PDType1Font("Times-Italic");
    public static final PDType1Font TIMES_BOLD_ITALIC = new PDType1Font("Times-BoldItalic");
    public static final PDType1Font HELVETICA = new PDType1Font("Helvetica");
    public static final PDType1Font HELVETICA_BOLD = new PDType1Font("Helvetica-Bold");
    public static final PDType1Font HELVETICA_OBLIQUE = new PDType1Font("Helvetica-Oblique");
    public static final PDType1Font HELVETICA_BOLD_OBLIQUE = new PDType1Font("Helvetica-BoldOblique");
    public static final PDType1Font COURIER = new PDType1Font("Courier");
    public static final PDType1Font COURIER_BOLD = new PDType1Font("Courier-Bold");
    public static final PDType1Font COURIER_OBLIQUE = new PDType1Font("Courier-Oblique");
    public static final PDType1Font COURIER_BOLD_OBLIQUE = new PDType1Font("Courier-BoldOblique");
    public static final PDType1Font SYMBOL = new PDType1Font("Symbol");
    public static final PDType1Font ZAPF_DINGBATS = new PDType1Font("ZapfDingbats");

Answer:

You can load TrueType fonts like this in Apache PDFBox 2.0.*:

PDType0Font font = PDType0Font.load(document, new File("c:/windows/fonts/simhei.ttf"));

See also the API documentation and the EmbeddedFonts.java example.

Question:

I have a simple java console application. pdfbox is utilized to extract text from PDF files. But there is continuous info printed in console:

November 29, 2017 9:28:27 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 14 (145) in font GGNHDZ+SimSun
November 29, 2017 9:28:27 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 28 (249) in font LNKLJH+SimSun
November 29, 2017 9:28:27 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode

I really want to remove this information from the console. And I use logback for logging, the logback.xml is just like:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<logger name="org.apache.pdfbox" level="ERROR"/>
<timestamp key="timestamp-by-second" datePattern="yyyyMMdd'T'HHmmss"/>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- the encoder defaults to PatternLayoutEncoder -->
    <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>logs/test-${timestamp-by-second}.log</file>
    <append>true</append>
    <encoder>
        <pattern>%-4relative [%thread] %-5level %logger{35} - %msg%n
        </pattern>
    </encoder>
</appender>
<root level="ERROR">
    <appender-ref ref="FILE" />
    <appender-ref ref="STDOUT" />
</root>
</configuration>

I found some answers saying that I should change the level, so I changed it to ERROR, but it still doesn't work. I doubt whether this output has anything to do with logback.xml, because when I remove STDOUT, the PDFBox warnings are still printed to the console.

Does anybody know what is going on? Thank you in advance.


Answer:

If the logging was being emitted by Logback then the approach you have tried, for example ...

  • Adding <logger name="org.apache.pdfbox" level="ERROR"/>
  • Removing the STDOUT appender

... would work.

However, PDFBox doesn't use Logback; it uses Apache Commons Logging (http://commons.apache.org/logging/). There are several ways of disabling Commons Logging:

  • Disable Commons Logging entirely by adding the following to your Main class' static initialiser block. This must be executed before PDFBox creates a Log instance:

    static {
        System.setProperty("org.apache.commons.logging.Log",
                     "org.apache.commons.logging.impl.NoOpLog");
    }
    
  • Disable Commons Logging by passing the following JVM parameter when you start your application:

    -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.NoOpLog
    
  • Disable Commons Logging for the PDFBox namespace by adding the following to your Main class' static initialiser block. This must be executed before PDFBox creates a Log instance (note: you could alternatively use Level.SEVERE, depending on how much tolerance you have for PDFBox's log output):

    java.util.logging.Logger.getLogger("org.apache.pdfbox")
        .setLevel(java.util.logging.Level.OFF);
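One caveat with the java.util.logging variant: the JDK's LogManager only keeps weak references to loggers, so a logger configured through a temporary reference can be garbage-collected and later recreated with default settings. A hedged sketch that keeps a strong reference (QuietPdfBox is an invented class name; this assumes Commons Logging is delegating to java.util.logging, as the format of the console output in the question suggests):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: silences the org.apache.pdfbox logger tree in java.util.logging.
final class QuietPdfBox {

    // Strong reference so the configured logger cannot be garbage-collected.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Call once at startup, before any PDFBox class logs. */
    static void silence() {
        PDFBOX_LOGGER.setLevel(Level.OFF);
    }

    static Logger logger() {
        return PDFBOX_LOGGER;
    }

    public static void main(String[] args) {
        silence();
        Logger child = Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont");
        // Child loggers without their own level inherit OFF from the parent.
        System.out.println("WARNING loggable: " + child.isLoggable(Level.WARNING));
    }
}
```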
    

Question:

I have been trying to draw a triangle in a PDF with Apache PDFBox, using the PDShadingType4 class. Below is my implementation, but it only produces an empty PDF. I didn't find any use of PDShadingType4 in the examples provided by Apache.

The generated triangle should look like the triangle at the bottom left of the PDF at link, which comes from an Apache PDFBox issue.

I am not able to find any shading example using PDShadingType4.

Is the implementation below correct, or is there some other way to achieve triangular shading using PDShadingType4?


    import java.io.IOException;
    import org.apache.pdfbox.cos.COSArray;
    import org.apache.pdfbox.cos.COSFloat;
    import org.apache.pdfbox.cos.COSInteger;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.common.function.PDFunctionType2;
    import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
    import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
    import org.apache.pdfbox.pdmodel.graphics.shading.PDShadingType4;

    public class TriangleGraident2 {

        public void create(String file) throws IOException {
            PDDocument document = null;
            try {
                document = new PDDocument();
                PDPage page = new PDPage();
                document.addPage(page);

                PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, false);

                contentStream.moveTo(38, 17);

                COSStream fdict = new COSStream();
                fdict.setInt(COSName.FUNCTION_TYPE, 2);

                COSArray cosArray = new COSArray();
                cosArray.add(COSInteger.get(104));
                cosArray.add(COSInteger.get(83));
                cosArray.add(COSInteger.get(170));
                cosArray.add(COSInteger.get(17));
                cosArray.add(COSInteger.get(38));
                cosArray.add(COSInteger.get(17));


                /*Setting color */
                COSArray c0 = new COSArray();
                c0.add(COSFloat.get("1"));
                c0.add(COSFloat.get("0"));
                c0.add(COSFloat.get("0"));
                COSArray c1 = new COSArray();
                c1.add(COSFloat.get("0.5"));
                c1.add(COSFloat.get("1"));
                c1.add(COSFloat.get("0.5"));
                /*Setting color*/


                COSArray decode = new COSArray();
                decode.add(COSFloat.get("0.0"));
                decode.add(COSFloat.get("1.0"));
                decode.add(COSFloat.get("0.0"));
                decode.add(COSFloat.get("1.0"));
                decode.add(COSFloat.get("0.0"));

                fdict.setItem(COSName.C0, c0);
                fdict.setItem(COSName.C1, c1);

                PDFunctionType2 func = new PDFunctionType2(fdict);
                PDShadingType4 shading = new PDShadingType4(fdict);
                shading.setColorSpace(PDDeviceRGB.INSTANCE);
                shading.setShadingType(PDShading.SHADING_TYPE4);

                shading.getCOSObject().setInt(COSName.LENGTH, 32);

                shading.setBitsPerCoordinate(24);
                shading.setBitsPerComponent(16);
                shading.setBitsPerFlag(8);
                shading.getCOSObject().setItem(COSName.COORDS, cosArray);
                shading.setDecodeValues(decode);
                shading.setFunction(func);
                contentStream.shadingFill(shading);
                contentStream.close();
                document.save(file);
                document.close();

            }
            finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            TriangleGraident2 creator = new TriangleGraident2();
            creator.create("C:\\Users\\abc\\Desktop\\triangle_image.pdf");
        }
    }


Answer:

This code creates a Gouraud shaded triangle on the bottom left:

// See PDF 32000 specification,
// 8.7.4.5.5 Type 4 Shadings (Free-Form Gouraud-Shaded Triangle Meshes)
PDShadingType4 gouraudShading = new PDShadingType4(new COSStream());
gouraudShading.setShadingType(PDShading.SHADING_TYPE4);
// we use multiple of 8, so that no padding is needed
gouraudShading.setBitsPerFlag(8);
gouraudShading.setBitsPerCoordinate(16);
gouraudShading.setBitsPerComponent(8);
COSArray decodeArray = new COSArray();
// coordinates x y map 16 bits 0..FFFF to 0..FFFF to make your life easy
// so no calculation is needed, but you can only use integer coordinates
// for real numbers, you'll need smaller bounds, e.g. 0xFFFF / 0xA = 0x1999
// would allow 1 point decimal result coordinate.
// See in PDF specification: 8.9.5.2 Decode Arrays
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.get(0xFFFF));
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.get(0xFFFF));
// colors r g b map 8 bits from 0..FF to 0..1
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
gouraudShading.setDecodeValues(decodeArray);
gouraudShading.setColorSpace(PDDeviceRGB.INSTANCE);

// Function is not required for type 4 shadings and not really useful, 
// because if a function would be used, each edge "color" of a triangle would be one value, 
// which would then transformed into n color components by the function so it is 
// difficult to get 3 "extremes".

OutputStream os = ((COSStream) gouraudShading.getCOSObject()).createOutputStream();
MemoryCacheImageOutputStream mcos = new MemoryCacheImageOutputStream(os);

// Vertex 1, starts with flag1
// (flags always 0 for vertices of start triangle)
mcos.writeByte(0);
// x1 y1 (left corner)
mcos.writeShort(0);
mcos.writeShort(0);
// r1 g1 b1 (red)
mcos.writeByte(0xFF);
mcos.writeByte(0);
mcos.writeByte(0);

// Vertex 2, starts with flag2
mcos.writeByte(0);
// x2 y2 (top corner)
mcos.writeShort(100);
mcos.writeShort(100);
// r2 g2 b2 (green)
mcos.writeByte(0);
mcos.writeByte(0xFF);
mcos.writeByte(0);

// Vertex 3, starts with flag3
mcos.writeByte(0);
// x3 y3 (right corner)
mcos.writeShort(200);
mcos.writeShort(0);
// r3 g3 b3 (blue)
mcos.writeByte(0);
mcos.writeByte(0);
mcos.writeByte(0xFF);

mcos.close();
// outside stream MUST be closed as well, see javadoc of MemoryCacheImageOutputStream
os.close();

To run the shading, call

contentStream.shadingFill(gouraudShading);

Here's a different decode array, similar to the one from the example PDF you link to, although I used only 16 bits instead of 24:

COSArray decodeArray = new COSArray();
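// coordinates x y map 16 bits 0..FFFF to -16384..16384
// this means that 0x8000 maps to (approximately) 0
// some other useful values (these match a US Letter page, 612 x 792, rather than A4)
//  - 0x862C maps to about 790, near the top of the page
//  - 0x84C4 maps to about 610, near the right edge of the page
//  - 0x8262 maps to about 305, near the horizontal middle of the page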
decodeArray.add(COSInteger.get(-16384));
decodeArray.add(COSInteger.get(16384));
decodeArray.add(COSInteger.get(-16384));
decodeArray.add(COSInteger.get(16384));
// colors r g b map 8 bits from 0..FF to 0..1
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
decodeArray.add(COSInteger.ZERO);
decodeArray.add(COSInteger.ONE);
gouraudShading.setDecodeValues(decodeArray);

The coordinates for the triangle would then be 0x8000 0x8000, 0x8100 0x8100, 0x8200 0x8000.
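
To double-check such a decode array, here is a small sketch of the linear mapping that the PDF specification defines for Decode entries: an n-bit sample v is mapped to Dmin + v * (Dmax - Dmin) / (2^n - 1). The class and method names are invented for this illustration and are not part of PDFBox.

```java
final class DecodeDemo {
    // Maps an n-bit sample to the range [dMin, dMax], the way a shading Decode entry does.
    static double decode(int sample, int bits, double dMin, double dMax) {
        double maxSample = Math.pow(2, bits) - 1;
        return dMin + sample * (dMax - dMin) / maxSample;
    }

    public static void main(String[] args) {
        // the three x coordinates of the triangle mentioned above
        int[] samples = {0x8000, 0x8100, 0x8200};
        for (int s : samples) {
            System.out.println(Integer.toHexString(s) + " -> " + decode(s, 16, -16384, 16384));
        }
    }
}
```

With the 16 bit array above, 0x8000, 0x8100 and 0x8200 come out at roughly 0.25, 128.25 and 256.25, so the triangle is about 256 user space units wide.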

Question:

I'm trying to send PDF documents for parsing using ExtractingRequestHandler. (Specifically I'm using SolrNet but I don't think it is related to this problem).

However, for every PDF file I send I get the following warning in the log (from Solr Admin):

According to what I have researched, this happens with certain PDFs when read by PDFBox. I found a similar bug report here which says to change the pushback size. The problem is I'm using Solr 5.2.1 and have not been able to figure out how to configure this setting. Is there a way to configure Solr so I can index these files?


Answer:

Your PDFs are broken. A PDF stream object looks like this:

4 0 obj
<<
/Length 34841
>>
stream
... content (which should have a length of 34841 bytes) ...
endstream
endobj

So if "endstream" doesn't appear at the expected offset, you get the message described. It means PDFBox falls back to a "Plan B"; if no further message is displayed, the PDF is still processed. All you can do is ask the creator of the PDF to work cleanly, i.e. to calculate the stream lengths properly, and to avoid opening a PDF file in a "cheap" text editor and saving it.

The issue PDFBOX-2381 describes a different error, which is that the pushback buffer is too small.
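
To illustrate the length check described above, here is a rough sketch in plain Java (no PDFBox). It assumes a single, unencrypted object whose /Length is a direct number rather than an indirect reference; the class and method names are made up for this example, and PDFBox's own parser is considerably more tolerant than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamLengthCheck {
    private static final Pattern LENGTH = Pattern.compile("/Length\\s+(\\d+)");

    // Returns true if "endstream" (optionally preceded by an end-of-line marker)
    // is found exactly where the declared /Length says the stream data ends.
    static boolean endstreamAtDeclaredOffset(byte[] object) {
        // ISO-8859-1 maps every byte to one char, so string indexes equal byte offsets
        String text = new String(object, StandardCharsets.ISO_8859_1);
        Matcher m = LENGTH.matcher(text);
        int keyword = text.indexOf("stream");
        if (!m.find() || keyword < 0) {
            return false;
        }
        int length = Integer.parseInt(m.group(1));
        int dataStart = keyword + "stream".length();
        // the keyword is followed by CRLF or LF, which is not part of the data
        if (text.startsWith("\r\n", dataStart)) {
            dataStart += 2;
        } else if (text.startsWith("\n", dataStart)) {
            dataStart += 1;
        }
        int dataEnd = dataStart + length;
        if (dataEnd > text.length()) {
            return false;
        }
        String rest = text.substring(dataEnd);
        return rest.startsWith("endstream") || rest.startsWith("\nendstream")
                || rest.startsWith("\r\nendstream") || rest.startsWith("\rendstream");
    }

    public static void main(String[] args) {
        String ok = "4 0 obj\n<<\n/Length 5\n>>\nstream\nHello\nendstream\nendobj\n";
        String broken = ok.replace("/Length 5", "/Length 9");
        System.out.println(endstreamAtDeclaredOffset(ok.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(endstreamAtDeclaredOffset(broken.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The second object is the kind of file that triggers the warning: the data is intact, but the declared length points past the real end of the stream.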

Question:

I use Apache PDFBox version 2.0.0 in my Java code (Java 1.6). I'm trying to figure out how I can read, replace and save back to my PDF the data from

<stream> data here... <endstream> ?

My pdf file looks like:

596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ­[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н"іXPJГ ...
endstream
endobj

I've found a way to decode this stream: I used the "WriteDecodedDoc" command from pdfbox-app-1.8.10.jar. So now I have two variants of the file, but I have no idea how to work with this stream. The stream contains the footer and header, where images and text are placed.

I checked my file with the PDFTextStripper class. It can see the necessary data in the streams, but I can't use this class to replace data and save it back to the PDF file.

I tried to replace the text by simply opening the file as text, searching for the text, replacing it only in the stream and saving. But then I get the problem "Cannot extract the embedded font...". The main reason is that I lose the encoding. I tried to change the encoding but it didn't help.

BTW I can't use iText. I should use free libs here.

Thanks for any solution.

Edit:

After decoding, the stream looks like this:

stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
endstream

I need to replace a link inside the stream with a different link. This one:

[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ

EDIT 2 code

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // COSBase cosb = document.getDocument().getObjects().get(27);
            // e.g. this object contains <stream> bytecode <endstream> in the PDF file.
            // it looks that
            // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList size 10 has a data that I can't open and work
            // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all PDF's tag but NO a stream section

            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test2.test2.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.linkhouldbehere.com")) {
                                // some magic here to remove all indents and show new link from beginning.
                                // no rules. Just for test and it works here
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding of date from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    newTokens.add(token);
                }

                // save replaced content inside a page
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();
                page.setContents(newContents);

                // replace all links that have a pop-up line
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }
            // save file
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

EDIT 3.

The PDF contains the object 660 0 obj, which has the link I need inside:

660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC 
BT
/CS0 cs 0.412 0.416 0.423  scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
EMC 
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(\251 2015 )29(CCH Incorporated and its af\037liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC 

endstream

I found only one place in the PDF file that references it, in 45 0 obj:

/XObject <<
    /Fm0 660 0 R
    /Fm1 661 0 R
>>

The full text of that object:

45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj

The question is: can I get this 660 0 obj and process it with PDFBox? It looks like PDFStreamParser doesn't know anything about this 660 0 object. Thank you.


Answer:

This is for PDFBox 2.0.0-SNAPSHOT. The code below works fine for me for replacing links.

Thanks a lot to Tilman Hausherr for his help.

String filePath = "d:\\pdf\\file1.pdf";

...

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            // Decrypt a document
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // replace all links in a footer and a header in XObjects with /ProcSet [/PDF /Text]
            // Note: these forms (and pattern objects too!) can have resources,
            // i.e. have Form XObjects or patterns again.
            // If so you need to use a recursion
            for (int pageNum = 0; pageNum < document.getPages().getCount(); pageNum++) {
                List<Object> newPdxTokens = new ArrayList<Object>();
                // Get all XObjects from the page
                Iterable<COSName> xobjs = document.getPage(pageNum).getResources().getXObjectNames();
                for (COSName xobj : xobjs) {
                    boolean isHasTextStream = false;
                    PDXObject pdxObject = document.getPage(pageNum).getResources().getXObject(xobj);
                    // If a stream has not '/ProcSet [/PDF /Text]' line inside it has to be skipped
                    // isXobjectHasTextFieldInPdf has a recursion
                    if (pdxObject.getCOSObject() instanceof COSDictionary) {
                        isHasTextStream = isXobjectHasTextFieldInPdf((COSDictionary) pdxObject.getCOSObject());
                    }

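                    if (pdxObject instanceof PDFormXObject && isHasTextStream) {
                        // start with an empty token list for each form XObject, otherwise the
                        // tokens of previously processed forms would be written into this one too
                        newPdxTokens.clear();
                        // Set stream from pdxObject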
                        PDStream stream = pdxObject.getStream();
                        PDFStreamParser streamParser = new PDFStreamParser(stream.toByteArray());
                        streamParser.parse();
                        for (Object token : streamParser.getTokens()) {
                            if (token instanceof Operator) {
                                Operator op = (Operator) token;
                                if (op.getName().equals("Tj")) {
                                    // Tj contains 1 COSString
                                    COSString previous = (COSString) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = previous.getString();
                                    // here can be any filters for checking a necessary string
                                    if (string.equals("www.testlink.com")) {
                                        // Tj takes a single string operand, so replace it with a COSString
                                        newPdxTokens.set(newPdxTokens.size() - 1, new COSString("test.test.com"));
                                    }
                                } else if (op.getName().equals("TJ")) {
                                    // TJ contains a COSArray with COSStrings and COSFloat (padding)
                                    COSArray previous = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = "";
                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement instanceof COSString) {
                                            COSString cosString = (COSString) arrElement;
                                            String content = cosString.getString();
                                            string += content;
                                        }
                                    }
                                    // here can be any filters for checking a necessary string
                                    // check if string contains a necessary link
                                    if (string.equals("www.testlink.com")) {
                                        COSArray newLink = new COSArray();
                                        newLink.add(new COSString("test.test.com"));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    } else if (string.startsWith("www.testlink.com")) {
                                        // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                        COSArray newLink = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                        int size = newLink.size();
                                        float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                        for (int i = 0; i < size - 4; i++) {
                                            newLink.remove(0);
                                        }
                                        newLink.set(0, new COSString("test.test.com"));
                                        // number for indenting from right place. Should be checked.
                                        newLink.set(1, new COSFloat(f - 8000));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    }
                                }
                            }
                            // save tokens to a temporary List
                            newPdxTokens.add(token);
                        }
                        // save the replaced data back to the stream
                        OutputStream out = stream.createOutputStream();
                        ContentStreamWriter writer = new ContentStreamWriter(out);
                        writer.writeTokens(newPdxTokens);
                        out.close();
                    }
                }
            }

            // replace data from any text stream from pdf. XObjects not included.
            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                // Get all tokens from the page
                List<Object> tokens = parser.getTokens();
                // Create a temporary List
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                // Tj takes a single string operand, so replace it with a COSString
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.testlink.com")) {
                                // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    // save tokens to a temporary List
                    newTokens.add(token);
                }
                // save the replaced data back to the document's stream
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();

                // save content
                page.setContents(newContents);

                // replace all links that have a pop-up line (It does not affect the visible text)
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }

            // save document
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

An extra method that processes only text streams and skips image streams. It is called from the main method "replaceLinksInPdf(String filePath)":

        // Check if COSDictionary has '/ProcSet [/PDF /Text]' string in the stream
        private static boolean isXobjectHasTextFieldInPdf(COSDictionary dictionary) {
            boolean isHasTextField = false;
            for (COSBase cosBase : dictionary.getValues()) {
                // go to a recursion because COSDictionary can have COSDictionaries inside
                if (cosBase instanceof COSDictionary) {
                    COSDictionary cosDictionaryNew = (COSDictionary) cosBase;
                    // check if '/ProcSet' has '/Text' param
                    if (cosDictionaryNew.containsKey(COSName.PROC_SET)) {
                        COSBase procSet = cosDictionaryNew.getDictionaryObject(COSName.PROC_SET);
                        if (procSet instanceof COSArray) {
                            for (COSBase procSetIterator : ((COSArray) procSet)) {
                                if (procSetIterator instanceof COSName
                                        && ((COSName) procSetIterator).getName().equals("Text")) {
                                    return true;
                                }
                            }
                        } else if (procSet instanceof COSString && ((COSString) procSet).getString().equals("Text")) {
                            return true;
                        }
                    }
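                    // go to the COSDictionary children; return as soon as a match is found,
                    // so that a later sibling dictionary cannot overwrite the result
                    if (isXobjectHasTextFieldInPdf(cosDictionaryNew)) {
                        return true;
                    }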
                }
            }
            return isHasTextField;
        }

This is just a test version for my project; I will refactor the code according to the project's rules. You should change the replacements as you need. Also, I have been using PDFBox 2.0.0 for only about a week, so there may well be an easier way to do some of this. Feel free to review the code and post a better variant. Thanks.

P.S. I've tested it on 40 PDFs, and only 2 of them needed the deeper processing with recursion. All 40 files can be opened, are readable and look like the previous version except for the links.
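
A note on why the TJ branch concatenates the pieces before comparing: a TJ operand is an array of string fragments separated by kerning numbers, so the visible text never appears in one piece in the content stream. The following plain Java sketch (no PDFBox; the class name is invented, and escaped or nested parentheses inside strings are deliberately not handled) joins the fragments of such an operand:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TjTextDemo {
    // matches one literal string "(...)" that contains no parentheses or backslashes
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    // Concatenates all string fragments of a TJ array, ignoring the kerning numbers.
    static String joinFragments(String tjOperand) {
        StringBuilder sb = new StringBuilder();
        Matcher m = LITERAL.matcher(tjOperand);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String operand = "[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]";
        System.out.println(joinFragments(operand));
    }
}
```

For the footer line from the question this yields www.linkshouldbehereninechars.com, which is the kind of joined string the equals() and startsWith() checks above are applied to.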

Question:

I'm trying to convert an SVG file into a PDF for embedding into another PDF document. I'm using the batik transcoder, passing in the bytes for the SVG and getting the data for the PDF back.

My main PDF document and the SVG file passed into the transcoder both have dimensions of:

width="602.8" height="763.8"

The output PDF file generated from the SVG is smaller, however. Because of this, when embedded into our main document, the generated SVG PDF doesn't take up all the available space in our main PDF as I would expect. How can I force the output PDF to have the same dimensions as the main document / input SVG?


Answer:

So after some further research I came to a solution. We're using PDFBox as our PDF manipulation tool, which uses 72 DPI by default for documents.

Batik, on the other hand, uses 96 DPI when transcoding an SVG to a PDF file. This makes the output file smaller than the main PDFBox-generated document. To make Batik match the PDFBox default, we must change the pixel-to-millimetre conversion from 96 dpi to 72 dpi.

We can add a transcoding hint to our PDFTranscoder as follows:

transcoder.addTranscodingHint(PDFTranscoder.KEY_PIXEL_UNIT_TO_MILLIMETER,
(25.4f / 72f));

Here (25.4f / 72f) is the size of one pixel in millimetres at 72 dpi. This replaces the default of 96 dpi (25.4f / 96f).
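
To see how much the two resolutions differ for the dimensions in the question, here is a small arithmetic sketch in plain Java. It does not call Batik or PDFBox, and the class and method names are made up for the illustration.

```java
final class DpiDemo {
    static final float MM_PER_INCH = 25.4f;
    static final float POINTS_PER_INCH = 72f;

    // size of one pixel in millimetres at the given resolution
    static float pixelToMillimeter(float dpi) {
        return MM_PER_INCH / dpi;
    }

    // converts a length in pixels to PDF points (1 pt = 1/72 inch)
    static float pixelsToPoints(float pixels, float dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        float width = 602.8f; // SVG width from the question
        System.out.println("mm per pixel at 96 dpi: " + pixelToMillimeter(96f));
        System.out.println("mm per pixel at 72 dpi: " + pixelToMillimeter(72f));
        System.out.println("width in points at 96 dpi: " + pixelsToPoints(width, 96f));
        System.out.println("width in points at 72 dpi: " + pixelsToPoints(width, 72f));
    }
}
```

At 96 dpi the 602.8 pixel wide SVG ends up about 452.1 points wide, i.e. three quarters of the expected size; at 72 dpi one pixel equals one point and the width stays at 602.8.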

Question:

I've already obtained the bookmarks, but I need to know where these bookmarks point to in the PDF (Bookmark 1 = page 1, ..., Bookmark 54 = page 72 etc.). Can anyone help me? Thanks for the support.

PDDocument doc = PDDocument.load( ... );
PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
PDOutlineItem item = root.getFirstChild();
while( item != null )
{
    System.out.println( "Item:" + item.getTitle() );
    item = item.getNextSibling();
}

Answer:

Excerpt from the PrintBookmarks.java example from the source code download:

if (item.getDestination() instanceof PDPageDestination)
{
    PDPageDestination pd = (PDPageDestination) item.getDestination();
    System.out.println("Destination page: " + (pd.retrievePageNumber() + 1));
}
else if (item.getDestination() instanceof PDNamedDestination)
{
    PDPageDestination pd = document.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) item.getDestination());
    if (pd != null)
    {
        System.out.println("Destination page: " + (pd.retrievePageNumber() + 1));
    }
}

if (item.getAction() instanceof PDActionGoTo)
{
    PDActionGoTo gta = (PDActionGoTo) item.getAction();
    if (gta.getDestination() instanceof PDPageDestination)
    {
        PDPageDestination pd = (PDPageDestination) gta.getDestination();
        System.out.println("Destination page: " + (pd.retrievePageNumber() + 1));
    }
    else if (gta.getDestination() instanceof PDNamedDestination)
    {
        PDPageDestination pd = document.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) gta.getDestination());
        if (pd != null)
        {
            System.out.println("Destination page: " + (pd.retrievePageNumber() + 1));
        }
    }
}

Question:

Just wondering: is there any way to name the document after loading it from a template?

PDDocument doc = PDDocument.load(play.Play.application().resource("/templates/" + FileName));

ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
doc.save(byteArrayOutputStream);
doc.close();

When you download the file rendered by PDFBox, the name of the PDF file cannot be specified. Is there any other way to do it?


Answer:

I am not familiar with Play framework.

If you want to let users download the file and give it a filename, then you need to set the HTTP header

Content-Disposition: attachment; filename=myfile.pdf

When the browser sees this header, the user will get a dialog box to save the file, and the browser will suggest myfile.pdf as the name.
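
As a framework-independent sketch, the header value can be built like this. The helper below is hypothetical (it is not part of Play or PDFBox), quotes the file name and only handles plain ASCII names; pass the result to whatever your framework offers for setting response headers.

```java
final class ContentDispositionDemo {
    // Builds an "attachment" header value with a quoted file name.
    // Backslashes and double quotes in the name are escaped; non-ASCII names are not handled.
    static String attachment(String fileName) {
        String escaped = fileName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "attachment; filename=\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println("Content-Disposition: " + attachment("myfile.pdf"));
    }
}
```

Quoting the name keeps file names with spaces in one piece, which the unquoted form shown above does not guarantee.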

Question:


Answer:

PDFBox is going to be a great Java library for PDFs. Currently the latest version is not stable yet, but it provides great solutions. The documentation is cute, so if you want to do easy stuff, you won't waste too much time learning.

Question:

I'm using either pdfbox-app-2.0.18.jar or pdfbox-app-2.0.17.jar.

org.apache.pdfbox.Loader is not found....

I was trying to follow the example here.

And tried to write the below code :

try (FileOutputStream fos = new FileOutputStream(signedFile);
                PDDocument doc = Loader.loadPDF(inputFile))
        {

But this is not working. Why?


Answer:

The Loader class was added on January 25, 2020 (see the SVN log).

It's not part of version 2.0.18, as it is not in this file: pdfbox-2.0.18-src.zip

So this class is simply too new, and that's why you cannot use it. With the 2.0.x versions, load the document with PDDocument.load(inputFile) instead.

Question:

I've followed this example, Create landscape PDF, and it works fine. I would now like to move the 0,0 reference from the lower left corner to the top left corner. To do that I need to change contentStream.transform(new Matrix(0, 1, -1, 0, pageWidth, 0));. I've had a look at the documentation for the PDFBox Matrix class, which specifies the arguments for Matrix as below.

public Matrix(float a,
      float b,
      float c,
      float d,
      float e,
      float f)
Creates a matrix with the given 6 elements.

But it doesn't tell me what the 6 different arguments/elements do. I guess one has do with rotation and two for moving the reference in X and Y direction. Where can I find a document that describes the arguments?


Answer:

Where can I find a document that describes the arguments?

The document to look for is the PDF specification (ISO 32000-1) in combination with some Linear Algebra 101.

A transformation matrix in PDF shall be specified by six numbers, usually in the form of an array containing six elements. In its most general form, this array is denoted [a b c d e f]; it can represent any linear transformation from one coordinate system to another.

(section 8.3.3 - Common Transformations)

The meaning is explained shortly thereafter:

PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below.

The transformation between two coordinate systems can be represented by a 3-by-3 transformation matrix written as follows:

    | a  b  0 |
    | c  d  0 |
    | e  f  1 |

Because a transformation matrix has only six elements that can be changed, in most cases in PDF it shall be specified as the six-element array [a b c d e f].

Coordinate transformations shall be expressed as matrix multiplications:

                            | a b 0 |
    [x' y' 1] = [x y 1]  ×  | c d 0 |
                            | e f 1 |

(section 8.3.4 - Transformation Matrices)

Thus, when a transformation [a b c d e f] is currently set and you draw something using coordinates (x, y), it will appear at coordinates (x', y') where

    x' = a × x + c × y + e
    y' = b × x + d × y + f

Commonly used transformation types are:

  • Translations shall be specified as [1 0 0 1 tx ty], where tx and ty shall be the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.

  • Scaling shall be obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.

  • Rotations shall be produced by [cos(q) sin(q) -sin(q) cos(q) 0 0], which has the effect of rotating the coordinate system axes by an angle q counter clockwise.

  • Skew shall be specified by [1 tan(a) tan(b) 1 0 0], which skews the x axis by an angle a and the y axis by an angle b.

(section 8.3.3 - Common Transformations)

If you want a combined transformation, simply multiply the matrices in the appropriate order.
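
To make the arithmetic above concrete, here is a minimal plain-Java sketch. It does not use any PDFBox classes; the class and method names are made up for illustration, and the page size of 612 x 792 units (US Letter) is an assumption. It applies a six-element matrix to a point, tries a y-axis flip [1 0 0 -1 0 pageHeight] as one candidate for a top-left origin, and combines two matrices by multiplying them.

```java
/** Plain-Java illustration of PDF transformation matrices given as {a, b, c, d, e, f}. */
final class PdfMatrixSketch {

    /** Applies the matrix m to the point (x, y) and returns {x', y'}. */
    static double[] apply(double[] m, double x, double y) {
        double xNew = m[0] * x + m[2] * y + m[4]; // x' = a*x + c*y + e
        double yNew = m[1] * x + m[3] * y + m[5]; // y' = b*x + d*y + f
        return new double[] {xNew, yNew};
    }

    /** Returns the matrix product first x second, i.e. 'first' is applied to a point before 'second'. */
    static double[] combine(double[] first, double[] second) {
        return new double[] {
            first[0] * second[0] + first[1] * second[2],             // a
            first[0] * second[1] + first[1] * second[3],             // b
            first[2] * second[0] + first[3] * second[2],             // c
            first[2] * second[1] + first[3] * second[3],             // d
            first[4] * second[0] + first[5] * second[2] + second[4], // e
            first[4] * second[1] + first[5] * second[3] + second[5]  // f
        };
    }

    public static void main(String[] args) {
        double pageWidth = 612;  // assumed US Letter width in default user space units
        double pageHeight = 792; // assumed US Letter height

        // The landscape matrix from the question: a 90 degree rotation plus a shift by pageWidth.
        double[] landscape = {0, 1, -1, 0, pageWidth, 0};
        double[] p = apply(landscape, 10, 20);
        System.out.println(p[0] + ", " + p[1]); // prints "592.0, 10.0"

        // Candidate for "y grows downwards": mirror the y axis and shift by the page height.
        double[] flipY = {1, 0, 0, -1, 0, pageHeight};
        double[] q = apply(flipY, 10, 20);
        System.out.println(q[0] + ", " + q[1]); // prints "10.0, 772.0"

        // Combined transformation: translate by (5, 7) first, then scale by (2, 3).
        double[] both = combine(new double[] {1, 0, 0, 1, 5, 7}, new double[] {2, 0, 0, 3, 0, 0});
        double[] r = apply(both, 1, 1);
        System.out.println(r[0] + ", " + r[1]); // prints "12.0, 24.0"
    }
}
```

Keep in mind that a flip like this mirrors everything drawn afterwards, text included, so it is mostly useful for working out positions rather than as a drop-in replacement for the landscape matrix.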

Question:

When I use the following dependency, version 1.8.6 or 1.8.7, I cannot find the class PDFieldTreeNode.

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>1.8.6</version>
    </dependency>

I have checked the jar I downloaded, and the sources jar as well, but the file is not present.

But when I change to <version>2.0.0-SNAPSHOT</version>, I am able to get the file.

What is the issue with the other versions? Also, this is not a newly added file.


Answer:

According to the PDFBox JavaDocs, there is no class named PDFieldTreeNode:

http://pdfbox.apache.org/docs/1.8.6/javadocs/

http://pdfbox.apache.org/docs/1.8.5/javadocs/

Regarding your comment, check the SVN repository: over the last few releases this file has been modified along with other files, but it was not added as a new file.

Version 1.8.6 was released on Jun 22, 2014.

The first commit of the class file PDFieldTreeNode was made on Aug 16, 2014, according to the SVN log, so the class is newer than that release.

Question:

I try to extract some text out of a PDF. For that I need to define a rectangle that contains the text.

I noticed that the coordinates seem to have a different meaning when I compare the coordinates used for text extraction with the coordinates used for drawing.

package MyTest.MyTest;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
import org.apache.pdfbox.text.*;
import java.awt.*;
import java.io.*;

public class MyTest 
{   
  public static void main (String [] args) throws Exception
  { 
    PDDocument pd = PDDocument.load (new File ("my.pdf"));  
    PDFTextStripperByArea st = new PDFTextStripperByArea ();
    PDPage pg = pd.getPage (0);

    float h = pg.getMediaBox ().getHeight ();
    float w = pg.getMediaBox ().getWidth ();
    System.out.println (h + " x " + w + " in internal units");
    h = h / 72 * 2.54f * 10;
    w = w / 72 * 2.54f * 10;
    System.out.println (h + " x " + w + " in mm");



    int X = 85;
    int Y = 175;
    int dX = 250;
    int dY = 15;

    // extract some text
    st.addRegion ("a", new Rectangle (X, Y, dX, dY));
    st.extractRegions (pg);
    String text = st.getTextForRegion ("a");
    System.out.println("text="+text);


    // fill a rectangle
    PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
    contents.setNonStrokingColor (Color.RED);  
    contents.addRect (X, Y, dX, dY);
    contents.fill ();
    contents.close ();
    pd.save ("x.pdf");
  }
}

The text I extract (output of text= in the console) is not the text I overdraw with my red rectangle (generated x.pdf).

Why??

For testing, try some PDF you already have. To avoid a lot of trial and error in aiming for a rectangle with text in it, use a file with a lot of text.


Answer:

There are (at least) two issues in your approach:

Different coordinate systems

You use st.addRegion. Its JavaDoc comment tells us:

/**
 * Add a new region to group text by.
 *
 * @param regionName The name of the region.
 * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
 * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
 */
public void addRegion( String regionName, Rectangle2D rect )

(Actually the whole text extraction apparatus of PDFBox uses its own coordinate system, and there have already been many questions on Stack Overflow because of the confusion this caused.)

On the other hand contents.addRect does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the maximum crop box y coordinate to get a coordinate for addRect.

Furthermore, the region rectangles have their anchor point at the top left while the regular PDF rectangles (like the one you define with contents.addRect) have it at the bottom left. Thus, you additionally have to add or subtract the rectangle height from the y coordinate.

Actually you may have to change the x coordinate, too. It is not mirrored but there may be a shift, the PDFBox text extraction coordinate system uses x=0 for the left page border but that is not necessarily the case in PDF user space. Thus, you may have to add the left border x coordinate of the crop box to your text extraction x coordinate.
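
The conversion described above can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are invented, the crop box values are passed in as plain numbers (in PDFBox 2.x they could be read from pg.getCropBox().getLowerLeftX() and getUpperRightY()), and page rotation is ignored.

```java
/** Converts a text-extraction region (origin top left, y down) to a PDF user-space rectangle. */
final class RegionToPdfRect {

    /**
     * @param cropLowerLeftX  lower left x coordinate of the page crop box
     * @param cropUpperRightY maximum y coordinate of the page crop box
     * @return {x, y, width, height}, where (x, y) is the lower left corner as expected by addRect
     */
    static float[] toPdfRect(float regionX, float regionY, float width, float height,
                             float cropLowerLeftX, float cropUpperRightY) {
        float pdfX = regionX + cropLowerLeftX;           // x is only shifted, not mirrored
        float pdfY = cropUpperRightY - regionY - height; // mirror y, then move the anchor to the bottom edge
        return new float[] {pdfX, pdfY, width, height};
    }

    public static void main(String[] args) {
        // Region from the question; the crop box of an assumed US Letter page starting at (0, 0).
        float[] rect = toPdfRect(85, 175, 250, 15, 0, 792);
        System.out.println(rect[0] + " " + rect[1] + " " + rect[2] + " " + rect[3]);
        // prints "85.0 602.0 250.0 15.0"
    }
}
```

With these numbers, the rectangle passed to contents.addRect would start at y = 602 instead of y = 175, which is why the red rectangle in the question ends up far away from the extracted text.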

Possibly changed coordinate system

In the page content stream the coordinate system may have been changed by applying a transformation to the current transformation matrix. As a result the coordinates in the instructions you append to it may have a different meaning than even outlined above.

To rule out such an effect, you should use a different PDPageContentStream constructor with an additional boolean resetContext parameter:

/**
 * Create a new PDPage content stream.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @param resetContext Tell if the graphic context should be reset. This is only relevant when
 * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
 * appending to an existing stream, because the existing stream may have changed graphic
 * properties (e.g. scaling, rotation).
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress, boolean resetContext) throws IOException

I.e. replace

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);

by

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false, true);

Question:

I have an Excel file I'm reading in with Apache POI and getting headers, rows & cell data which I'm storing in two arrays. I am using PDF box and am trying to display all of the headers vertically with the corresponding row data.

For example: the header row contains name, email, date of birth, salary and department followed by three rows of data. Each row of data should produce its own PDF document listing the labels and values vertically down the page but for now I'm only trying to list all the values in one PDF document.

  • Name: Joe Smith
  • Email: js@something.com
  • Date of Birth: 05/06/1976
  • Salary: $100,000.00
  • Department: Sales

The problem I'm having is only the last record in my Excel sheet is printing to the PDF and the headers are mismatching the corresponding row and cell data. I believe this is due to the mismatch in the count of 5 elements in the header array and 15 elements in the cell array.

Below are some snippets of what I'm doing to produce this along with a screen shot of both my Excel file layout and the PDF that's being generated.


//Hold header values in Array
List<String> headerValues = new ArrayList<>();

//Hold cell values in Array
List<String> cellValues = new ArrayList<>();

for(int i = 0; i < headerValues.size() && i < cellValues.size(); i++){
    cont.showText(headerValues.get(i) + cellValues.get(i));
    cont.newLine();
}

Here was another way I had done this which produced the same result as the for loop

Iterator<String> h = headerValues.iterator();
Iterator<String> c = cellValues.iterator();

while(h.hasNext() && c.hasNext()){
    cont.showText(h.next() +":" + c.next());
    cont.newLine();
}

The result I'm trying to get to is print each of the header values in my excel spreadsheet vertically for each row's data in my Excel Sheet.


Answer:

Your collection structure is not optimal. At least your cell values should not all be in one list. The data structure of a default Excel table is like a database table: there are data records (the rows), which consist of fields (named by the column headers). So your cell values should be in a List<List<String>> dataRecords, where the outer List is the list of rows and the inner List is the list of field values in a row.

The column headers might be a List<String> colHeaders, but the better structure would be a TreeMap<Integer, String> colHeaders where the Integer key is the column index in the Excel sheet. That way it is clear which Excel columns in the sheet actually contain the data fields.

Let's have a complete example:

import java.io.FileInputStream;

import org.apache.poi.ss.usermodel.*;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.pdmodel.common.*;

import java.util.Map;
import java.util.TreeMap;
import java.util.List;
import java.util.ArrayList;

class GetDataFromExcel {

 public static void main(String[] args) throws Exception {

  Workbook workbook = WorkbookFactory.create(new FileInputStream("ExcelExample.xlsx"));

  DataFormatter dataFormatter = new DataFormatter();
  FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();

  Sheet sheet = workbook.getSheetAt(0);

  int headerRowNum = sheet.getFirstRowNum();

  // collecting the column headers
  TreeMap<Integer, String> colHeaders = new TreeMap<Integer, String>();
  Row row = sheet.getRow(headerRowNum);
  for (Cell cell : row) {
   int colIdx = cell.getColumnIndex();
   String value = dataFormatter.formatCellValue(cell, formulaEvaluator);
   colHeaders.put(colIdx, value);
  }

  System.out.println(colHeaders);

  // collecting the data records
  List<List<String>> dataRecords = new ArrayList<List<String>>();
  for (int r = headerRowNum + 1; r <= sheet.getLastRowNum(); r++) {
   row = sheet.getRow(r); if (row == null) row = sheet.createRow(r);
   List<String> values = new ArrayList<String>();
   for (Map.Entry<Integer, String> entry : colHeaders.entrySet()) {
    int colIdx = entry.getKey();
    Cell cell = row.getCell(colIdx); if (cell == null) cell = row.createCell(colIdx);
    String value = dataFormatter.formatCellValue(cell, formulaEvaluator);
    values.add(value);
   }
   dataRecords.add(values);
  }

  System.out.println(dataRecords);

  workbook.close();

  // create PDF
  final PDFont font = PDType1Font.HELVETICA;
  final float fontSize = 12.0f;
  final float lineHeight = fontSize * 1.42857f;
  PDPage page = new PDPage(); //U.S. Letter, 8.5" x 11"
  final PDRectangle artBox = page.getArtBox();
  final float artBoxHeight = artBox.getHeight();
  final float artBoxWidth = artBox.getWidth();
  final float leftMargin = artBoxWidth / 8.5f; // 1"
  final float topMargin = artBoxHeight / 11.0f; // 1"
  final float bottomMargin = artBoxHeight / 11.0f; // 1"

  PDDocument doc = new PDDocument();

  doc.addPage(page);
  PDPageContentStream contents = new PDPageContentStream(doc, page);
  contents.beginText();
  contents.setFont(font, fontSize);
  float currentLinePos = artBoxHeight-topMargin;
  contents.newLineAtOffset(leftMargin, currentLinePos);

  for (List<String> dataRecord : dataRecords) {
   Integer colIdx = colHeaders.firstKey();
   for (String value : dataRecord) {
    if (colIdx != null) {
     String header = colHeaders.get(colIdx);
     if (currentLinePos <= bottomMargin) {
      contents.endText();
      contents.close();
      page = new PDPage();
      doc.addPage(page);
      contents = new PDPageContentStream(doc, page);
      contents.beginText();
      contents.setFont(font, fontSize);
      currentLinePos = artBoxHeight-topMargin;
      contents.newLineAtOffset(leftMargin, currentLinePos);
     }
     contents.showText(header + ": " + value);
     contents.newLineAtOffset(0, -lineHeight);
     currentLinePos -= lineHeight;
    }
    colIdx = colHeaders.higherKey(colIdx);
   }
   contents.newLineAtOffset(0, -lineHeight);
   currentLinePos -= lineHeight;   
  }

  contents.endText();
  contents.close();

  doc.save("ExcelExample.pdf");
  doc.close();

 }
}

Question:

I am trying to add a logo to my PDF file using PDFBox version 2.0.1. I have the following code:

public class PDFService {

    public void createPdf() {
        // Create a document and add a page to it
        PDDocument document = new PDDocument();

        PDPage page = new PDPage();

        document.addPage(page);

        // Create a new font object selecting one of the PDF base fonts
        PDFont font = PDType1Font.HELVETICA_BOLD;

        ServletContext servletContext = (ServletContext) FacesContext
                .getCurrentInstance().getExternalContext().getContext();

        try {

            PDImageXObject pdImage = PDImageXObject.createFromFile(
                    servletContext.getRealPath("/resources/images/logo.png"),
                    document);

            PDPageContentStream contentStream = new PDPageContentStream(
                    document, page);

            contentStream.drawImage(pdImage, 20, 20);

            contentStream.beginText();
            contentStream.setFont(font, 12);
            contentStream.endText();

            // Make sure that the content stream is closed:
            contentStream.close();

            // Save the results and ensure that the document is properly closed:
            document.save("Hello World.pdf");
            document.close();

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}

I am getting the error javax.imageio.IIOException: Can't read input file! in the line

PDImageXObject pdImage = PDImageXObject.createFromFile(
                    servletContext.getRealPath("/resources/images/logo.png"),
                    document);

The path returned by servletContext.getRealPath is C:\Users\erickpezoa\Desktop\Multivision\Materials\apps\eclipse Kepler\eclipse\Projects\.metadata\.plugins\org.eclipse.core.resources\Servicios_Exequiales\build\weboutput\resources\images\logo.png

What am I doing wrong here?


Answer:

If you are using Maven and your images folder is under src/main/resources in Eclipse, you can try:

PDImageXObject pdImage = PDImageXObject.createFromFile(
                PDFService.class.getResource("/images/logo.png").getPath(),
                document);

The path /resources/images/logo.png is only needed if you have another folder called resources under src/main/resources, or if you are not using Maven and your output folder contains /resources/images. In that case:

PDImageXObject pdImage = PDImageXObject.createFromFile(
                PDFService.class.getResource("/resources/images/logo.png").getPath(),
                document);

Hope it helps.
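
One caveat with getResource(...).getPath(): the returned path stays URL-encoded, so a directory name containing a space (such as "eclipse Kepler" in the question) may come back with %20 in it. The following is a small sketch of converting the URL through a URI instead; the class name and the /tmp path are purely illustrative, and the approach only applies to file: URLs, not to resources packed inside a JAR.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Shows the difference between URL.getPath() and a decoded file system path. */
final class ResourcePathSketch {

    /** Turns a file: URL into a file system path, decoding escapes such as %20. */
    static Path toPath(URL resource) throws URISyntaxException {
        return Paths.get(resource.toURI());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical location with a space in a directory name.
        URL url = URI.create("file:///tmp/eclipse%20Kepler/logo.png").toURL();
        System.out.println(url.getPath()); // still contains %20
        System.out.println(toPath(url));   // contains a real space
    }
}
```

Applied to the snippets above, this would mean passing Paths.get(PDFService.class.getResource("/images/logo.png").toURI()).toString() to createFromFile. Note that getResource returns null when the resource is not found, so a null check is advisable.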

Question:

I am trying to extract images from a PDF using PDFBox. I took help from this post. It worked for some of the PDFs, but for most others it did not. For example, I am not able to extract the figures in this file

After doing some research I found that PDResources.getImages is deprecated. So, I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:

org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

Now I am stuck and unable to find the solution. Please assist if anyone can.

//////UPDATE AS REPLY ON COMMENTS///

I am using pdfbox-1.8.10

Here is the code:

public void getimg ()throws Exception {

try {
        String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
        String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
        File oldFile = new File(sourceDir);
        if (oldFile.exists()){
              PDDocument document = PDDocument.load(sourceDir);
               List<PDPage> list =   document.getDocumentCatalog().getAllPages();
               String fileName = oldFile.getName().replace(".pdf", "_cover");
               int totalImages = 1;
               for (PDPage page : list) {
                   PDResources pdResources = page.getResources();
                   Map pageImages = pdResources.getXObjects();
                    if (pageImages != null){
                      Iterator imageIter = pageImages.keySet().iterator();
                      while (imageIter.hasNext()){
                      String key = (String) imageIter.next();
                      Object obj = pageImages.get(key);

                      if(obj instanceof PDXObjectImage) {
               PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;

                         pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);

                     totalImages++;
                      }
                      }
                    }
               }
        }  else {
                    System.err.println("File not exist");
                       }  
}
catch (Exception e){

    System.err.println(e.getMessage());
 }
 }

//// PARTIAL SOLUTION/////

I have fixed the error message and updated the code in the post accordingly. However, the problem remains: I am still not able to extract the images from some of the files, such as the one mentioned in this post. Is there any solution for that?


Answer:

The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so you need to check the instance type. The second problem is that the code doesn't walk PDXObjectForm objects recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.

Code for 1.8 can be found here: https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup

Code for 2.0 can be found here: https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date

(Even these are not always perfect, see this answer)

The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings, which can't be "extracted" like embedded images. All you could do is convert the PDF pages to images, and then mark and cut out what you need.

Question:

My task is to extract text from a PDF at specific coordinates.

I have used Apache PDFBox for the data extraction.

To get the x, y, height and width coordinates from the PDF I am using the PDF X change tool, which reports them in millimeters. When I pass these values to the rectangle, I get an empty value.

public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
                double height) throws IOException {
            String extractedText = "";
            // PDDocument Creates an empty PDF document. You need to add at least
            // one page for the document to be valid.
            // Using load method we can load a PDF document
            PDDocument document = null;
            PDPage page = null;
            try {
                if (pdfLocation.endsWith(".pdf")) {
                    document = PDDocument.load(new File(pdfLocation));
                    int getDocumentPageCount = document.getNumberOfPages();
                    System.out.println(getDocumentPageCount);

                    // Get specific page. THe parameter is pageindex which starts with // 0. If we need to
                    // access the first page then // the pageIdex is 0 PDPage
                    if (getDocumentPageCount > 0) {
                        page = document.getPage(pageNumber + 1);
                    } else if (getDocumentPageCount == 0) {
                        page = document.getPage(0);
                    }
                    // To create a rectangle by passing the x axis, y axis, width and height 
                    Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
                    String regionName = "region1";

                    // Strip the text from PDF using PDFTextStripper Area with the
                    // help of Rectangle and named need to given for the rectangle
                    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                    stripper.setSortByPosition(true);
                    stripper.addRegion(regionName, rect);
                    stripper.extractRegions(page);
                    System.out.println("Region is " + stripper.getTextForRegion("region1"));
                    extractedText = stripper.getTextForRegion("region1");
                } else {
                    System.out.println("No data return");
                }
            } catch (IOException e) {
                System.out.println("The file  not found" + "");
            } finally {
                document.close();
            }
            // Return the extracted text and this can be used for assertion
            return extractedText;
        }

Please suggest whether my way is correct or not..


Answer:

I have used this PDF tutorialspoint.com/uipath/uipath_tutorial.pdf.. Where i am trying to find the text "a part of contests" which is have x = 55.6 mm y = 168.8 width = 210.0 mm and height = 297.0. But i am getting empty value

I tested your method with those inputs:

System.out.println("Extracting like Venkatachalam Neelakantan from uipath_tutorial.pdf\n");
float MM_TO_UNITS = 1/(10*2.54f)*72;
String text = getTextUsingPositionsUsingPdf("src/test/resources/mkl/testarea/pdfbox2/extract/uipath_tutorial.pdf",
        0, 55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS, 210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
System.out.printf("\n---\nResult:\n%s\n", text);

(ExtractText test testUiPathTutorial)

and got the result

 part of contents of this e-book in any manner without written consent 

te the contents of our website and tutorials as timely and as precisely as 
, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. 
guarantee regarding the accuracy, timeliness or completeness of our 
tents including this tutorial. If you discover any errors on our website or 
ease notify us at contact@tutorialspoint.com 

i 

Assuming you actually were looking for "a part of contents", not "a part of contests", merely the 'a' is missing; probably when measuring you looked for the beginning of the visible letter drawing but the actual glyph origin is a bit before that. If you choose a slightly smaller x, e.g. 54.6 mm, you'll also get the 'a'.

It obviously is no surprise that you get more than "a part of contents", considering the width and height of your rectangle.

Should you wonder about the MM_TO_UNITS constant, have a look at this answer.
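
For reference, the constant follows from the default user space unit being 1/72 inch and one inch being 25.4 mm. A minimal sketch of the conversion in both directions (the class and method names are invented for illustration):

```java
/** Converts between millimeters and PDF default user space units (1/72 inch). */
final class PdfUnits {
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double mmToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    static double unitsToMm(double units) {
        return units / UNITS_PER_INCH * MM_PER_INCH;
    }

    public static void main(String[] args) {
        System.out.println(mmToUnits(210.0)); // A4 width, about 595.28 units
        System.out.println(mmToUnits(55.6));  // the x value measured in the question
        System.out.println(unitsToMm(72.0));  // one inch, about 25.4 mm
    }
}
```

This is the same factor as 1/(10*2.54f)*72 in the test code above, just written with named constants.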

Question:

I have been attempting to solve this issue for a while. I have the latest PDFBox (2.0.7) and FontBox (2.0.7) for my program, and yet no matter what I do, I am getting the same compilation error.

Within this class, here are my relevant imports:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

I am attempting to set the font with the following sample:

PDDocument pdfDoc = new PDDocument();
PDPage page = new PDPage();
pdfDoc.addPage(page);

PDPageContentStream contents = new PDPageContentStream(pdfDoc, page);
PDFont font = PDType0Font.load(pdfDoc, new File("/path/to/font/Roboto-Regular.ttf"));
contents.setFont(font, 20);

Unfortunately, as I have stated, I get the following compilation error every time:

 error: cannot find symbol
 PDFont font = PDType0Font.load(pdfDoc, new File("/path/to/font/Roboto-Regular.ttf"));
 symbol:   method load(PDDocument,File)
 location: class PDType0Font

I have looked at the Javadocs multiple times, I have opened up the JAR file to confirm that that method is there (it is), and I have tried other things such as initializing the "font" as an instance of PDType0Font instead of the interface PDFont. Same error. I have tried importing every single JAR the website offers for 2.0.7. (Preflight, xmpbox, pdfbox-tools, pdfbox-debugger) and I still get the same error. I have tried importing every single class from the pdmodel and pdmodel.font packages. Same error. Everything else works fine - it is just this one particular method. Initially I had used PDTrueTypeFont instead of PDType0Font and it was just fine. But I have to switch to PDType0Font due to foreign characters.

EDIT: Solved. It turns out an outdated Tika JAR in my classpath was creating a conflict and reverting PDFBox to version 1.8.13.


Answer:

This issue has been solved. It turns out there was a conflict in my classpath. I had a very outdated Tika JAR that had PDFBox 1.8 within it, so I have updated Tika to the most recent version, and no longer have issues. Thank you to Tilman Hausherr who suggested the solution.

Could it be that there's an old .jar file in your class path? Try adding Exception e = new COSVisitorException(new Exception());. If that one works, then it means you have an 1.8 version in your classpath (and you shouldn't!)
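
Another way to track down such a conflict is to ask the JVM where it actually loads a class from. The following is a small sketch using only standard reflection; the class name WhichJar and the messages are made up, and the package of COSVisitorException is given as I remember it from PDFBox 1.8, so treat that name as an assumption.

```java
import java.security.CodeSource;

/** Reports where a class is loaded from, to spot outdated JARs on the classpath. */
final class WhichJar {

    static String locationOf(String className) {
        try {
            // 'false' avoids running static initializers; we only want to locate the class.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(no code source, for example a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable from the classpath)";
        }
    }

    public static void main(String[] args) {
        // Should point at the pdfbox 2.0.x JAR if the classpath is clean.
        System.out.println(locationOf("org.apache.pdfbox.pdmodel.font.PDType0Font"));
        // Assumed 1.8-only class: if this prints a JAR location, an old PDFBox is still around.
        System.out.println(locationOf("org.apache.pdfbox.exceptions.COSVisitorException"));
    }
}
```

If the first line printed points at something other than the expected PDFBox JAR (for instance a bundled Tika JAR), that is the conflicting dependency.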


Question:

I'm using Apache's PDFBox version 2.0.4 and am having a problem using lineTo and curveTo. My function takes parameters of radians, starting degrees and ending degrees and then uses lineTo and curveTo to generate a slice of a pie chart.

mContents.setNonStrokingColor(color);
mContents.moveTo(0, 0);
List<Float> smallArc = createSmallArc(rad, Math.toRadians(startDeg), Math.toRadians(endDeg));
mContents.lineTo(smallArc.get(0), smallArc.get(1));
mContents.curveTo(smallArc.get(2), smallArc.get(3), smallArc.get(4), smallArc.get(5), smallArc.get(6), smallArc.get(7));
mContents.closePath();
mContents.fill();

The pie chart generates and appears to be fine. My app adds a footer which contains a logo that it reads from a file as follows:

try {
    pdImage = PDImageXObject.createFromFile(mFullImagePath, mDoc);
}catch(IOException ie){System.out.println("Error opening image file - "+ie.getMessage());}
try {
    mContents.drawImage(pdImage,250,5,pdImage.getWidth()/2,pdImage.getHeight()/2);
}catch(IOException e){System.out.println("Error adding image file - "+ e.getMessage());}

When the pie chart is included in the generated PDF, the footer and image are not in the PDF. If I stub out the code that generates the pie chart, the footer shows up with the image included.

Currently I have to add the pie chart at specific coordinates after the rest of the page has been generated; otherwise the additional lines below the pie chart do not appear.

Could the output generated by curveTo and lineTo be bigger than what is displayed, causing these issues?

EDIT - adding the image in the footer before drawing the graph and the image, graph and text all appear.

Appreciate any pointers

Complete code:

import com.google.code.geocoder.Geocoder;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;
import org.apache.tomcat.jni.Address;
import org.slf4j.Logger;

import java.awt.*;
import java.io.IOException; 
import java.text.DecimalFormat; 
import java.util.ArrayList; 
import java.util.Calendar; 
import java.util.List; 
import java.util.Locale;

/** Created by tim on 7/6/2017. */
public class ReportDataPDFBox {
    private PDDocument mDoc = null;
    private PDPage mPage = null;
    private PDImageXObject pdImage = null;
    private PDFont mHeaderFont = PDType1Font.HELVETICA_BOLD;

    private final int FONT_SIZE_HDR1 = 16;
    private final int FONT_SIZE_HDR2 = 14;
    private final int FONT_SIZE_REG = 12;
    private final int HDR_INDENT = 30;
    private final int BODY_INDENT_1 = 55;
    private final int BODY_INDENT_2 = 65;
    private final int BODY_INDENT_3 = 75;

    private PDFont mRegFont = PDType1Font.HELVETICA;
    PDPageContentStream mContents = null;

    private String mReportName = null;
    private String mFullImagePath = null;
    private String mMonth = null;
    private boolean mReportDone = true;
    private int mHorizonVal = 700;
    private int mHorizonGrph = 0;
    private long[] mDayPercent;

    private Calendar mCurrentCalendar = null;
    ProcessFrequencyData pfd = null;
    ProcessWeatherData pwd = null;
    ProcessPerformanceData ppd = null;
    Logger log = null;
    Color[] mColor = {Color.PINK,Color.YELLOW,Color.CYAN, Color.BLUE,Color.RED,Color.GREEN,Color.ORANGE,Color.LIGHT_GRAY};

    public ReportDataPDFBox(Logger logger, ProcessFrequencyData pfd, ProcessWeatherData pwd, ProcessPerformanceData ppd){
        this.log = logger;
        this.pfd = pfd;
        this.pwd = pwd;
        this.ppd = ppd;
        initializeDoc();
    }

    public void initializeDoc(){
        mDoc = new PDDocument();
        mPage = new PDPage();
        mDoc.addPage(mPage);
        mFullImagePath = "logo.png";
        mCurrentCalendar = Calendar.getInstance();
        mMonth = mCurrentCalendar.getDisplayName(Calendar.MONTH, Calendar.LONG, Locale.getDefault());
        mReportName = mMonth + ".pdf";
        try{
            mContents = new PDPageContentStream(mDoc, mPage);
        }catch(IOException e){System.out.println("Error setting content stream - "+e.getLocalizedMessage());}
    }

    public boolean writeTheReport(){
        addHeader();
        addFooter();

        generateReportContent();

//        addFooter();
        cleanUpALlDone();
        return mReportDone;
    }

    private void addHeader(){
        try {
            mContents.beginText();
            mContents.setFont(mHeaderFont,FONT_SIZE_HDR1);
            mContents.newLineAtOffset(200, 740);
            mContents.showText(mMonth + " - ActoTracker Report - " + mCurrentCalendar.get(Calendar.YEAR));
            mContents.endText();
        }catch (IOException io){System.out.println("Error with text content screen");}
    }

    private void generateReportContent(){
        addNumberRunInfo();
        addLocationRunInfo();
        addWeekDayInfo();
        addWeekInfo();
        addFrequencyData();
        pukeMeAChart();
           // generateDailyChart();
    }

    private void addNumberRunInfo(){
        int daysActive = Utility.getDaysBetweenDates(Utility.getOldestDate(pfd.getFirstDate(),
                pwd.getFirstDate()), Calendar.getInstance().getTimeInMillis());
        writeLine(mHeaderFont, FONT_SIZE_HDR2,HDR_INDENT, "Frequency Information");
        long percentActiveIdle = (pfd.getTotalDaysRun()*100/daysActive);
        String line = "Number of Runs - " + pfd.getTotalDaysRun() + "    Number of days ActoTracker active - " + daysActive
                + "   Percent run = " + percentActiveIdle;
        writeLine(mRegFont, FONT_SIZE_REG, BODY_INDENT_1, line);
    }

    private void addLocationRunInfo(){
        String line = "Number of locations run = " + pfd.getLocationRun();
        writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_1,line);
        for (int i=1; i<=pfd.getLocationRun();i++){
            String[] locationInfo = pfd.getLocationInfo(i);
            long percent = pfd.getRunsByLocation(i)*100/pfd.getTotalDaysRun();
            String line2= new String( locationInfo[0] + " - " + locationInfo[1] +" , "+locationInfo[2]+ "  Number of runs = " +
                    pfd.getRunsByLocation(i) + "  Percent of runs = " +percent );
            writeLine(mRegFont, FONT_SIZE_REG,BODY_INDENT_2,line2);
        }
    }

    private void addWeekDayInfo(){
        int totDaysRunning = pfd.getTotalRunDay();
        int leastCnt = 0;
        int mostCnt = 0;
        mHorizonGrph = mHorizonVal - 90;
        mDayPercent = new long[8];
        String mostDay = " most common day";
        String leastDay = " least common day";
        DayFrequencyResults frequency = pfd.getDayDistribution();
        int[] leastDays = frequency.getLessDays();
        int[] mostDays = frequency.getMostDays();
        StringBuilder leastString = new StringBuilder();
        StringBuilder mostString = new StringBuilder();
        for (int i=0; i< leastDays.length;i++){
            if (leastDays[i] != 0) {
                leastString.append(Utility.getDayName(leastDays[i])).append(" ");
                leastCnt++;
            }
        }
        for (int j=0; j< mostDays.length;j++){
            if (mostDays[j] != 0) {
                mostString.append(Utility.getDayName(j+1)).append(" ");
                mostCnt++;
            }
        }
        if (leastCnt > 1){leastDay += "s";}
        if (mostCnt > 1) {mostDay +="s";}
        String line = mostString.toString()+mostDay+ " to run"+ "     "+leastString.toString()+leastDay+" to run";
        writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_1,line);
        for (int i=1;i<8;i++){
            String day = new String(Utility.getDayName(i)+"  " + pfd.getRweekDayCount(i) + " runs "+"  "
                    +pfd.getRweekDayCount(i)*100/totDaysRunning)+ "%";
            writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_2,day);
            double x = pfd.getRweekDayCount(i) / (double)pfd.getTotalDaysRun();
            mDayPercent[i] = Math.round(360*x);
        }
        System.out.println("BreakPoint");
    }

    private void addWeekInfo(){
        String line;
        Integer[] largestWeekTotals = {0,0,0,0,0,0,0};
        double largestDistance = 0D;
        double firstHalfDist = 0D;
        double secondHalfDist = 0D;
        DecimalFormat df = new DecimalFormat("####.##");

        int[] distFreq = pfd.getMonthlySummaryInfo();
        if (distFreq[0] > distFreq[1]){
            line = "Ran more in first half of months run.   "+ distFreq[0] + " times versus "+ distFreq[1]+" times";
        }else{
            line = "Ran more in second half of months run.   " + distFreq[1] + " times versus " + distFreq[0]+" times";
        }
        writeLine(mRegFont,FONT_SIZE_REG, BODY_INDENT_1, line);

        for (int i = 1; i<7;i++){
            if (i<4){
                firstHalfDist += Utility.getMileage(pfd.fa.getWeekDistanceTotal(i),false);
            }else{
                secondHalfDist += Utility.getMileage(pfd.fa.getWeekDistanceTotal(i),false);
            }
        }
        if (firstHalfDist > secondHalfDist){
            line = new String ("Ran further in the first half of the month " + df.format(firstHalfDist) + " miles versus " +
                    df.format(secondHalfDist) + " miles");
        }else{
            line = new String ("Ran further in the second half of the month " + df.format(secondHalfDist) + " miles versus " +
                    df.format(firstHalfDist)+ " miles");
        }
        writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_1, line);
    }

    private void addFrequencyData(){

        int greatestFreq = 0;
        int leastDiff = 0;
        int greatestDiff = 0;
        int leastFreq = 0;
        for (int i=0; i<30; i++){
            int cnt = ppd.getRunsByFrequentcy(i);
            if (cnt > greatestFreq){
                greatestFreq = cnt;
                greatestDiff = i;
            }
            else{
                if (cnt > 0 && i>leastDiff){
                    leastDiff = i;
                    leastFreq = cnt;
                }
            }

            log.info("Frequency?? = " + cnt + " index = "+i);
        }
        String line = greatestDiff + " days is the most common frequency between runs "+greatestFreq+" times";
        writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_1,line);
        String line2 = leastDiff + " days longest time between runs " + leastFreq + " times";
        writeLine(mRegFont,FONT_SIZE_REG,BODY_INDENT_1,line2);
    }

    private void writeLine(PDFont font, int fontSize, int indent, String text){
        mHorizonVal -= 20;
        try {
            mContents.beginText();
            mContents.setFont(font, fontSize);
            mContents.newLineAtOffset(indent,mHorizonVal);
            mContents.showText(text);
            mContents.endText();
        }catch(IOException e){}
    }

    private void addFooter(){
        log.info("IN addFooter");
        mPage = new PDPage();
        mDoc.addPage(mPage);
        try {
            pdImage = PDImageXObject.createFromFile(mFullImagePath, mDoc);
        }catch(IOException ie){System.out.println("Error opening image file - "+ie.getMessage());}
        try {
            mContents.drawImage(pdImage,250,5,pdImage.getWidth()/2,pdImage.getHeight()/2);
        }catch(IOException e){log.error("Error adding image file - "+ e.getLocalizedMessage());}
    }

    private void cleanUpALlDone(){
        try {
            mContents.close();
            mDoc.save(mReportName);
            mDoc.close();
        }catch (IOException ie){System.out.println("Error closing PDF document - " + ie.getMessage());}
    }

    private void generateDailyChart(){
        int totalVal = 0;
        try {
            mContents.transform(Matrix.getTranslateInstance(375, 525));
        }catch(IOException e){}

        for (int i=1; i< 8;i++){
            totalVal += mDayPercent[i];
            writeTheChart(mDayPercent[i-1], totalVal,mColor[i]);
            log.info("Color selected = " +mColor[i] +"Index = "+i);
        }
    }

    private void writeTheChart(long beg, long end, Color color){
        try {
            log.info("Color received = " + color);
            drawSlice(color, 60,beg, end);
        }catch(IOException e){}
    }

    private void pukeMeAChart(){
        try {
            mContents.transform(Matrix.getTranslateInstance(375,525));
            drawSlice(Color.YELLOW, 60, 0, 69);
            mContents.fill();
            drawSlice(Color.BLUE, 60, 69, 117);
            drawSlice(Color.RED, 60, 117, 181);
            mContents.fill();
            drawSlice(Color.WHITE, 60, 181, 208);
            mContents.fill();
            drawSlice(Color.GREEN, 60, 208, 272);
            mContents.fill();
            drawSlice(Color.YELLOW, 60, 272, 336);
            drawSlice(Color.BLUE, 60, 336, 360);
            mContents.fill();
        } catch(IOException e ){}
    }

    private void drawSlice(Color color, float rad, float startDeg, float endDeg) throws IOException
    {
        mContents.setNonStrokingColor(color);
        mContents.moveTo(0, 0);
        List<Float> smallArc = createSmallArc(rad, Math.toRadians(startDeg), Math.toRadians(endDeg));
        mContents.lineTo(smallArc.get(0), smallArc.get(1));
        mContents.curveTo(smallArc.get(2), smallArc.get(3), smallArc.get(4), smallArc.get(5), smallArc.get(6), smallArc.get(7));
        mContents.closePath();
        mContents.fill();
    }

    private List<Float> createSmallArc(double r, double a1, double a2)
    {
        // Compute all four points for an arc that subtends the same total angle
        // but is centered on the X-axis
        double a = (a2 - a1) / 2;
        double x4 = r * Math.cos(a);
        double y4 = r * Math.sin(a);
        double x1 = x4;
        double y1 = -y4;
        double q1 = x1*x1 + y1*y1;

        double q2 = q1 + x1*x4 + y1*y4;
        double k2 = 4/3d * (Math.sqrt(2 * q1 * q2) - q2) / (x1 * y4 - y1 * x4);
        double x2 = x1 - k2 * y1;
        double y2 = y1 + k2 * x1;
        double x3 = x2;
        double y3 = -y2;

        // Find the arc points' actual locations by computing x1,y1 and x4,y4
        // and rotating the control points by a + a1

        double ar = a + a1;
        double cos_ar = Math.cos(ar);
        double sin_ar = Math.sin(ar);

        List<Float> list = new ArrayList<Float>();
        list.add((float) (r * Math.cos(a1)));
        list.add((float) (r * Math.sin(a1)));
        list.add((float) (x2 * cos_ar - y2 * sin_ar));
        list.add((float) (x2 * sin_ar + y2 * cos_ar));
        list.add((float) (x3 * cos_ar - y3 * sin_ar));
        list.add((float) (x3 * sin_ar + y3 * cos_ar));
        list.add((float) (r * Math.cos(a2)));
        list.add((float) (r * Math.sin(a2)));
        return list;
    }
}

Answer:

Contrary to your assumption, the problem is not in your use of lineTo and curveTo, i.e. in your method drawSlice. The problem is in the code that calls that method, i.e. here:

private void pukeMeAChart(){
    try {
        mContents.transform(Matrix.getTranslateInstance(375,525));
        drawSlice(Color.YELLOW, 60, 0, 69);
        mContents.fill();
        drawSlice(Color.BLUE, 60, 69, 117);
        drawSlice(Color.RED, 60, 117, 181);
        mContents.fill();
        drawSlice(Color.WHITE, 60, 181, 208);
        mContents.fill();
        drawSlice(Color.GREEN, 60, 208, 272);
        mContents.fill();
        drawSlice(Color.YELLOW, 60, 272, 336);
        drawSlice(Color.BLUE, 60, 336, 360);
        mContents.fill();
    } catch(IOException e ){}
}

This method starts by translating the coordinate system

mContents.transform(Matrix.getTranslateInstance(375,525));

and does not undo that translation when it is finished. Thus, the footer and image are in the PDF, just not where you expect them: they are translated too, probably to a position outside the crop box.

To undo the translation (and other changes, too, like the fill color), simply store the graphics state at the start of pukeMeAChart and restore it at the end of it.

Furthermore, drawSlice fills the slice itself, so there is no path to fill in pukeMeAChart anymore. Thus, the fill calls there are invalid.

All changes applied:

private void pukeMeAChart(){
    try {
        mContents.saveGraphicsState();
        mContents.transform(Matrix.getTranslateInstance(375,525));
        drawSlice(Color.YELLOW, 60, 0, 69);
        drawSlice(Color.BLUE, 60, 69, 117);
        drawSlice(Color.RED, 60, 117, 181);
        drawSlice(Color.WHITE, 60, 181, 208);
        drawSlice(Color.GREEN, 60, 208, 272);
        drawSlice(Color.YELLOW, 60, 272, 336);
        drawSlice(Color.BLUE, 60, 336, 360);
        mContents.restoreGraphicsState();
    } catch(IOException e ){}
}

generateDailyChart(), another method (indirectly) using the drawSlice method, also has the graphics state issue and has to be fixed similarly:

private void generateDailyChart(){
    int totalVal = 0;
    try {
        // saveGraphicsState() can throw IOException, so it belongs inside the try block
        mContents.saveGraphicsState();
        mContents.transform(Matrix.getTranslateInstance(375, 525));

        for (int i=1; i< 8;i++){
            totalVal += mDayPercent[i];
            writeTheChart(mDayPercent[i-1], totalVal,mColor[i]);
            log.info("Color selected = " +mColor[i] +"Index = "+i);
        }
        mContents.restoreGraphicsState();
    }catch(IOException e){}
}

Since the call to it is currently commented out and the method is therefore not used, this problem does not show up yet.
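
To see why an unbalanced translation moves everything drawn afterwards, here is a minimal sketch of the save/restore idea. It is a toy model written for illustration, not PDFBox code: the class GraphicsStateModel and its methods place and laterDrawAt are hypothetical, and it tracks only a translation, whereas the real graphics state also holds colors, line widths and more. The coordinates are the ones from the question (the chart translation of 375, 525 and the image position of 250, 5).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a graphics state stack (hypothetical, not part of PDFBox).
public class GraphicsStateModel {
    private double tx = 0, ty = 0;                       // current translation
    private final Deque<double[]> saved = new ArrayDeque<>();

    // Remember the current translation.
    public void saveGraphicsState() { saved.push(new double[] {tx, ty}); }

    // Go back to the most recently saved translation.
    public void restoreGraphicsState() {
        double[] s = saved.pop();
        tx = s[0];
        ty = s[1];
    }

    // Shift the coordinate system, like a translate matrix does.
    public void translate(double dx, double dy) { tx += dx; ty += dy; }

    // Where a drawing call at (x, y) actually lands on the page.
    public double[] place(double x, double y) { return new double[] {x + tx, y + ty}; }

    // Draws "a chart" with or without save/restore, then places a later item at (250, 5).
    public static double[] laterDrawAt(boolean balanced) {
        GraphicsStateModel g = new GraphicsStateModel();
        if (balanced) g.saveGraphicsState();
        g.translate(375, 525);                           // same offset as in pukeMeAChart
        if (balanced) g.restoreGraphicsState();
        return g.place(250, 5);                          // same position as the image in addFooter
    }

    public static void main(String[] args) {
        double[] unbalanced = laterDrawAt(false);
        double[] balanced = laterDrawAt(true);
        System.out.println(unbalanced[0] + ", " + unbalanced[1]); // 625.0, 530.0
        System.out.println(balanced[0] + ", " + balanced[1]);     // 250.0, 5.0
    }
}
```

Without the save/restore pair, the later item ends up at x = 625, which is beyond the 612-point width of a default US Letter page. With the pair, it stays at (250, 5).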

Question:

I have a CodenameOne project which is mostly done, and I need to integrate the desktop version with PDFBox. I am trying to open a document and render its pages to images, which I display in the application.

I get the following error:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDTrueTypeFont
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:213)
    at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:607)
    at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:59)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
    at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
    at co.za.gingetsuryuu.pdfreader.PDFInterfaceImpl$1.run(PDFInterfaceImpl.java:69)
    at java.lang.Thread.run(Thread.java:745)

I'm at a total loss. I have checked that the libraries are included. It manages to render one or two of the pages, but no more than that. I've even checked that the function exists, and it does.


Answer:

The problem appears to be with the way Codename One does the build process. By including the jar files in the result project and making them load higher in the jar load order than the app.jar file, the issue is resolved.
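
As general Java background (not something specific to Codename One or PDFBox): the message "Could not initialize class" is what the JVM reports when a class's static initializer has already failed on an earlier attempt. The first attempt fails with an ExceptionInInitializerError that carries the real cause, so that earlier error is usually the one worth looking for in the log. A minimal sketch of this behavior, where Fragile is a hypothetical stand-in for a class whose static setup fails:

```java
public class InitFailureDemo {

    // Hypothetical class whose static initializer always fails.
    static class Fragile {
        static final int VALUE = compute();

        static int compute() {
            throw new IllegalStateException("resource missing");
        }
    }

    // Touches the class and reports which kind of error came back.
    public static String attempt() {
        try {
            return String.valueOf(Fragile.VALUE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(attempt()); // first use: ExceptionInInitializerError
        System.out.println(attempt()); // later uses: NoClassDefFoundError
    }
}
```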

Question:

final String NBSP = new String("\u00a0");
contentStream.showText("Konichua!" + NBSP);

This throws the following exception:

java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding

I have tried it with all 3 fonts available, TimesNewRoman, Courier and Helvetica, and all 3 result in the same exception.

But when you look at the WIN_ANSI_ENCODING_TABLE in the source code of Apache PDFBox,

    {040, "space"},
    {0243, "sterling"},
     .
     .

    // adding some additional mappings as defined in Appendix D of the pdf spec
    {0240, "space"},
    {0255, "hyphen"}

we can see that the non-breaking space below is defined.

DEC     OCT     HEX     BIN       Symbol     Description
160    240      A0      10100000              Non-breaking space

In the PDF specification the following is stated too:

The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE.


Use case:

To increase the width of header text columns by padding them with NBSP, so that the padding is not removed when string.trim() is called on the header columns.


Answer:

"it shall be typographically the same as (U+003A) SPACE".

So the font doesn't have an nbsp / nbspace glyph of its own. Get a font that has one by calling PDType0Font.load(document, new File("...")).

Btw, calling new String() on a string is not a good idea.
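
For the padding use case itself, here is a small sketch in plain Java with no PDFBox involved. It shows that String.trim() leaves U+00A0 alone, because trim() only strips characters up to and including U+0020. The class NbspPadding and its pad method are hypothetical names made up for this example, and NBSP is declared as a plain literal rather than via new String(...). Whether the padded text can then be shown still depends on the font containing the glyph, as described above.

```java
public class NbspPadding {
    // A plain literal is enough, no new String(...) needed.
    static final String NBSP = "\u00a0";

    // Pads a header on the right with non-breaking spaces up to the given width.
    public static String pad(String header, int width) {
        StringBuilder sb = new StringBuilder(header);
        while (sb.length() < width) {
            sb.append(NBSP);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String padded = pad("Name", 8);
        // trim() removes leading/trailing chars <= U+0020, so NBSP padding survives.
        System.out.println(padded.trim().length());   // 8
        // Ordinary spaces, by contrast, are trimmed away.
        System.out.println("Name    ".trim().length()); // 4
    }
}
```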

Question:

I have a problem with decrypting a PDF document with the Apache PDFBox (v1.8.2) library. Encryption works, but decryption with the same password throws an exception. (Java 1.6)

package com.test;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy;

public class PdfEncDecTest {

    static String pdfPath = "G:\\files\\filed5b3.pdf";
    public final static String PDF_OWNER_PASSWORD = "cd1j";
    public final static String PDF_USER_PASSWORD = "";  

    public static void main(String[] args) throws Exception {

        PDDocument document = PDDocument.load(pdfPath);
        AccessPermission ap = new AccessPermission();
        ap.setCanPrint(true);
        ap.setCanExtractContent(false);
        ap.setCanExtractForAccessibility(false);
        StandardProtectionPolicy spp = new StandardProtectionPolicy(PDF_OWNER_PASSWORD, PDF_USER_PASSWORD, ap);
        document.protect(spp);
        document.save(pdfPath+".pdf");
        document.close();

        PDDocument doc = PDDocument.load(pdfPath+".pdf");
        if(doc.isEncrypted()) {
            StandardDecryptionMaterial sdm = new StandardDecryptionMaterial(PDF_OWNER_PASSWORD);
            doc.openProtection(sdm); // org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
            doc.decrypt(PDF_OWNER_PASSWORD); // the same like above
        }
        doc.close();
    }

}

I don't know what is wrong. With version 1.8.7 I get the same exception. I've posted the full code above.

Exception in thread "main" org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.prepareForDecryption(StandardSecurityHandler.java:265)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:156)
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1595)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:942)
    at com.test.PdfEncDecTest.main(PdfEncDecTest.java:29)

I've put a sample project on GitHub: https://github.com/marioosh-net/pdfbox


Answer:

You need the user password.

    if (doc.isEncrypted())
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial(PDF_USER_PASSWORD);
        doc.openProtection(sdm);
        // don't call decrypt() here
    }

This works even if the user password is not null. The user password is for what most people think of as encryption; the owner password protects the security rights.

Edit: sorry, my answer is wrong, although it was helpful. You can open a PDF with the user password (you'll possibly get restricted rights) or with the owner password (you'll get full rights). What may have happened is that there is a bug with matching the owner password with 40-bit keys (which is the default). This bug is currently being investigated, see PDFBOX-2456 and search for "MD5".