Hot questions for Using PDFBox in encoding

Question:

When trying to print a PDF page using Java and the org.apache.pdfbox library, I get this error:

PDFBOX : U+000A ('controlLF') is not available in this font Helvetica encoding: WinAnsiEncoding


Answer:

[PROBLEM] The String you are trying to display contains a newline character.

[SOLUTION] Replace the String with a new one and remove the newline:

text = text.replace("\n", "").replace("\r", "");
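If the line breaks carry meaning, an alternative to deleting them is to split the text and draw each line on its own. The following is only a sketch of the splitting step: the class and method names are made up for illustration, and the drawing calls mentioned in the comment assume a PDPageContentStream on which a leading has already been set.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class LineSplitter {

    // Splits on any line break (\n, \r\n, \r, ...) so that no control
    // character is left in the pieces that are drawn later.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R", -1));
    }

    public static void main(String[] args) {
        for (String line : splitLines("first line\r\nsecond line\nthird line")) {
            // With PDFBox one would call contentStream.showText(line)
            // followed by contentStream.newLine() at this point.
            System.out.println(line);
        }
    }
}
```

Each element of the returned list is free of line-break characters, so it can be passed to showText without triggering the error above.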

Question:

I am facing a problem when invoking the setValue method of a PDField and trying to set a value which contains special characters.

field.setValue("TEST-BY  (TEST)")

Specifically, if my value contains a character such as U+00A0 (no-break space), I get the following exception:

Caused by: java.lang.IllegalArgumentException: U+00A0 is not available in this font's encoding: WinAnsiEncoding

A complete stack trace can be found here: Stacktrace

I currently have PDType1Font.TIMES_ROMAN set as the font. To solve this problem I tried other available fonts as well, but the same problem persisted.

I found the following suggestion in this answer https://stackoverflow.com/a/22274334/7434590 but we use setValue rather than showText/drawText, which can work with bytes, so I could not use this approach: setValue accepts only a String as its parameter.

Note: I cannot replace the characters with others to solve this issue; I must be able to pass any character the font supports to the setValue method.


Answer:

You'll have to embed a font and not use WinAnsiEncoding:

// Check that the font has the glyphs you need; ARIALUNI.TTF is good but huge.
PDFont formFont = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/somefont.ttf"), false);
PDResources res = acroForm.getDefaultResources();
if (res == null) { // no default resources yet, so create and register them
    res = new PDResources();
    acroForm.setDefaultResources(res);
}
String fontName = res.add(formFont).getName();
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g"; // adjust to replace existing font name
textField.setDefaultAppearance(defaultAppearanceString);

Note that this code must be run before calling setValue().

More about this in the CreateSimpleFormWithEmbeddedFont.java example from the source code download.

Question:

I am using PDFBox to fill in some PDFs. So far everything worked well: I created a form in a PDF with the Avenir Light font, and I could fill it in. However, when I try to fill the PDF using letters such as ł, ą, ć ... I get the following error:

U+0142 is not available in this font's encoding: MacRomanEncoding with differences

with different numbers.

Now, my question is: how can I fix this so that I can fill the form automatically? When I open the PDF in Acrobat Reader, I can insert those letters and I don't get any errors. Here is how I set the field:

public void setField(PDDocument document, PDField field, String value ) throws IOException {
    if( field != null && value != null) {
        try{
            field.setValue(value);
        } catch (Exception e){
            e.printStackTrace();
        }
    }
    else {
        System.err.println( "No field found with name:" + field.getPartialName() );
    }
}

UPDATE

I've been trying to load my own Avenir-Light.ttf like this:

PDFont font = PDType1Font.HELVETICA;
PDResources res = new PDResources();
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
String da = "/" + fontName.getName() + " 12 Tf 0 g";
acroForm.setDefaultAppearance(da);

However, this doesn't seem to have any impact on the printed fields, and throws almost the same message:

U+0104 ('Aogonek') is not available in this font Helvetica (generic: ArialMT) encoding: WinAnsiEncoding

Answer:

PDFBox defines 14 standard fonts in PDType1Font:

PDType1Font.TIMES_ROMAN
PDType1Font.TIMES_BOLD
PDType1Font.TIMES_ITALIC
PDType1Font.TIMES_BOLD_ITALIC
PDType1Font.HELVETICA
PDType1Font.HELVETICA_BOLD
PDType1Font.HELVETICA_OBLIQUE
PDType1Font.HELVETICA_BOLD_OBLIQUE
PDType1Font.COURIER
PDType1Font.COURIER_BOLD
PDType1Font.COURIER_OBLIQUE
PDType1Font.COURIER_BOLD_OBLIQUE
PDType1Font.SYMBOL
PDType1Font.ZAPF_DINGBATS

So if you want to use Avenir-Light, you have to load it from a .ttf file. You can do this as @TilmanHausherr suggested, with PDType0Font.load(doc, new File("path/Avenir-Light.ttf"), false):

PDFont font = PDType0Font.load(doc, new File("path/Avenir-Light.ttf"), false);
PDResources res = new PDResources();
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
String da = "/" + fontName.getName() + " 12 Tf 0 g";
acroForm.setDefaultAppearance(da);

Update

Do you know why it also displays a warning in the form of: OpenType Layout tables used in font Avenir-Light are not implemented in PDFBox and will be ignored?

The Avenir-Light font uses OpenType Layout tables (advanced typography) that PDFBox does not support yet. These advanced typographic features will be ignored.

Question:

I have to

  1. extract text from a PDF, for which I use roughly this

    f = IOUtility.getFileForPath(filePath);
    RandomAccessFile randomAccessFile = new RandomAccessFile(f, "r");
    PDFParser parser = new PDFParser(randomAccessFile);
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(pdDoc.getNumberOfPages());
    String parsedText = pdfStripper.getText(pdDoc);
    
  2. scale the PDF

    File PDFFile = IOUtility.getFileForPath(scaleConfig.getFilePath());
    document = PDDocument.load(PDFFile);
    
    for (PDPage page : document.getPages()) {
        PDRectangle cropBox = page.getCropBox();
        float tx = ((cropBox.getLowerLeftX() + cropBox.getUpperRightX()) * 0.03f) / 2;
        float ty = ((cropBox.getLowerLeftY() + cropBox.getUpperRightY()) * 0.03f) / 2;
        PDPageContentStream cs = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.PREPEND, false, false);
        cs.transform(Matrix.getScaleInstance(0.97f, 0.97f));
        cs.transform(Matrix.getTranslateInstance(tx, ty));
        cs.close();
    }
    document.save(scaleConfig.getTargetFilePath());
    
  3. and finally write something on every page of the pdf. I use one of the 14 supported Fonts mentioned here https://pdfbox.apache.org/1.8/cookbook/workingwithfonts.html. Times New Roman in this case.

    File PDFFile = IOUtility.getFileForPath(writeConfig.getFilePath());
    document = PDDocument.load(PDFFile);
    for (PDPage page : document.getPages()) {
        PDFBoxHelper.fixRotation(document, page);
        writeStringOnPage(document, page, writeConfig);
    }
    document.save(writeConfig.getTargetFilePath());
    

    with writeStringOnPage doing

    contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, false, true);
    WriteCoordinates writeCoordinates = WriteCoordinateFactory.buildCoordinates(writeConfig, page.getMediaBox());
    contentStream.beginText();
    // lower left x and lower left y are different after rotation so use those for your calculation
    contentStream.newLineAtOffset(writeCoordinates.getX(), writeCoordinates.getY());
    contentStream.setFont(writeConfig.getFont(), writeConfig.getFontSize());
    contentStream.setNonStrokingColor(writeConfig.getFontColor());
    contentStream.showText(writeConfig.getToWrite());
    contentStream.endText();
    

I left out signatures and catch blocks because of company reasons. I always close the contentstreams.

Most of the time the processed PDFs look fine in the Chrome PDF viewer, in Acrobat Reader, and after importing them into BMD. But in some specific cases I seem to have encoding issues and certain parts are not displayed correctly. All the text I add to the PDF is always displayed correctly.

I realized that only bold text in the PDF is displayed wrongly, so I used Adobe Acrobat Reader to look at the fonts used.

Arial and Arial,Bold are embedded and encoded with Identity-H. As everything in bold is affected, I concluded that all text written in Arial,Bold is displayed wrongly. Everything else is still fine after processing the PDF. I cannot add the PDF because it contains customer data, but here are some examples:

  1. Rechnungs-Nr: --> 5HFKQXQJV1U
  2. 60 Tage netto (27.12.2019) -> 7DJHQHWWR

If the PDF is imported in BMD without PDFBox-manipulation it is displayed correctly.

I tried to narrow the problem down by only scaling and only writing but the problem occurred both times.

I am using PDFBox 2.0.17 and Java 8.

As the error also occurs when I am only scaling the PDF, I used PDFDebugger to compare the original PDF:

and the PDF after I scaled it:

The only thing that seems different/off is the Contents entry.

When I open the scaled PDF and click on the Fonts section and on the Arial,Bold font, I get a lot of warnings about Unicode mappings. The PDF is displayed correctly in PDFDebugger, though.

I am neither an expert with PDFBox nor with fonts and encodings so any help would be greatly appreciated!


Answer:

In short

The relevant difference is that PDFBox serializes names differently. But the different outputs according to the PDF specification are equivalent, so you apparently have uncovered a WPViewPDF bug.

The difference in writing names

In the original PDF (raw.pdf) you find the names NOWFJV+Arial,Bold and NOWFJV+Arial,Bold-WinCharSetFFFF, in all files manipulated by PDFBox you find all occurrences of those names outside of content streams replaced by NOWFJV+Arial#2CBold and NOWFJV+Arial#2CBold-WinCharSetFFFF.

WPViewPDF cannot properly display the text written in the fonts with these changed names. After patching the PDFs back to contain a comma in place of the '#2C' in those names, WPViewPDF again properly displays such text.

I would assume WPViewPDF finds NOWFJV+Arial,Bold in the content stream and expects to find the matching font definition in the page resources using the identically written name, so it doesn't recognize it with the name NOWFJV+Arial#2CBold.

Is that a PDFBox bug?

According to the PDF specification,

Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

(ISO 32000-2, section 7.3.5 "Name objects")

Thus, this replacement of commas in names by '#2C' sequences is a completely valid alternative way to write those names.

Thus, no, it's not a PDFBox bug but apparently a WPViewPDF bug.
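To illustrate why both spellings denote the same name, here is a minimal sketch of how a reader could undo the '#xx' escapes before comparing names. The class is hypothetical (it is not PDFBox code), and it treats every escaped byte as one character, which is enough for the ASCII names discussed here.

```java
// Hypothetical helper, not part of PDFBox.
final class PdfNameEscapes {

    // Replaces every "#xx" escape (two hex digits) with the character it encodes.
    static String decode(String rawName) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < rawName.length()) {
            char c = rawName.charAt(i);
            if (c == '#' && i + 2 < rawName.length()) {
                // The two characters after '#' are the hexadecimal code.
                sb.append((char) Integer.parseInt(rawName.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String written = "NOWFJV+Arial#2CBold";
        String original = "NOWFJV+Arial,Bold";
        System.out.println(decode(written));
        System.out.println(decode(written).equals(decode(original)));
    }
}
```

A viewer that compares decoded names, rather than the raw bytes as written in the file, would match the font resource in both variants.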

Question:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;
public class sample {
public static void main(String[] args) throws InvalidPasswordException, IOException {
    File file = new File("C:\\sample.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    //java.io.PrintStream p = new java.io.PrintStream(System.out,false,"Cp921");
    //p.println(text.toString());
    System.out.println(text);
    }
}

The text is read from the PDF, but when I display it using System.out.println it shows a different output. I read several posts online and found that this has something to do with encoding. I found a solution in this question: "Text extracted by PDFBox does not contain international (non-English) characters". I had to use the Cp921 encoding for Latvian characters, but the problem is still not solved; the output is shown in this image

Then I debugged and found that the text read from the PDF is stored correctly, without any changes, so I don't know how to display the text with the correct encoding. Any help would be great, thanks in advance.

Sample PDF content: [Maksātājs, Informācija, Vārdu krājums, Ēģipte, Plašs, Vājš, Brieži, Pērtiķi, Grāmatiņa, šķīvis]

Console output in Eclipse using System.out.println:

Console output in eclipse using PrintStream:

P.S. I am beginner programmer and I have not much experience in coding


Answer:

You can change the encoding of System.out either by modifying the system property file.encoding or by replacing the stream. One of the following should work:

  1. -Dfile.encoding=utf-8 (or whatever you need) as a JVM argument
  2. System.setProperty("file.encoding", "utf-8") -- like (1) but at runtime; note that the JVM usually reads this property only once at startup, so setting it later may have no effect
  3. System.setOut(new PrintStream(System.out, true, "utf-8")) -- set System.out to whatever print stream you need.

EDIT

Your comment mentions you're writing to a file. To write to a file and specify the encoding, consider something like

try (OutputStreamWriter writer =
         new OutputStreamWriter(new FileOutputStream(new File("path/to/file")), StandardCharsets.UTF_8)) {
    writer.write(text, 0, text.length());
}

See the documentation here.
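As a rough illustration of option 3, the sketch below wraps a stream in a UTF-8 PrintStream and checks that the Latvian sample word survives the round trip. The class and helper names are invented for this example. Whether the characters then look right on screen still depends on the encoding the console itself (for example the Eclipse console) is configured to use.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class Utf8ConsoleDemo {

    // Wraps any stream in a PrintStream that encodes characters as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns the bytes a UTF-8 PrintStream writes for the given text.
    static byte[] printAsUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // "Maksātājs" from the sample PDF, written with escapes so that the
        // encoding of this source file does not matter.
        String text = "Maks\u0101t\u0101js";
        byte[] bytes = printAsUtf8(text);
        // Decoding with the same charset restores the original string.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(text));
        // Option 3 from the list above, applied to the real console stream.
        System.setOut(utf8Stream(System.out));
        System.out.println(text);
    }
}
```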

Question:

I'm using PDFBox 2.0.1.

I am trying to dynamically add some (user-provided) UTF-8 text to the form fields and show the result to the user. Unfortunately, either the PDF library is not capable of properly encoding special characters such as "äöü"... or I was not able to find any useful documentation that could help me with this issue.

Can someone tell me what is wrong with the given code sample?

try (PDDocument document = PDDocument.load(pdfTemplate)) {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm form = catalog.getAcroForm();

    List<PDField> fields = form.getFields();
    for (PDField field : fields) {
        switch (field.getPartialName()) {
            case "devices":
                // Frontend (JS): userInput = btoa('Gerät')
                String userInput = ...
                String name = new String(Base64.getDecoder().decode(base64devices), "UTF-8");
                field.setReadOnly(true);
                break;
        }
    }
    form.flatten(fields, true);
    document.save(bos);
}

And here the stacktrace of the error:

java.lang.IllegalArgumentException: U+FFFD is not available in this font's encoding: WinAnsiEncoding
    org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:368)
    org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:286)
    org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:315)
    org.apache.pdfbox.pdmodel.interactive.form.PlainText$Paragraph.getLines(PlainText.java:169)
    org.apache.pdfbox.pdmodel.interactive.form.PlainTextFormatter.format(PlainTextFormatter.java:182)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.insertGeneratedAppearance(AppearanceGeneratorHelper.java:373)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:237)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:144)
    org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:263)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:324)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.flatten(PDAcroForm.java:213)
    my.application.service.PDFService.generatePDF(PDFService.java:201)

I also found those (related) issues on SO:

pdfbox: ... is not available in this font's encoding. But that does not help me choose the right encoding, or tell me how to do so. IIRC Java uses UTF-16 internally for character encoding, so why is the default not enough? Is this an issue with the PDF document itself or with the code I use to set the value?


PdfBox encode symbol currency euro. Well, it's dynamic user input, so there are way too many things I would have to replace myself.

Thus, if the PDFBox people decided to fix the broken PDFBox method, this seemingly clean work-around code here would start to fail as it would then feed the fixed method broken input data.

Admittedly, I doubt they will fix this bug before 2.0.0 (and in 2.0.0 the fixed method has a different name), but one never knows...

Unfortunately I was not able to find this other setter method, but it might also be a different scope it does apply to.

EDIT

Updated example code to better represent the problem.


Answer:

U+FFFD is used to replace an incoming character whose value is unknown or unrepresentable in Unicode; compare the use of U+001A as a control character to indicate the substitute function (source).

That said it is likely that that character gets messed up somewhere. Maybe the encoding of the file is not UTF-8 and that's why the character is messed up.

As a general rule you should only write ASCII characters in the source code. You can still represent the whole Unicode range using the escaped form \uXXXX. In this case ä -> \u00E4.

-- UPDATE --

Apparently the problem is in how the user input gets encoded/decoded between client and server using the JS function btoa. A solution to this problem can be found at this link:

Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings
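The effect can be reproduced in plain Java. The sketch below assumes that btoa turned each character of 'Gerät' into a single Latin-1 byte before Base64 encoding; simulateBtoa is a stand-in written for this example, not a real API. Decoding those bytes as UTF-8 then yields U+FFFD, which is exactly the character the stack trace complains about.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class.
final class BtoaDecodingDemo {

    // Stand-in for JavaScript's btoa on Latin-1 input: one byte per
    // character, then Base64.
    static String simulateBtoa(String latin1Text) {
        return Base64.getEncoder()
                .encodeToString(latin1Text.getBytes(StandardCharsets.ISO_8859_1));
    }

    // What the server code in the question does.
    static String decodeAsUtf8(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    // Decoding with the charset that matches the bytes actually sent.
    static String decodeAsLatin1(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sent = simulateBtoa("Ger\u00E4t"); // "Gerät"
        // The lone byte 0xE4 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(decodeAsUtf8(sent).contains("\uFFFD"));
        System.out.println(decodeAsLatin1(sent).equals("Ger\u00E4t"));
    }
}
```

Decoding as ISO-8859-1 only works for input limited to Latin-1; the linked question describes how to make the client send real UTF-8 bytes instead, which is the more general fix.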

Question:

I have a PDF template & trying to replace some words in it. I use this code:

private PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (searchString.isEmpty() || replacement.isEmpty()) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                //Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operator and that is the string to display so lets update that operator
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    if (searchString.equals(string)) {
                        System.out.println(string);
                    }
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (searchString.equals(string)) {
                                System.out.println(string);
                            }
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}

My PDF template has only 3 strings: "file:///C/Users/Mi/Downloads/converted.txt", "[10.03.2020 18:43:57]" and "hello!!!". The first 2 strings are found correctly, but the third looks like "KHOOR...":

As I understand it, there is an encoding mismatch. When I try to replace "file:///C/Users/Mi/Downloads/converted.txt" with "Hello!", the result is "ello", without the uppercase letter and the punctuation mark. As I understand it, the key difference is in the fonts: "hello" has font settings, the others do not.

Source PDF is here: https://yadi.sk/i/l0OAcFkAkUHKYg

Please advise how to get text from a PDF as correct strings and how to replace it.


Answer:

This answer is actually an explanation why a generic solution for your task is at least very complicated if not impossible. Under benign circumstances, i.e. for PDFs subject to specific restrictions, code like yours can be successfully used, but your example PDF shows that the PDFs you apparently want to manipulate are not restricted like that.

Why automatic replacement of text is difficult/impossible

There are a number of factors that impede automatic replacement of text in PDFs. Some already make it difficult to find the instructions that draw the text in question, and others complicate replacing the characters in the arguments of those instructions.

The list of problems illustrated here is not exhaustive!

Finding instructions drawing a specific text

PDFs contain content streams which contain sequences of instructions telling a PDF processor where to draw what. Regular text in PDFs is drawn by instructions setting the current font (and font size), setting the position to draw the text at, and actually drawing text. This can be as easy to understand and search for as this:

/TT0 1 Tf
9 0 0 9 5 5 Tm
(file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]) Tj 

(Here the font TT0 with size 1 is selected, then an affine transformation is applied to scale text by a factor of 9 and move to the position (5, 5), and finally the text "file:///C/Users/Mi/Downloads/converted.txt [10.03.2020 18:43:57]" is drawn.)

In such a case searching the instructions responsible for drawing a given piece of text is easy. But the instructions in question may also look differently.

Split lines

For example the string may be drawn in pieces, instead of the Tj instruction above, we may have

[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ

(Here first "file:///C/Users/Mi/Downloads/converted.txt" is drawn, then the text drawing position is slightly moved, then "[10.03.2020 18:43:57]" is drawn, both in the same TJ instruction.)

Or you may see

(file:///C/Users/Mi/Downloads/converted.txt) Tj
([10.03.2020 18:43:57]) Tj 

(The text parts drawn in different instructions.)

Also the order of text pieces may be unexpected:

([10.03.2020 18:43:57]) Tj 
-40 0 Td
(file:///C/Users/Mi/Downloads/converted.txt) Tj

(First the date string is drawn, then the text position is moved left to well before the drawn date, then the URL is drawn.)

Some PDF producers draw each character separately, setting the whole text transformation in between:

9 0 0 9 5 5 Tm
(f) Tj
9 0 0 9 14 5 Tm
(i) Tj
9 0 0 9 23 5 Tm
(l) Tj
...

And these different instructions need not be arranged in sequence as here, they can be spread over the whole stream, even over multiple streams as a page can have an array of content streams instead of a single one or part of the string may be drawn in the content stream of a sub-object referenced from the page content stream.

Thus, for finding the instructions responsible for a specific, multi-character text, you may have to inspect multiple streams and glue the strings you found together according to the position they have been drawn at.
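A very reduced sketch of that gluing step is shown below. The Fragment type and the coordinates are hypothetical; in a real document the positions would come from evaluating the text matrix for every drawing instruction, and one would also need tolerances for "same line" and for gaps that stand for spaces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class.
final class FragmentGlue {

    // Hypothetical model of one drawn string and the position where it starts.
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Orders fragments top to bottom, then left to right, and concatenates them.
    static String glue(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Drawing order as in the "unexpected order" example: date first, URL second.
        List<Fragment> drawn = new ArrayList<>();
        drawn.add(new Fragment(45f, 5f, "[10.03.2020 18:43:57]"));
        drawn.add(new Fragment(5f, 5f, "file:///C/Users/Mi/Downloads/converted.txt"));
        System.out.println(glue(drawn));
    }
}
```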

Ligatures

Not every character code necessarily corresponds to a single character of your search string. There are special glyphs for combinations of characters, e.g. a single ligature glyph for "fl". So for searching, one has to expand such ligatures.
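When the extracted text contains Unicode ligature code points, one possible way to expand them is Unicode compatibility normalization. This is only a sketch under that assumption: it does not help if the font maps the ligature glyph to nothing or to a private-use code point, and NFKC also folds other compatibility characters (full-width forms, superscripts), which may or may not be wanted.

```java
import java.text.Normalizer;

// Hypothetical demo class.
final class LigatureExpansion {

    // Expands compatibility characters such as U+FB02 (fl ligature) into
    // their plain-letter equivalents.
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "\uFB02ow"; // one ligature character followed by "ow"
        System.out.println(expand(extracted).equals("flow"));
    }
}
```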

Encodings

In the examples above, the characters of the text were easy to recognize even if the text was not drawn in a single run. But in PDFs the encoding of the characters need not be so obvious; in fact, each font may come with its own encoding, e.g.

<004B0048004F004F0052000400040004>Tj 

can draw "hello!!!".

(Here the string argument is written as hex string, in the debugger you saw "KHOOR...".)

Thus, for searching text, one needs to first map the string arguments of text drawing instructions to Unicode depending on the specific encoding of the current font.
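The following sketch contrasts a naive reading of the hex string with a font-aware one. The lookup table is hypothetical: it is filled in by hand from the fact stated above that this string draws "hello!!!", whereas a real implementation would take the mapping from the font (for example its ToUnicode data), if the PDF provides one at all.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class.
final class HexStringDecoding {

    // Splits the body of a PDF hex string into 2-byte character codes.
    static int[] toCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    // Naive reading: pretend every code is already a Unicode value.
    static String naive(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    // Font-aware reading: look each code up in a code-to-Unicode table.
    static String mapped(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Hand-written table for the sample string only (an assumption for this demo).
    static Map<Integer, String> sampleTable() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x4B, "h");
        table.put(0x48, "e");
        table.put(0x4F, "l");
        table.put(0x52, "o");
        table.put(0x04, "!");
        return table;
    }

    public static void main(String[] args) {
        int[] codes = toCodes("004B0048004F004F0052000400040004");
        System.out.println(naive(codes).startsWith("KHOOR"));
        System.out.println(mapped(codes, sampleTable()));
    }
}
```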

But the PDF does not need to contain a mapping from the individual codes to Unicode characters, there may only be a mapping to the glyph id in the font file. In case of embedded fonts files, these font files then don't need to contain any mapping to Unicode characters either.

Often PDF files do have information on the Unicode characters matching the codes to allow text extraction e.g. for copy/paste; strictly speaking, though, such information is optional; even worse, that information may contain errors without creating issues when displaying the PDF. In all such situations one has to use OCR like mechanisms to recognize the Unicode characters associated with each glyph.

Replacing text in instructions

Once you found the instructions responsible for drawing the text you searched, you have to replace the text. This may also imply some problems.

Subset fonts

If font files are embedded in a PDF, they often merely are embedded as subsets of the original fonts to save space. E.g. in your example PDF the font Tahoma used to display "hello!!!" only is embedded with the following glyphs:

Even Times New Roman (the font used for the text you could recognize) is only subset embedded with the following glyphs:

Thus, even if you found the "hello!!!" in Tahoma, simply replacing the character codes to mean "byebye??" would only display " e e " as the only character for which a glyph is present in the embedded font is the 'e'.

Thus, to replace you may either have to edit the embedded font file and the representing PDF font object to contain and encode all required glyphs, or to add another font and instructions to switch to that font for the manipulated text drawing instructions and back again thereafter.

Font encodings

Even if your font is not embedded at all (so your complete local copy of the font will be used) or embedded with all the glyphs you need, the encoding used for your font may be limited. In Western European language based PDFs you will often find WinAnsiEncoding, an encoding similar to Windows code page 1252. If you want to replace with Cyrillic text, there are no character codes for those characters.

Thus in this case you might have to change the encoding to include all the characters you need (by finding unused characters in the present encoding by scanning all uses of the font in question) or add another font with a more apropos encoding.
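As a rough pre-check, one can ask the JDK whether a replacement string fits into Windows code page 1252. This is only an approximation: WinAnsiEncoding is similar to that code page, not identical to it (an earlier question on this page shows PDFBox rejecting U+00A0, which code page 1252 does contain), so the authoritative answer comes from the PDF font object itself. The class name is made up for this example.

```java
import java.nio.charset.Charset;

// Hypothetical demo class.
final class EncodabilityCheck {

    // True if every character of the text has a code in windows-1252.
    static boolean fitsWindows1252(String text) {
        return Charset.forName("windows-1252").newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWindows1252("Hello!"));
        // "Привет" in Cyrillic, written with escapes.
        System.out.println(fitsWindows1252("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```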

Layout considerations

If your replacement text is longer or shorter than the replaced text and there is other text following on the same line in the PDF, you have to decide whether that text should be moved, too, or not. It may belong together and has to be shifted accordingly, but it may alternatively be from a separate text block or column in which case it should not be moved.

Text justification may also be damaged.

Also consider marked text (underline / strike through / background color / ...). These markings in PDF (usually) are not font properties but separate vector graphics. To get these right, you have to parse the vector graphics and annotations from the page, heuristically identify text markings, and update them.

Tagged PDFs

If you deal with tagged PDFs (e.g. for accessibility), this may make finding text easier (as accessibility should allow for easy text extraction) but replacing text harder because you may also have to update some tags or structure tree data.

How to implement a generic text replacement nonetheless

As shown above, there are a lot of hindrances to text replacement in PDFs. Thus, a complete solution (where possible at all) is far beyond the scope of a Stack Overflow answer. Some pointers, though:

To find the text to replace, you should make use of the PDFTextStripper (a PDFBox utility class for text extraction) and extend it so that it returns all the text together with pointers to the text drawing instruction that draws each character. This way you don't have to implement all the decoding and sorting of the text yourself.

To replace the text, you can ask the PDFBox font classes (provided by the PDFTextStripper if extended accordingly) whether they can encode your replacement text.

And always have a copy of the PDF specification (ISO 32000-1 or ISO 32000-2) at your hands...

But do be aware that it will take you a while, a number of weeks or months, to get a somewhat decent generic solution.

Question:

I have a (quite simple) Java Spring Boot/REST service that renders PDFs from input, and I am testing it with IntelliJ.

I use pdfbox as the tool to create such pdfs.

One feature is that the client can give annexes as byte[] in addition to the regular content it wants.

Problem

When users try the service, the final document has blank pages, but only for the annexes part.

Investigation
  • Tried with IntelliJ and its HTTP REST client and got the same issue
  • Saving the annexes into separate files gives clear and correct documents
  • Saving the whole document (regular content + annexes) into a file gives a correct result as well
  • Using Postman, the document is fine...

When I noticed that it works fine with Postman, I changed the IntelliJ default file encoding for the generated response file (from UTF-8 to ISO-8859-1), and since then the documents are clear and correct... Don't forget that this problem seems to affect only the annexes. The regular content is always fine.

Question
  • I suppose this is an encoding problem in the annexes content. Am I correct?
  • Is there any way I can handle this on my side without impacting the users of the service, i.e. without requiring development on their side?

Other Information

I tried many bytes conversion without success, for instance:

new String(annexe, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);

But each time I got an exception:

java.io.IOException: java.util.zip.DataFormatException: invalid stored block lengths

The document is sent back as byte[] like this:

ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
pdfDocument.save(outputStream);
pdfDocument.close();
return outputStream.toByteArray();

Saving the document into a file is quite the same code, just a FileOutputStream is given instead.

Annexes are added to the document like this:

for(byte[] content : annexes) {
    PDDocument annex = PDDocument.load(content);
    for (PDPage page : annex .getPages()) {
        pdfDocument.importPage(page);
    }
}

I also tried the PDFMergerUtility but got the same result (blank pages for annexes)


Answer:

Thanks to Tilman Hausherr's suggestion, I tried encoding the byte[] with Base64.getEncoder().encode(...), and this does the job!

The client has to deal with a Base64 encoded string now but it works at least.

Thank you!
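The difference between the two approaches can be sketched in plain Java. The byte array below is a made-up stand-in for PDF content ("%PDF" followed by two arbitrary binary bytes), and the class name is invented for this example. Passing binary data through a charset conversion changes it, while Base64 restores it exactly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class.
final class BinaryTransportDemo {

    // The conversion tried in the question: every byte >= 0x80 becomes two bytes.
    static byte[] viaCharsetConversion(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    // Base64 turns the bytes into ASCII text and back without loss.
    static byte[] viaBase64(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {0x25, 0x50, 0x44, 0x46, (byte) 0xE2, (byte) 0xE3};
        System.out.println(Arrays.equals(data, viaCharsetConversion(data)));
        System.out.println(Arrays.equals(data, viaBase64(data)));
    }
}
```

Compressed streams inside a PDF are sensitive to any changed byte, which would be consistent with the DataFormatException reported in the question.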

Question:

I am creating a PDF document using the pdfbox library, version 2.0.2. How can I write a bullet character in the PDF? Has the EncodingManager class been removed in version 2? I cannot find it in the 2.0.2 jar.

Also, is it possible to write the bullet characters that are available in MS Word, for example?


Answer:

Do this:

    stream.setFont(PDType1Font.HELVETICA, 12);
    stream.showText("\u2022"); // bullet
    stream.setFont(PDType1Font.ZAPF_DINGBATS, 12);
    stream.showText("\u27A2"); // three-d top-lighted rightwards arrowhead

However the "three-d top-lighted rightwards arrowhead" will only be available from PDFBox version 2.0.3 upwards. It has not yet been released, but you can test it from here: https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.3-SNAPSHOT/

(In theory, the second character should also work with the Wingdings font by using PDType0Font.load, but it doesn't.)