Hot questions for Using PDFBox in fonts

Question:

I'm trying to fill out a bunch of PDF Forms using PDFBox 2.0.8. For some documents I get the following error when setting the PDTextField's value:

java.io.IOException: Could not find font: /ArialMT


Apparently the font is not correctly embedded as is often the case with proprietary Microsoft fonts.

How can I tell PDFBox to substitute the font, e.g. with "normal" Arial or some other font? Setting the field's DA string to "/Helv 0 Tf 0 g" resulted in a NullPointerException.

Based on the comments from Tilman Hausherr I built a first fix which works independently of the operating system (Linux in my case).

acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),


This will only work for this particular font, though. What's still missing - and was actually the main intention of my question - is an option to tell PDFBox to fall back to a certain font or DA string if the required font cannot be provided.

After Tilman again came to the rescue I can now present the complete solution. Again, this is Kotlin, not Java:

PDDocument.load(file).use { pdDocument ->
val acroForm = pdDocument.documentCatalog.acroForm
acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),
val pdField: PDField? = acroForm.getField(fieldname)
val value = ...
when (pdField) {
is PDCheckBox -> {
if (value is Boolean) {
when (value) {
true -> pdField.check()
false -> pdField.unCheck()
}
} else {
log.error("RENDER_FORM: Need Boolean for ${pdField.fullyQualifiedName} but got $value")
}
}
is PDTextField -> {
try {
pdField.value = value?.toString() ?: ""
} catch (ioException: IOException) {
pdField.cosObject.setString(COSName.DA, "/Helv 0 Tf 0 g")
pdField.value = value?.toString() ?: ""
log.error("RENDER_FORM: Writing text field failed: ${ioException.message}")
}
}
null -> {
log.error("RENDER_FORMULAR: Formfield $fieldname does not exist in $name")
}
else -> log.error("RENDER_FORMULAR: Formfield $pdField ($fieldname) is of unhandled type ${pdField.fieldType}")
}

val stream = ByteArrayOutputStream()
pdDocument.save(stream)
pdDocument.close()
return stream.toByteArray()
}


Question:

I have been searching the net and Stack Overflow for the past month about an issue with a web application that generates PDF files. I am using PDFBox app-2.0.4.jar and embedding text in these PDF files, exclusively with PDType1Font objects. These are passed as parameters, either PDType1Font.HELVETICA or PDType1Font.TIMES_ROMAN. The web application is hosted on Apache Tomcat. The PDF files are generated without any issues and are visually correct (bold/height/font type) throughout; however, when I check the web app's log, it shows the following:

<DEBUG 2017-05-17 00:13:19,270 - FontFileFinder - checkFontfile found C:\Windows\FONTS\vijayab.ttf
<DEBUG 2017-05-17 00:13:19,270 - FontFileFinder - checkFontfile check C:\Windows\FONTS\vrinda.ttf
<DEBUG 2017-05-17 00:13:19,270 - FontFileFinder - checkFontfile found C:\Windows\FONTS\vrinda.ttf
<DEBUG 2017-05-17 00:13:19,271 - FontFileFinder - checkFontfile check C:\Windows\FONTS\vrindab.ttf
<DEBUG 2017-05-17 00:13:19,271 - FontFileFinder - checkFontfile found C:\Windows\FONTS\vrindab.ttf
<DEBUG 2017-05-17 00:13:19,271 - FontFileFinder - checkFontfile check C:\Windows\FONTS\webdings.ttf
<DEBUG 2017-05-17 00:13:19,271 - FontFileFinder - checkFontfile found C:\Windows\FONTS\webdings.ttf
<DEBUG 2017-05-17 00:13:19,272 - FontFileFinder - checkFontfile check C:\Windows\FONTS\wingding.ttf
<DEBUG 2017-05-17 00:13:19,272 - FontFileFinder - checkFontfile found C:\Windows\FONTS\wingding.ttf
<DEBUG 2017-05-17 00:13:19,289 - FileSystemFontProvider - Loaded TimesNewRomanPSMT from C:\Windows\FONTS\times.ttf
<DEBUG 2017-05-17 00:13:19,290 - FileSystemFontProvider - Loaded TimesNewRomanPS-BoldMT from C:\Windows\FONTS\timesbd.ttf
<DEBUG 2017-05-17 00:13:19,291 - FileSystemFontProvider - Loaded TimesNewRomanPS-ItalicMT from C:\Windows\FONTS\timesi.ttf
<DEBUG 2017-05-17 00:13:19,292 - FileSystemFontProvider - Loaded TimesNewRomanPS-BoldItalicMT from C:\Windows\FONTS\timesbi.ttf
<DEBUG 2017-05-17 00:13:19,292 - FileSystemFontProvider - Loaded ArialMT from C:\Windows\FONTS\arial.ttf
<DEBUG 2017-05-17 00:13:19,293 - FileSystemFontProvider - Loaded Arial-BoldMT from C:\Windows\FONTS\arialbd.ttf
<DEBUG 2017-05-17 00:13:19,294 - FileSystemFontProvider - Loaded Arial-ItalicMT from C:\Windows\FONTS\ariali.ttf
<DEBUG 2017-05-17 00:13:19,295 - FileSystemFontProvider - Loaded Arial-BoldItalicMT from C:\Windows\FONTS\arialbi.ttf


To my understanding PDFBox comes preinstalled with its own font package so why am I getting these warnings?

1) "DEBUG" means it is a debug log entry. You have set logging to DEBUG level. Set it to "WARNING" and they will go away.

2) "To my understanding PDFBox comes preinstalled with its own font package" - no it doesn't, PDFBox has only one font (Liberation Sans Regular) as a worst case fallback. What you see is PDFBox collecting information about what fonts are installed.

3) The current PDFBox version is 2.0.6.
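
Point 1 can be sketched in code. PDFBox 2.x logs through Apache Commons Logging, so where the level is set depends on the backend in use. The following is only a minimal sketch that assumes java.util.logging is the active backend (there, Commons Logging maps DEBUG to FINE); the class and method names are illustrative. With log4j, the equivalent change would go into the log4j configuration instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietPdfBoxLogging {
    // Keep strong references: java.util.logging holds loggers only weakly,
    // so a configured logger could otherwise be garbage-collected.
    private static final Logger PDFBOX = Logger.getLogger("org.apache.pdfbox");
    private static final Logger FONTBOX = Logger.getLogger("org.apache.fontbox");

    private QuietPdfBoxLogging() {}

    /** Raises the PDFBox and FontBox loggers to WARNING so debug entries are dropped. */
    static void apply() {
        PDFBOX.setLevel(Level.WARNING);
        FONTBOX.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        apply();
        // FINE is what DEBUG maps to on this backend (assumption stated above).
        System.out.println("debug enabled:   " + PDFBOX.isLoggable(Level.FINE));
        System.out.println("warning enabled: " + PDFBOX.isLoggable(Level.WARNING));
    }
}
```

Child loggers such as the one for FileSystemFontProvider inherit the level from these two package loggers unless they are configured separately.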

Question:

I used PDFBox (2.0.11) to create/edit PDFs and struggled with two fonts that always led to the following exception:

This font does not permit subsetting


Even though it is possible to subset the fonts with other tools like Everything fonts without any issues.

Is it possible to use a font with PDFBox without subsetting it, or are there any other ways to solve this problem?

Exception message:

Exception in thread "main" java.io.IOException: This font does not permit subsetting
at org.apache.pdfbox.pdmodel.font.TrueTypeEmbedder.subset(TrueTypeEmbedder.java:298)
at org.apache.pdfbox.pdmodel.font.PDType0Font.subset(PDType0Font.java:239)


SOLVED:

Here is a working example on how to load a font without subsetting it:

File fontFile1 = new File("/fonts/fontfile.ttf");
InputStream fontFile1Stream = new FileInputStream(fontFile1);
PDType0Font product_title_font = PDType0Font.load(doc, fontFile1Stream, false);


Yes, you can still use the font without subsetting. Use

PDType0Font.load(PDDocument doc, InputStream input, boolean embedSubset)


with the last parameter set to false. Your files will be bigger, that's all. If another product can subset the font, then either that product doesn't respect the license settings or there's a bug in PDFBox. Open your font in a tool that can display the OS/2 table, e.g. DTL OTMaster Light, and look for the "fsType" entry: https://docs.microsoft.com/en-us/typography/opentype/spec/os2#fstype
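
To make the fsType check concrete, here is a minimal sketch of how that bit field can be read. The bit values are taken from the OpenType OS/2 specification linked above; the class and method names are made up for illustration, and this is not PDFBox's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class FsTypeFlags {
    // Bit values of the fsType field in the OpenType OS/2 table.
    static final int RESTRICTED_LICENSE = 0x0002; // restricted license embedding
    static final int PREVIEW_AND_PRINT = 0x0004;  // preview and print embedding
    static final int EDITABLE = 0x0008;           // editable embedding
    static final int NO_SUBSETTING = 0x0100;      // font may not be subsetted
    static final int BITMAP_ONLY = 0x0200;        // bitmap embedding only

    private FsTypeFlags() {}

    /** True unless the "no subsetting" bit is set. */
    static boolean subsettingPermitted(int fsType) {
        return (fsType & NO_SUBSETTING) == 0;
    }

    /** Lists the flags that are set, in human-readable form. */
    static List<String> describe(int fsType) {
        List<String> flags = new ArrayList<>();
        if ((fsType & 0x000F) == 0) flags.add("installable embedding");
        if ((fsType & RESTRICTED_LICENSE) != 0) flags.add("restricted license embedding");
        if ((fsType & PREVIEW_AND_PRINT) != 0) flags.add("preview and print embedding");
        if ((fsType & EDITABLE) != 0) flags.add("editable embedding");
        if ((fsType & NO_SUBSETTING) != 0) flags.add("no subsetting");
        if ((fsType & BITMAP_ONLY) != 0) flags.add("bitmap embedding only");
        return flags;
    }

    public static void main(String[] args) {
        int fsType = 0x0104; // example value: preview and print, no subsetting
        System.out.println(describe(fsType));
        System.out.println("subsetting permitted: " + subsettingPermitted(fsType));
    }
}
```

A font whose fsType has the 0x0100 bit set is the kind of font for which loading with embedSubset = false is the appropriate choice.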

Question:

Here is a simple offending pdf.

When I run DrawPrintTextLocations below is what I see..

But as far as I understand, the bounding box (in blue above) should be representative of the grey area that shows up in any PDF reader when you select the text, like below.

If a pdf reader is able to figure out the grey area to show for highlighting, one should be able to figure out the same and thus get to the font size(?). This question is for anyone to point me in the right direction.

Following are the details of "T" in the text "Test Line." from its TextPosition object variable text:

72.4801          //text.getXDirAdj()
1.0              //text.getFontSize()
50.0             //text.getFontSizeInPt()   ::I'm unable to decipher the 50.0
12.0             //text.getXScale()         ::Can I assume this to be the font size
8.004            //text.getHeightDir()
7.8984           //text.getWidthOfSpace()
950.0            //fontDesc.getAscent()
-222.0           //fontDesc.getDescent()
[x=72.4801,y=75.7560,w=7.1160,h=8.0040]
//Red Box boundaries
[x=72.4801,y=46.3560,w=7.1160,h=66.9600]    //The height of 66.96 relates to 50 but not sure how?
//Blue Bounding Box boundaries


Questions:

1. Bounding box issue: this does not seem to be consistent when I call font.getBoundingBox(). Is there a workaround for this?
2. getFontSizeInPt(): this method seems to be influenced by the bounding box. Am I right in thinking so? (The font size in pt is showing as 50.)
3. What is the way to get the font size in points?

I need the font size because my task is to recreate PDFs using different fonts.

Also here is a case of the correct pdf but the font size shows up as 16 instead of 12 which was used initially.

For similar pdf with proper bounding box, below are the details:-

Output from DrawPrintTextLocations is

Following are the details of "T" in the text "Test Line." from its TextPosition object variable text:

72.0605           //text.getXDirAdj()
16.0              //text.getFontSize()      :: Why is this showing 16 while my font is 12 in size
16.0              //text.getFontSizeInPt()
12.0101           //text.getXScale()        ::Can I assume this to be the font size
6.6618            //text.getHeightDir()
2.6447            //text.getWidthOfSpace()
778.808           //fontDesc.getAscent()    :: There seems to be an issue with the ascent
-222.1680         //fontDesc.getDescent()
[x=72.0605,y=76.6581,w=7.1193,h=6.6618]
//Red Box boundaries
[x=72.0605,y=72.6176,w=7.1193,h=13.3237]    //The height of 13.3237 relates to 12 the font size but not sure how?
//Blue Bounding Box boundaries


UPDATED AFTER USING MKL's ANSWER

The below is what worked for me...

//Make Line
Line2D.Float line = new Line2D.Float(0,0,0,1f);
LOG.debug("Line<Before Transform>:" + line.getBounds2D());
s=myTextMatrix.createAffineTransform().createTransformedShape(line);
LOG.debug("Line after AT:"+s.getBounds2D());
s=pageFlipAffineTransform.createTransformedShape(s);
s=pageRotateAffineTransform.createTransformedShape(s);
rect2 = s.getBounds2D();
LOG.debug("Line<After Transform>:" + rect2);
//Font Size
double wi=rect2.getWidth();
double he=rect2.getHeight();
double total=Math.sqrt(wi*wi+he*he);//This is done in case of rotation
long fntSizeinPt = Math.round(total);
LOG.debug("deciphered Font Size is:" + fntSizeinPt);


1. Bounding Box Issue: Seems like this is not consistent when I call font.getBoundingBox(). Is there a work around for this?

As already mentioned in a comment, I could not exactly reproduce your observations as you described them: I get the overextended bounding boxes for the second PDF, too! In both cases this is consistent with the font information in the PDF: the font descriptor FontBBox values are [-1475 -2463 2867 3117] and [-1474.60938 -2463.3789 2867.6758 3116.6992] respectively, both of which are extremely large, the former seemingly a rounded version of the latter.

The third PDF you provided (and replaced the second with) uses two fonts. The one used for the actual "Test Line." characters has a font descriptor FontBBox value of [-19 -218 956 891], the values of which are more normal. As a consequence the blue frames drawn by DrawPrintTextLocations around those characters make more sense. The second font has a somewhat large FontBBox value, [-1462 -813 1723 1134], and the result is the two blue lines exceeding the blue frames: the only characters used from that other font are zero-width spaces, so the boxes around them also have a zero width...

Thus, still everything is consistent.

A possible work-around would be not to trust the information from the PDF representation of the font but instead to inspect the embedded font program.

2. getFontSizeInPt(): This method seems to be influenced by the bounding box. Am I right in thinking so? (As the Font Size in Pt is showing as 50)

No. You must be aware that in PDFs the scale of drawn text depends on a number of items:

• First there is the "font size" you set together with the font using the Tf instruction;
• then there is the text matrix which can scale this size up or down;
• then there is the current transformation matrix which again can scale this size up or down;
• and finally there is the page UserUnit value which can again scale this size up.

In your first document you have a font size of 1 which the text matrix scales up to 50 and the current transformation matrix then scales down again to 12 which the UserUnit default value leaves as is.

In your second and third document you have a font size of 16 which the text matrix leaves as is and the current transformation matrix scales down to 12, once again left as is by the UserUnit default.

The FontSizeInPt is a value you get after the second step (well, kind of, merely the top left entry of the text matrix is taken into account). As the situation in your documents shows, it essentially is a mere intermediate result of no further interest. Furthermore, the bounding box has no part in its calculation.

3. What is the way to get FontSize in points?

IMO you should take a vertical line as long as the font size value, apply the text matrix and the current transformation matrix, take the length of the resulting line and multiply that by the page UserUnit value.

The TextMatrix value of the TextPosition already combines a number of those steps; in spite of its name it is not the text matrix as specified in the PDF specification but more, cf. its documentation:

/**
* The matrix containing the starting text position and scaling. Despite the name, it is not the
* text matrix set by the "Tm" operator, it is really the effective text rendering matrix (which
* is dependent on the current transformation matrix (set by the "cm" operator), the text matrix
* (set by the "Tm" operator), the font size (set by the "Tf" operator) and the page cropbox).
*
* @return The Matrix containing the starting text position
*/
public Matrix getTextMatrix()


Thus, if m is that Matrix, you merely need to apply it to the points (0, 0) and (0, 1), measure the distance of the resulting points, and multiply that distance by the page UserUnit value (which very often is 1).
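
The measurement just described can be sketched as follows. This is a minimal, hedged example that only uses java.awt.geom so that it runs without PDFBox; in PDFBox the AffineTransform would presumably come from text.getTextMatrix().createAffineTransform(), as in the asker's update above. The class name, the method name and the CTM scale of 0.24 (chosen so that 50 becomes 12, as in the first document) are illustrative assumptions.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class EffectiveFontSize {
    private EffectiveFontSize() {}

    /**
     * Transforms the unit vertical line (0,0)-(0,1) with the effective text
     * rendering matrix and returns its length scaled by the page UserUnit.
     */
    static double effectiveFontSize(AffineTransform textRenderingMatrix, double userUnit) {
        Point2D origin = textRenderingMatrix.transform(new Point2D.Double(0, 0), null);
        Point2D top = textRenderingMatrix.transform(new Point2D.Double(0, 1), null);
        return origin.distance(top) * userUnit;
    }

    public static void main(String[] args) {
        // Stand-in for the first document: CTM scales by 0.24 (assumed),
        // text matrix scales by 50, Tf font size is 1.
        AffineTransform m = AffineTransform.getScaleInstance(0.24, 0.24);
        m.scale(50, 50);
        m.scale(1, 1);
        System.out.println(Math.round(effectiveFontSize(m, 1.0))); // prints 12

        // The length of the transformed line is unaffected by rotation.
        AffineTransform rotated = AffineTransform.getRotateInstance(Math.PI / 2);
        rotated.scale(12, 12);
        System.out.println(Math.round(effectiveFontSize(rotated, 1.0))); // prints 12
    }
}
```

Measuring the distance between the two transformed points, rather than reading a single matrix entry, is what keeps the result correct for rotated text.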

Question:

I want to create a PDF file conforming to the PDF/A standard by using Apache PDFBox. To conform to PDF/A, all used fonts have to be embedded. I can use either the standard fonts or load one from a file, but I need to adjust the character width of several glyphs. I can do this by loading a font (or using a standard font) and modify it afterwards, as shown below.

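doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);
InputStream fontStream = PDFCreator.class.getResourceAsStream("ArialMT.ttf");
// Load the TrueType font into the document (PDFBox 1.8.x API)
PDFont font = PDTrueTypeFont.loadTTF(doc, fontStream);
List<Float> test = font.getWidths();
// Index 101-32: character code of 'e' (101) minus the first character code (32)
test.set(101-32, 2000f);
font.setWidths(test);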


But how can I embed the modified font?

Embedding the original font with a patched Widths font dictionary entry

If you use your manipulated font, it will be embedded. If you e.g. continue your code like this:

PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(font, 12);
stream.beginText();
stream.moveTextPositionByAmount(30, 600);
stream.drawString("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
stream.moveTextPositionByAmount(0, -20);
stream.drawString("abcdefghijklmnopqrstuvwxyz");
stream.moveTextPositionByAmount(0, -20);
stream.drawString("0123456789");
stream.endText();
stream.close();

doc.save("embedFont.pdf");


you get a PDF like this using PDFBox 1.8.8:

As you see, your manipulation of the width of 'e'

test.set(101-32, 2000f);


makes the space for that letter fairly broad.

If you look into the PDF, you will find this Widths array in the font dictionary:

/Widths [278.0 278.0 355.0 556.0 556.0 889.0 667.0 191.0 333.0 333.0
389.0 584.0 278.0 333.0 278.0 278.0 556.0 556.0 556.0 556.0
556.0 556.0 556.0 556.0 556.0 556.0 278.0 278.0 584.0 584.0
584.0 556.0 1015.0 667.0 667.0 722.0 722.0 667.0 611.0 778.0
722.0 278.0 500.0 667.0 556.0 833.0 722.0 778.0 667.0 778.0
722.0 667.0 611.0 722.0 667.0 944.0 667.0 667.0 611.0 278.0
278.0 278.0 469.0 556.0 333.0 556.0 556.0 500.0 556.0 2000.0
...


Your 2000 is there alright. As far as the PDF is concerned, your change has been stored.
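
Why the changed value ends up at that position: in a simple font's dictionary, entry i of the Widths array holds the width for character code FirstChar + i. A minimal sketch of the index calculation, assuming FirstChar is 32 as the 101-32 in the question implies (names are illustrative):

```java
final class WidthsIndex {
    private WidthsIndex() {}

    /** Position of a character code inside a Widths array that starts at firstChar. */
    static int widthsIndex(int charCode, int firstChar) {
        if (charCode < firstChar) {
            throw new IllegalArgumentException("character code is below FirstChar");
        }
        return charCode - firstChar;
    }

    public static void main(String[] args) {
        // 'e' has character code 101; with FirstChar = 32 it is the 70th entry.
        System.out.println(widthsIndex('e', 32)); // prints 69
    }
}
```

Index 69 is the last value in the seventh row of the array excerpt above, which is exactly where the 2000.0 appears.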

Admittedly, though, the width of 'e' in the embedded font program is not changed. If you want to change that, you should pre-process the font file with the widths adjusted:

Embedding a patched font

You can use e.g. Google's sfntly to patch the font on the fly. In that case the analog code could look like this:

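byte[] fontBytes = null;
try (   InputStream arialMtResource = getClass().getResourceAsStream("ArialMT.ttf");
ByteArrayOutputStream baos = new ByteArrayOutputStream()   )
{
// Patch the font before embedding it. glyphIndex is a placeholder name: it must be
// the hmtx entry of the glyph to widen, looked up for the font at hand.
patchAdvanceWidth(arialMtResource, baos, glyphIndex, 2000);
fontBytes = baos.toByteArray();
}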

try (   ByteArrayInputStream fontStream = new ByteArrayInputStream(fontBytes);   )
{
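PDDocument doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage(page);
// Load the patched font bytes into the document (PDFBox 1.8.x API)
PDFont font = PDTrueTypeFont.loadTTF(doc, fontStream);
PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(font, 12);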
stream.beginText();
stream.moveTextPositionByAmount(30, 600);
stream.drawString("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
stream.moveTextPositionByAmount(0, -20);
stream.drawString("abcdefghijklmnopqrstuvwxyz");
stream.moveTextPositionByAmount(0, -20);
stream.drawString("0123456789");
stream.endText();
stream.close();

doc.save("target/test-outputs/embedPatchedFont.pdf");
}


making use of this helper method:

void patchAdvanceWidth(InputStream is, OutputStream os, int entry, int newValue) throws IOException
{
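FontFactory fontFactory = FontFactory.getInstance();
// Load the font(s) from the stream as editable builders and take the first one
Builder[] builders = fontFactory.loadFontsForBuilding(is);
Builder builder = builders[0];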

HorizontalMetricsTable.Builder hmtxBuilder = (HorizontalMetricsTable.Builder) builder.getTableBuilder(Tag.hmtx);
WritableFontData hmtxData = hmtxBuilder.data();

int offset = 0 + (entry * 4) + 0;
hmtxData.writeUShort(offset, newValue);
hmtxBuilder.setData(hmtxData);

Font font = builder.build();
fontFactory.serializeFont(font, os);
}
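
The offset computation in the helper relies on the layout of the hmtx table: it starts with one 4-byte record per glyph, an unsigned 16-bit advance width followed by a signed 16-bit left side bearing, both big-endian. The following sketch shows the same patch on a raw byte array without sfntly. It is illustrative only, uses made-up names, and assumes the entry lies within that leading run of 4-byte records.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class HmtxPatch {
    private static final int RECORD_SIZE = 4; // uint16 advanceWidth + int16 leftSideBearing

    private HmtxPatch() {}

    /** Overwrites the advance width of the given hmtx entry in place. */
    static void patchAdvanceWidth(byte[] hmtx, int entry, int newValue) {
        if (newValue < 0 || newValue > 0xFFFF) {
            throw new IllegalArgumentException("advance width must fit into an unsigned 16-bit value");
        }
        int offset = entry * RECORD_SIZE;
        if (entry < 0 || offset + 2 > hmtx.length) {
            throw new IndexOutOfBoundsException("entry outside of table data");
        }
        ByteBuffer.wrap(hmtx).order(ByteOrder.BIG_ENDIAN).putShort(offset, (short) newValue);
    }

    /** Reads the advance width of the given hmtx entry. */
    static int advanceWidth(byte[] hmtx, int entry) {
        return ByteBuffer.wrap(hmtx).order(ByteOrder.BIG_ENDIAN).getShort(entry * RECORD_SIZE) & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] hmtx = new byte[8];           // two empty records as dummy data
        patchAdvanceWidth(hmtx, 1, 2000);
        System.out.println(advanceWidth(hmtx, 1)); // prints 2000
    }
}
```

Only the first two bytes of the record are written, so the left side bearing of the glyph stays untouched.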


Question:

I am converting a page in a PDF document to bytes and then constructing an image out of it.

On Windows, the image is constructed fine. On Linux, the letters on the image look smudged (they overlap each other).

In the logs (WebLogic), I see the following, indicating that the required fonts are missing on Linux.

<Dec 3, 2019 11:06:35 PM EST> <Warning> <org.apache.pdfbox.pdmodel.font.PDType1Font> <BEA-000000> <Using fallback font LiberationSans for Helvetica-Bold>
<Dec 3, 2019 11:06:35 PM EST> <Warning> <org.apache.pdfbox.pdmodel.font.PDType1Font> <BEA-000000> <Using fallback font LiberationSans for Times-Roman>
<Dec 3, 2019 11:06:35 PM EST> <Warning> <org.apache.pdfbox.pdmodel.font.PDType1Font> <BEA-000000> <Using fallback font LiberationSans for Times-Bold>
<Dec 3, 2019 11:06:35 PM EST> <Warning> <org.apache.pdfbox.pdmodel.font.PDType1Font> <BEA-000000> <Using fallback font LiberationSans for Times-Italic>
<Dec 3, 2019 11:06:35 PM EST> <Warning> <org.apache.pdfbox.pdmodel.font.PDType1Font> <BEA-000000> <Using fallback font LiberationSans for Helvetica>


How can I supply the missing fonts on Linux? I see references to using a properties file (PDFBox_External_Fonts.properties) in versions before 2. What can I do on PDFBox version 2.0.17? I am unable to find any documentation on how to proceed.

PDFBox loads the system's fonts through the following classes (you can check their sources):

Linux: org.apache.fontbox.util.autodetect.UnixFontDirFinder.java
Windows: org.apache.fontbox.util.autodetect.WindowsFontDirFinder.java

Solution 1: add the missing fonts to any directory, then add that directory to the search list in the classes above.
Solution 2: use Tilman Hausherr's solution, as you mentioned.

One more thing: when PDFBox first loads all the fonts on the system, it creates a file named .pdfbox.cache. If you want PDFBox to reload the fonts, or to pick up newly added ones, you need to delete that file first. Please let me know if you have any concerns.
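
Deleting the cache file can also be done from code before PDFBox is first used. The sketch below is an assumption-laden illustration: it presumes the file sits in the directory named by the pdfbox.fontcache system property or, if that is unset, in the user's home directory, which you should verify for your PDFBox version. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class FontCacheReset {
    private FontCacheReset() {}

    /** Deletes .pdfbox.cache in the given directory; returns true if a file was removed. */
    static boolean deleteFontCache(Path cacheDir) throws IOException {
        return Files.deleteIfExists(cacheDir.resolve(".pdfbox.cache"));
    }

    public static void main(String[] args) throws IOException {
        // Assumed lookup order: explicit system property first, then the home directory.
        String dir = System.getProperty("pdfbox.fontcache", System.getProperty("user.home"));
        System.out.println(deleteFontCache(Paths.get(dir)) ? "cache deleted" : "no cache file found");
    }
}
```

After the file is gone, the next font lookup makes PDFBox scan the font directories again, so newly installed fonts are picked up.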

Question:

I am using PDFBox to extract text from several PDF docs, and whilst running my unit test suite (via Gradle) I am getting intermittent failures caused by a NullPointerException - my base assumption now being that it is caused by multiple threads attempting to load the font into the font dictionary cache at the same time.

I know, as is stated in the FAQs, that PDFBox is not thread-safe, but the impression I have got from that and from this discussion is that this relates specifically to multiple threads accessing a document at the same time, and the comment appears to suggest that the FontBox cache is expected to be thread-safe.

The exception I am getting in my unit test is:

WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldItalicMT'
java.lang.NullPointerException:
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFont(FontMapperImpl.java:463)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:417)
at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getTrueTypeFont(FontMapperImpl.java:321)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:123)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
...
Oct 03, 2016 12:21:24 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
Oct 03, 2016 12:21:24 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>


I am using PDFBox version 2.0.2

Anyone come across this before?

This has been fixed in the PDFBox library from version 2.0.4

Details in the original ticket here: https://issues.apache.org/jira/browse/PDFBOX-3521

Question:

I am trying to get the font colour from PDFBox and I seem to keep throwing an exception. Can someone help? The way I tried to obtain the colour was (page is the PDPage I obtained):

PDResources resources = page.getResources();
Iterable<COSName> fontNames = resources.getFontNames();
for (COSName fontName:fontNames)
System.out.println("name: " + resources.getFont(fontName).getName() +
"colour: " + resources.getColorSpace(fontName).getName());


This prints out the exception:

org.apache.pdfbox.pdmodel.MissingResourceException: Missing color space: F1


Could someone tell me how to properly get the colour of a font obtained in this manner?

/**
* This is an example on how to get the colors of text. Note that this will not tell the background,
* and will only work properly if the text is not overwritten later, and only if the text rendering
* modes are 0, 1 or 2. In the PDF 32000 specification, please read 9.3.6 "Text Rendering Mode" to
* know more. Mode 0 (FILL) is the default. Mode 1 (STROKE) will make glyphs look "hollow". Mode 2
* (FILL_STROKE) will make glyphs look "fat".
*
* @author Ben Litchfield
* @author Tilman Hausherr
*/
public class PrintTextColors extends PDFTextStripper
{
/**
* Instantiate a new PDFTextStripper object.
*
*/
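public PrintTextColors() throws IOException
{
// Register the color operators; the text stripper does not process them by default,
// so without these the graphics state would always report the default colors.
addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());
}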

/**
* This will print the documents data.
*
* @param args The command line arguments.
*
* @throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException
{
if (args.length != 1)
{
usage();
}
else
{
PDDocument document = null;
try
{
// Load the PDF named on the command line
document = PDDocument.load(new File(args[0]));

PDFTextStripper stripper = new PrintTextColors();
stripper.setSortByPosition(true);
stripper.setStartPage(0);
stripper.setEndPage(document.getNumberOfPages());

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
}
finally
{
if (document != null)
{
document.close();
}
}
}
}

@Override
protected void processTextPosition(TextPosition text)
{
super.processTextPosition(text);

PDColor strokingColor = getGraphicsState().getStrokingColor();
PDColor nonStrokingColor = getGraphicsState().getNonStrokingColor();
String unicode = text.getUnicode();
RenderingMode renderingMode = getGraphicsState().getTextState().getRenderingMode();
System.out.println("Unicode:            " + unicode);
System.out.println("Rendering mode:     " + renderingMode);
System.out.println("Stroking color:     " + strokingColor);
System.out.println("Non-Stroking color: " + nonStrokingColor);
System.out.println();

// See the PrintTextLocations for more attributes
}

/**
* This will print the usage for this document.
*/
private static void usage()
{
System.err.println("Usage: java " + PrintTextColors.class.getName() + " <input-pdf>");
}
}