Hot questions for using PDFBox in Scala

Question:

I have a PDF file in Devanagari. Some of the glyphs are mapped incorrectly. I want to extract all these glyphs from the PDF file and map them to the correct Unicode. How do I extract the glyphs of a PDF file?

https://1drv.ms/b/s!AmHcFaD-gMGyhipy6feWmHK7Ea-P


Answer:

The OP clarified in a comment that he essentially wants the glyph IDs instead of the characters they are mapped to by ToUnicode. As the font in question has an Identity-H encoding, the glyph IDs coincide with the character codes.

The character codes of the text glyphs are contained in the TextPosition objects processed by the text stripper. Thus, you have to add your own code to the stripper in a method which still has these TextPosition objects.

The final method for which this is true is writeString(String, List<TextPosition>), which by default ignores the second parameter and calls writeString(String) with the first parameter.

You in contrast must not ignore the second parameter but inspect it, e.g. like this:

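import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// resource: the PDF to read, e.g. as an InputStream
PDDocument document = PDDocument.load(resource);
PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        // Write each glyph's Unicode text followed by the character codes it was extracted from
        for (TextPosition textPosition : textPositions) {
            writeString(String.format("%s%s", textPosition.getUnicode(), Arrays.toString(textPosition.getCharacterCodes())));
        }
    }
};
//stripper.setSortByPosition(true);
String text = stripper.getText(document);

System.out.printf("\n*\n* singNepChar.pdf\n*\n%s\n", text);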

(ExtractCharacterCodes test testExtractFromSingNepChar)

This example only outputs each extracted character alongside the character code it was extracted from. You can instead do any evaluation of the given data, e.g. a mapping to Unicode based on the character code and additional information you may have.

You actually have much more information at hand. In particular, the TextPosition also contains the font object of the text (via getFont). As the character codes may differ from font to font, this information might become important to you.

In case of your sample document the output is

*
* singNepChar.pdf
*
क[1399] [3]ख[1400] [3]ग[1401] [3]घ[1402] [3]ङ[1403] [3]च[1404] [3]छ[1405] [3]ज[1406] [3]झ[1407] [3]ञ[1408] [3]ट[1409] [3]ठ[1410] [3]ड[1411] [3]ढ[1412] [3]ण[1413] [3]त[1414] [3]थ[1415] [3]द[1416] [3]ध[1417] [3]न[1418] [3]प[1420] [3]फ[1421] [3]ब[1422] [3]भ[1423] [3]म[1424] [3]य[1425] [3]र[1426] [3]ल[1428] [3]व[1431] [3]श[1432] [3]ष[1433] [3]स[1434] [3]ह[1435] [3]क्ष[6979] [3]त्र[7074] [3]ज्ञ[6980] [32]
ऄ[1383] [3]अ[1384] [3]आ[1385] [3]इ[1386] [3]ई[1387] [3]उ[1388] [3]ऊ[1389] [3]ए[1393] [3]ऐ[1394] [3] [3]ओ[1397] [3]औ[1398] [32]ऄ[1383]ं[1381] [3]ऄ[1383]ः[1382] [32]
 [32]
 [32]
 [32]
 [32]
 [32]
 [32]
 [32]
 [32]

(Beware, my outputs are decimal while the data in your comments are hexadecimal.)
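
The mapping from character code to Unicode suggested above could be sketched as follows. This is only a minimal sketch and not part of PDFBox: the class GlyphCodeMapper, its methods, and the font name are hypothetical, and the single correction entry is purely illustrative (which codes are actually mis-mapped has to be determined from the document). It assumes you collect the font name (for example via textPosition.getFont().getName()) and the character codes inside writeString, and it keys the corrections by font because codes may differ from font to font.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of PDFBox: per-font corrections from character code to Unicode. */
public class GlyphCodeMapper {
    // font name -> (character code -> corrected Unicode text)
    private final Map<String, Map<Integer, String>> corrections = new HashMap<>();

    /** Registers a corrected Unicode string for one character code of one font. */
    public void addCorrection(String fontName, int code, String unicode) {
        corrections.computeIfAbsent(fontName, k -> new HashMap<>()).put(code, unicode);
    }

    /** Returns the corrected text, or the fallback (e.g. the ToUnicode result) if none is registered. */
    public String map(String fontName, int code, String fallback) {
        Map<Integer, String> perFont = corrections.get(fontName);
        if (perFont == null) {
            return fallback;
        }
        return perFont.getOrDefault(code, fallback);
    }

    /** Formats a decimal character code as four hex digits, for comparison with hexadecimal data. */
    public static String toHex(int code) {
        return String.format("%04X", code);
    }

    public static void main(String[] args) {
        GlyphCodeMapper mapper = new GlyphCodeMapper();
        String font = "ABCDEF+SomeDevanagariFont"; // hypothetical font name

        // Illustrative assumption only: treat code 1383 as U+0905 instead of U+0904.
        mapper.addCorrection(font, 1383, "\u0905");

        // A code with a registered correction yields the corrected text.
        System.out.println(mapper.map(font, 1383, "\u0904").equals("\u0905"));
        // A code without a correction keeps the fallback (here U+0915 for code 1399).
        System.out.println(mapper.map(font, 1399, "\u0915").equals("\u0915"));
        // Decimal 1399 corresponds to hexadecimal 0577.
        System.out.println(toHex(1399));
    }
}
```

In a real stripper you would call map(...) from the overridden writeString, passing textPosition.getUnicode() as the fallback.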

Question:

I'm having trouble understanding the behavior of PDFBox when attempting to append text to a page's content stream. I am using a sample scanned PDF which is just a raster image overlaid on the page. My working knowledge of PDF internals is somewhat basic, so I may be on the wrong track.

http://solutions.weblite.ca/pdfocrx/scansmpl.pdf

I am using PDFBox 2.0.11 with sbt: "org.apache.pdfbox" % "pdfbox" % "2.0.11"

My first step is to create a content stream and write "hello world" on the PDF, which I accomplished with the following:

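import org.apache.pdfbox.pdmodel.PDPageContentStream
import org.apache.pdfbox.pdmodel.font.PDType1Font

// val pdf: PDDocument
val page = pdf.getPage(0)
// third argument: appendContent = false (overwrite), fourth: compress = true
val contentStream = new PDPageContentStream(pdf, page, false, true)
contentStream.beginText()
contentStream.newLineAtOffset(0, 0)
contentStream.setFont(PDType1Font.COURIER, 12)
contentStream.showText("Hello, world!")
contentStream.endText()
contentStream.close()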

This works, and the text shows up in the bottom left, which is where I expected it to be. But it of course overwrites the raster image, which is not what I want. So, I change the PDPageContentStream constructor to (pdf, page, true, true) to make it append to the content stream.

Now I get bizarre behavior that I don't understand. The text shows up huge. So big that I can only see the bottom corner of the H because it is at least 10x larger than the page itself. I guess this means there's some dangling matrix transformation that is occurring? I'm not sure that I fully understand how the transformation operations work within a PDF. PDFBox seems to imply that calling setTextMatrix replaces the existing matrix with the new one, rather than it being relative to the existing text matrix. I can get the text to be visible (and close to normal size) with this:

import java.awt.geom.AffineTransform
import org.apache.pdfbox.util.Matrix

val affine = new AffineTransform()
affine.setToIdentity()
affine.scale(0.002, 0.002)
// code
contentStream.setTextMatrix(new Matrix(affine))

I only discovered this through trial and error. I don't see any way to get the current transformation matrix state other than the page-wide .getMatrix(), but that appears to return the identity regardless of whether I'm appending or overwriting, so I don't think it is that. Additionally, if I apply another text matrix with the exact same call as the last line in the previous block, it appears to scale relative to the previous scale, so I end up with a second text block that is too small to see.

How can I get the current transformation matrix so that I can invert it to reach the actual desired scaling?

Thanks!


Answer:

It appears that this was the issue: "PDFBox : PDPageContentStream's append mode misbehaving". I had not seen the constructor with the fifth argument, resetContext, before. In my case, adding that fifth argument, i.e. calling the constructor with (pdf, page, true, true, true), solves the problem. I'm still unsure how you would get the current context if you for some reason needed to do something relative to that context, though.
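
Why the fifth argument helps can be illustrated without PDFBox. Content stream operators act on a graphics state whose current transformation matrix (CTM) is modified by cm and saved and restored by q and Q, so appended text inherits whatever CTM the existing content left behind. The following is a toy model under stated assumptions, not PDFBox code: the class GraphicsStateDemo and the 500 x 700 scale are made up (the matrix actually left behind in scansmpl.pdf was not inspected), and it assumes that resetContext has the effect of wrapping the existing content in q ... Q.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a PDF graphics state; hypothetical, not PDFBox code. */
public class GraphicsStateDemo {
    private AffineTransform ctm = new AffineTransform(); // identity at page start
    private final Deque<AffineTransform> stack = new ArrayDeque<>();

    void save() { stack.push(new AffineTransform(ctm)); }   // models the "q" operator
    void restore() { ctm = stack.pop(); }                   // models the "Q" operator
    void concat(AffineTransform m) { ctm.concatenate(m); }  // models the "cm" operator

    /** Assumed existing content: scales the CTM (e.g. to place an image) and never restores it. */
    void existingContent() { concat(AffineTransform.getScaleInstance(500, 700)); }

    /** Horizontal scale that appended text sees when nothing is reset. */
    public static double appendWithoutReset() {
        GraphicsStateDemo gs = new GraphicsStateDemo();
        gs.existingContent();
        return gs.ctm.getScaleX();
    }

    /** Horizontal scale that appended text sees when the old content is wrapped in q ... Q. */
    public static double appendWithReset() {
        GraphicsStateDemo gs = new GraphicsStateDemo();
        gs.save();
        gs.existingContent();
        gs.restore();
        return gs.ctm.getScaleX();
    }

    /** Alternative: undo a known leftover CTM by concatenating its inverse. */
    public static double appendWithInverse() {
        GraphicsStateDemo gs = new GraphicsStateDemo();
        gs.existingContent();
        try {
            gs.concat(gs.ctm.createInverse());
        } catch (NoninvertibleTransformException e) {
            throw new IllegalStateException("CTM cannot be inverted", e);
        }
        return gs.ctm.getScaleX();
    }

    public static void main(String[] args) {
        System.out.println(appendWithoutReset()); // prints 500.0
        System.out.println(appendWithReset());    // prints 1.0
        System.out.println(appendWithInverse());  // approximately 1.0
    }
}
```

In this model the inverse variant only works because the leftover matrix is known. How to obtain it from PDFBox is the open point the answer mentions and is not covered by this sketch.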