Hot questions for Using PDFBox in string

Question:

When I try to write illegal characters to a PDF I obviously get an exception. E.g.

contentStream.showText("some illegal characters");    
...
java.lang.IllegalArgumentException: U+000A ('controlLF') is not available in this font Helvetica (generic: ArialMT) encoding: WinAnsiEncoding...

How can I find out which characters are not supported and strip them from the string?


Answer:

Here is my solution... at least it works for what I need. I used the WinAnsiEncoding class of PDFBox and called the contains method to check if the character is supported.

import org.apache.pdfbox.pdmodel.font.encoding.WinAnsiEncoding;

public class Test extends WinAnsiEncoding {

    public static String remove(String test) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < test.length(); i++) {
            if (WinAnsiEncoding.INSTANCE.contains(test.charAt(i))) {
                b.append(test.charAt(i));
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        System.out.println(remove("abc\rcde"));
        // prints abccde
    }

}
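
The same idea can be written so that it walks the string by code point and takes the support check as a parameter. The sketch below is only an illustration and not PDFBox API: the class name UnsupportedCharStripper and the isSupported predicate are hypothetical, and in real use the predicate would wrap whatever check fits the font, such as the WinAnsiEncoding.INSTANCE.contains(...) call above.

```java
import java.util.function.IntPredicate;

// Hypothetical helper for illustration; not part of PDFBox.
final class UnsupportedCharStripper {

    // Returns a copy of text that keeps only the code points accepted by isSupported.
    static String strip(String text, IntPredicate isSupported) {
        StringBuilder kept = new StringBuilder(text.length());
        text.codePoints()
            .filter(isSupported)
            .forEach(kept::appendCodePoint);
        return kept.toString();
    }
}
```

For example, UnsupportedCharStripper.strip("abc\rcde", cp -> cp >= 0x20) returns abccde, because the carriage return is rejected by the predicate.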

Question:

I want to use Apache PDFBox 1.8.8 to create a PDF that contains unicode characters but I'm confused about what is supported and what isn't.

An answer posted here suggests it is a bug that has been fixed on the trunk.

Another answer posted here suggests that I have to do the translation myself.

And another (older) answer posted here talks about embedding fonts.

Please can someone clarify. Also, if it was a bug that is now fixed, can someone tell me when the next release of PDFBox is likely to be.

Thanks.


Answer:

Essentially all the answers you linked to are correct. You have to keep in mind which PDFBox version they respectively refer to.

concerning this answer:

In the pre-2.0.0 versions (up to the current 1.8.8) the text drawing operations were very limited and didn't support even the full WinAnsi encoding which font objects generated by these versions used as encoding.

concerning this answer:

The current 2.0.0-SNAPSHOT development state has much improved. This means that the limitations of the text drawing operations have been removed, they properly encode the text and the used fonts are properly encoded and embedded. Bugs in the early implementations of these improvements meanwhile have mostly been fixed.

concerning this answer:

This answer points to something one needs to keep in mind, no matter which PDFBox version one uses: specific fonts do not necessarily support the whole Unicode range of code points. If the font you use does not contain a glyph definition for a character, you can encode as much as you want, your character won't be drawn properly. This especially concerns the standard 14 fonts which every PDF viewer has to support: they need only support characters from a few Latin-style encodings, by far not the full Unicode set.

Question:

In my implementation of PDFBox, I have created methods to write strings in multiple languages by testing out different fonts.

PDFont currentFont = PDType0Font.load(pdfDocument, new File("path/to/font/font.ttf"));
for (int offset = 0; offset < sValue.length();) {
    int iCodePoint = sValue.codePointAt(offset);
    boolean isEncodable = isCodePointEncodable(currentFont, iCodePoint);
    //-Further logic here, etc.

    offset += Character.charCount(iCodePoint);
}

private boolean isCodePointEncodable (PDFont currentFont, int iCodePoint) throws IOException {
    StringBuilder st = new StringBuilder();
    st.appendCodePoint(iCodePoint);
    try {
        currentFont.encode(st.toString());
        return true;
    } catch (IllegalArgumentException iae) {
        return false;
    }
}

While this works fine for anything in the Basic Multilingual Plane (BMP), anything that involves code points beyond the BMP does not work. I have downloaded the fonts involved, examined them extensively with glyph charts, and logged each code point. For instance, when attempting to encode 🚁, which is U+1F681 (decimal 128641), I tracked the logging and found that it specifically attempted to encode this character in NotoEmoji-Regular.ttf, which is the correct matching font and does indeed contain this character. Unfortunately, it still returned false.

Specifically, my logging server returned this:

Code Point 128641 (🚁) cannot be encoded in font NotoEmoji

Are there any workarounds or solutions for this? Thank you.


Answer:

I have created and resolved issue PDFBOX-3997. The cause was that we didn't use the best possible cmap subtable.

There is no workaround but the bug will be fixed in version 2.0.9, coming in a few months. But you don't have to wait that long - you can test with a snapshot build.
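
For readers wondering why the loop in the question advances by Character.charCount, here is a minimal sketch using only the JDK: a code point beyond the BMP, such as U+1F681, occupies two char values in a Java string, so per-char processing would split it. The class name SupplementaryDemo is made up for this illustration and nothing here touches PDFBox.

```java
// Illustration only; shows how supplementary code points behave in Java strings.
final class SupplementaryDemo {

    // Builds the one-code-point string that would be handed to a font for encoding.
    static String toText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Counts code points by stepping with Character.charCount, as the question's loop does.
    static int countCodePoints(String s) {
        int count = 0;
        for (int offset = 0; offset < s.length(); ) {
            offset += Character.charCount(s.codePointAt(offset));
            count++;
        }
        return count;
    }
}
```

Here toText(0x1F681) has length() 2 although it holds a single code point, which is why the string passed to encode must be built from the whole code point.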

Question:

I am trying to parse the content stream of a PDF using PDFBox 2.0.0. Here is the part of the code that handles it:

InputStream is;
try {
    is = this.input.getDocumentCatalog().getPages().get(page).getContents();
} catch (IOException e) {
    e.printStackTrace();
    return;
}
BufferedReader br = new BufferedReader(new InputStreamReader(is));

String line;
do {
    try {
        line = br.readLine();
    } catch (IOException e) {
        e.printStackTrace();
        try {
            br.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        return;
    }
    if(line != null){
        System.out.println(line);
    }
}while(line != null);

The problem occurs when I reach a "(someString) Tj" line. Here is an example of the output my code returns:

BT
/F2 7.0866 Tf
0 Tr
7.0866 TL
0.001 Tc
65 Tz
0 0 Td
(
ET

As you can see, the "(someString) Tj" line became "(" ... In Eclipse's debug mode, when the program reaches this line, the "line" variable contains the following value:

"(

(with a " at the beginning and nothing after the '(', unlike any other string, which terminates with a second "). If I expand the String value, I get the following array of chars:

[0] (   
[1] 
[2] %   
[3] 
[4] $   
[5] 
[6] 
[7] 
[8] 
[9] )
[10]T   
[11]j   

Some of the empty cells return a "void" value (which raises a "Generated value (void) is not compatible with declared type (char)" error in Eclipse); others contain unintelligible characters. I think the problem comes from a wrong character encoding, but I can't find a solution.

I have already tried some things like

line = new String(br.readLine().getBytes("UTF-8"), "UTF-8");

or so, but since I'm not really sure what the problem is, it's really hard to solve it.

Can someone explain to me what the problem is and how to solve it, please?

Thanks for your help.


Answer:

The problem

Can someone explain to me what the problem is

The problem is that you try to treat the content stream as if it consists of pure textual data in some single standard encoding.

This is wrong.

While indeed the operators and numeric parameters are given in an ASCII'ish form, the content of string parameters of text showing operators may be encoded in ways that are completely unlike ASCII'ish data (let alone UTF-8-encoded ones).

To quote the specification:

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font's encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".

(section 9.4.3 Text-Showing Operators of ISO 32000-1)

If standard encodings are used, these font-specific encodings may resemble ASCII, Latin-1, or similar encodings, but especially in the case of partially embedded fonts you often find ad-hoc encodings unrelated to any known encoding.

Thus, to properly parse content streams, you have to treat them as binary data and interpret the string operands according to the encoding of the current font at that very position in the content stream.
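
As a minimal sketch of what interpreting a string operand "according to the encoding of the current font" means for a simple (single-byte) font, consider the following. The code-to-Unicode table is invented for illustration; in a real PDF it would come from the font's encoding or ToUnicode information, and composite fonts need multi-byte handling that this sketch does not attempt. The class name SimpleFontDecoder is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: decodes string operand bytes through a per-font lookup table.
final class SimpleFontDecoder {

    // Character code (one byte, 0..255) -> Unicode code point.
    private final Map<Integer, Integer> codeToUnicode = new HashMap<>();

    void map(int code, int unicode) {
        codeToUnicode.put(code, unicode);
    }

    // Decodes a string operand byte by byte; unmapped codes become U+FFFD.
    String decode(byte[] operand) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            int code = b & 0xFF; // bytes are signed in Java
            sb.appendCodePoint(codeToUnicode.getOrDefault(code, 0xFFFD));
        }
        return sb.toString();
    }
}
```

With an ad-hoc table that maps code 1 to 'H' and code 2 to 'i', the operand bytes {1, 2} decode to Hi, whereas reading the same bytes through a BufferedReader yields two control characters, much like the output shown in the question.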

A solution

how to solve it

In PDFBox there are classes that already interpret content streams and try to find Unicode string representations for drawn text.

You, therefore, may want to look at

  • the PDFTextStripper class, which is the basic PDFBox text extraction class;
  • the classes derived from PDFTextStripper which present special text extraction problem solutions, e.g. for extraction of text from a given area on the page;
  • the classes PDFTextStripper is derived from, which present a generic content stream parsing framework; and
  • the PDFBox example classes focusing on all of the above which illustrate their usage.

From a follow-up comment of the OP:

I choose this approach to extract the PDF's content because what I want to extract isn't some text but vector-made schemas. The text I try to extract in this particular problem is the variables that are link to specifics parts of the schema. That's why I can't really use 'PDFTextStripper', since I need global information on the vectors that are around the text I extract. But maybe my approach is wrong from the beginning ...

To properly parse those texts, you do have to do something similar to what the text stripper does, and I would propose not to reinvent the wheel.

PDFTextStripper extends the class PDFTextStreamEngine which in turn extends PDFStreamEngine.

PDFStreamEngine is a class which processes a PDF content stream and executes certain operations; it provides a callback interface for clients that want to do things with the stream.

PDFTextStreamEngine is the PDFStreamEngine subclass for advanced processing of text via TextPosition.

You might want to extend one of the latter two classes for your task and create and register callbacks for vector graphics operations. These callbacks can collect the vector graphics operations you need. The parallel callbacks for textual data provide the variables that are link to specifics parts.

The use of these classes may introduce a certain amount of complexity and you will have to study them a bit, but as soon as you have understood their inner workings, they quite likely will turn out to be exactly the base you need.

Question:

I use Apache PDFBox to parse text from a PDF file. I am trying to get the line that follows a specific line.

PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else{
    log.info("File is encrypted!");
}
document.close();

Sample:

Sentence 1, nth line of file

Needed line

Sentence 3, n+2th line of file

I tried to get all the lines from the file into an array, but this is unreliable because I am unable to filter for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution. Solution 1:

String[] lines = myString.split(System.getProperty("line.separator"));

Solution 2:

String neededline = (String) FileUtils.readLines(file).get("n+2th")

Answer:

In fact, the source code for the PDFTextStripper class uses the exact same line ending as you, so your first attempt is as close to correct as possible using PDFBox.

You see, the PDFTextStripper getText method calls the writeText method which just writes to an output buffer line by line with the writeString method in the exact same way as you have already tried. The result returned from this method is the buffer.toString().

Therefore, given a well formatted PDF, it would seem the question you are really asking is how to filter an array for specific text. Here are some ideas:

First, you capture the lines in an array like you said.

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    static String[] lines;

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("my2.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        lines = text.split(System.getProperty("line.separator"));
        document.close();
    }
}

Here's a method to get a complete String by any line number index, easy:

// returns a full String line by number n
static String getLine(int n) {
    return lines[n];
}

Here's a linear search method that finds a string match and returns the first line number where found.

// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for(String line : lines) {
        if(line.indexOf(filter) != -1) {
            return n;
        }
        n++;
    }
    return -1;
}

With the above, it is possible to print a line by its number:

System.out.println(getLine(8)); // line 8 for example

Or the entire line that contains your search string:

System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);

This all seems pretty straightforward, but it works only under the assumption that the text can be split into an array of lines by the line separator. If the solution is not as simple as the above ideas, I believe the source of your problem may not be in your implementation with PDFBox but rather in the PDF source you are trying to text mine.

Here's a link to a tutorial that also does what you are trying to do:

https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/

Again, same approach...
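
For the original goal, getting the line that follows a known line, the two helpers can be combined into one step. The following is a small sketch on plain strings, assuming text is what stripper.getText(document) returned; the class and method names are made up here, and the split accepts both \n and \r\n rather than relying on the platform line separator.

```java
// Illustration only; operates on text that has already been extracted.
final class LineAfter {

    // Returns the line following the first line that contains marker, or null if there is none.
    static String lineAfter(String text, String marker) {
        String[] lines = text.split("\\r?\\n");
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(marker)) {
                return lines[i + 1];
            }
        }
        return null;
    }
}
```

A null result means either that the marker was not found or that it sat on the last line, so callers should check for it before using the value.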

Question:

I am using the .NET version of PDFBox to extract text from a PDF along with the text location. While searching, I found the following Java code:

PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        super.writeString(text, textPositions);

        TextPosition firstProsition = textPositions.get(0);
        TextPosition lastPosition = textPositions.get(textPositions.size() - 1);
        writeString(String.format("[%s - %s / %s]", firstProsition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstProsition.getYDirAdj()));
    }
};
stripper.setSortByPosition(true);
return stripper.getText(document);

I converted it to .net in the following way:

class PDFTextLocationStripper : PDFTextStripper
{
    public string textWithPostion = "";
    protected override void processTextPosition(TextPosition text)
    {
            textWithPostion += "String[" + text.getXDirAdj() + "," +
            text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
            text.getXScale() + " height=" + text.getHeightDir() + " space=" +
            text.getWidthOfSpace() + " width=" +
            text.getWidthDirAdj() + "]" + text.getCharacter();
    }

    protected override void writeString(java.lang.String text, java.util.List textPositions) 
    {
            base.writeString(text, textPositions);
            TextPosition firstProsition = (TextPosition)textPositions.get(0);
            TextPosition lastPosition =(TextPosition) textPositions.get(textPositions.size() - 1);
            writeString(String.Format("[%s - %s / %s]", firstProsition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstProsition.getYDirAdj()));
    }

}

But I get the following compilation errors for the above code:

Error 1 No overload for method 'writeString' takes 2 arguments

Error 2 'PDFTextLocationStripper.writeString(java.lang.String, java.util.List)': no suitable method found to override

So, how do I override the writeString method so that I can extract text along with its location?


Answer:

Since I wasn't able to override the writeString method, I used processTextPosition to extract words from a PDF along with their positions. Here is the code:

class PDFTextLocationStripper : PDFTextStripper
    {
        public string textWithPostion = "";
        public Dictionary<float, Dictionary<float, PdfWord>> pdfWordsByXByY;

        public PDFTextLocationStripper(): base()
        {
            try
            {
                textWithPostion = "";
                pdfWordsByXByY = new Dictionary<float, Dictionary<float, PdfWord>>();
            }
            catch (Exception ex)
            {

            }
        }

        protected override void processTextPosition(TextPosition text)
        {
            try
            {
                float textX = text.getXDirAdj();
                float textY = text.getYDirAdj();
                if (!String.IsNullOrWhiteSpace(text.getCharacter()))
                {
                    if (pdfWordsByXByY.ContainsKey(textY))
                    {
                        Dictionary<float, PdfWord> wordsByX = pdfWordsByXByY[textY];
                        if (wordsByX.ContainsKey(textX))
                        {
                            PdfWord word = wordsByX[textX];
                            wordsByX.Remove(word.Right);
                            word.EndCharWidth = text.getWidthDirAdj();
                            word.Height = text.getHeightDir();
                            word.EndX = textX;
                            word.Text += text.getCharacter();
                            if (!wordsByX.Keys.Contains(word.Right))
                            {
                                wordsByX.Add(word.Right, word);
                            }
                        }
                        else
                        {
                            float requiredX = -1;
                            float minDiff = float.MaxValue;
                            for (int index = 0; index < wordsByX.Keys.Count; index++)
                            {
                                float key = wordsByX.Keys.ElementAt(index);
                                float diff = key - textX;
                                if (diff < 0)
                                {
                                    diff = -diff;
                                }
                                if (diff < minDiff)
                                {
                                    minDiff = diff;
                                    requiredX = key;
                                }
                            }
                            if (requiredX > -1 && minDiff <= 1)
                            {
                                PdfWord word = wordsByX[requiredX];
                                wordsByX.Remove(requiredX);
                                word.EndCharWidth = text.getWidthDirAdj();
                                word.Height = text.getHeightDir();
                                word.EndX = textX;
                                word.Text += text.getCharacter();
                                if (!wordsByX.ContainsKey(word.Right))
                                {
                                    wordsByX.Add(word.Right, word);
                                }
                            }
                            else
                            {
                                PdfWord word = new PdfWord();
                                word.Text = text.getCharacter();
                                word.EndX = word.StartX = textX;
                                word.Y = textY;
                                word.EndCharWidth = word.StartCharWidth = text.getWidthDirAdj();
                                word.Height = text.getHeightDir();
                                if (!wordsByX.ContainsKey(word.Right))
                                {
                                    wordsByX.Add(word.Right, word);
                                }
                                pdfWordsByXByY[textY] = wordsByX;
                            }
                        }
                    }
                    else
                    {
                        Dictionary<float, PdfWord> wordsByX = new Dictionary<float, PdfWord>();
                        PdfWord word = new PdfWord();
                        word.Text = text.getCharacter();
                        word.EndX = word.StartX = textX;
                        word.Y = textY;
                        word.EndCharWidth = word.StartCharWidth = text.getWidthDirAdj();
                        word.Height = text.getHeightDir();
                        wordsByX.Add(word.Right, word);
                        pdfWordsByXByY.Add(textY, wordsByX);
                    }
                }
            }
            catch (Exception ex)
            {

            }
        }
    }

And here is the PdfWord class.

 class PdfWord
    {
        public float StartX { get; set; }
        public float EndX { get; set; }
        public float Y { get; set; }
        public float StartCharWidth { get; set; }
        public float EndCharWidth { get; set; }
        public float Height { get; set; }
        public string Text { get; set; }
        public float Right { get { return EndX + EndCharWidth; } }
    }
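
The core idea of the code above, appending a glyph to an existing word when its left edge lies within 1 unit of that word's right edge on the same line, can be sketched in Java independently of PDFBox. This is a simplified illustration and not a port: it scans a list linearly and takes the first match instead of using nested dictionaries and the nearest key, and all names (WordGrouper, Word, add) are hypothetical. In real use, the arguments would come from TextPosition getters such as getXDirAdj(), getYDirAdj(), and getWidthDirAdj().

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: groups positioned glyphs into words by horizontal adjacency.
final class WordGrouper {

    static final class Word {
        final StringBuilder text = new StringBuilder();
        float startX;
        float right; // right edge of the last glyph added
        float y;
    }

    private final List<Word> words = new ArrayList<>();
    private final float tolerance;

    WordGrouper(float tolerance) {
        this.tolerance = tolerance;
    }

    // Adds one glyph; x is its left edge, y its line position, width its advance.
    void add(String glyph, float x, float y, float width) {
        if (glyph.trim().isEmpty()) {
            return; // whitespace is skipped, which leaves a gap that separates words
        }
        for (Word w : words) {
            if (w.y == y && Math.abs(w.right - x) <= tolerance) {
                w.text.append(glyph);
                w.right = x + width;
                return;
            }
        }
        Word w = new Word();
        w.text.append(glyph);
        w.startX = x;
        w.y = y;
        w.right = x + width;
        words.add(w);
    }

    List<Word> words() {
        return words;
    }
}
```

Like the C# version, this relies on exact equality of the y values, so text whose baseline shifts slightly within a line would end up in separate words.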

Question:

I need to read the strings from a PDF file and replace them with Unicode text. If the replacement consists of ASCII characters, everything is fine, but with Unicode characters it shows question marks or junk text. The font file (ttf) is not the problem: I am able to write Unicode text to the PDF file with a different class (PDFContentStream). With that class, however, there is no option to replace text; we can only add new text.

Sample unicode text

Bɐɑɒ

issue (Address column)

https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing

I am using PDFBox. Please help me with this.....

check the code I am using.....

public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
        throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }

    for (PDPage page : document.getPages()) {

        PDResources resources = new PDResources();
        PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
        //PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
        resources.add(font);
        //resources.add(font2);
        //resources.add(PDType1Font.TIMES_ROMAN);
        page.setResources(resources);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();

        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;

                String pstring = "";
                int prej = 0;

                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operator and that is the string to display so lets update that
                    // operator
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();

                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }

                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());

                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }

        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }

    return document;
}

Answer:

Your code utterly breaks the PDF, cf. the Adobe Preflight output:

The cause is obvious, your code

PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);

drops the pre-existing page Resources and your replacement contains only a single font the name of which you allow PDFBox to choose arbitrarily.

You must not drop existing resources as they are used in your document.


Inspecting the content of your PDF page it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 either is a single byte encoding or a mixed single/multi-byte encoding; the lower single byte values appear to be encoded ASCII-like.

I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task

to read the strings from PDF file and replace it with the Unicode text

cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.


What you can implement instead is:

  • First run your source PDF through a customized text stripper which instead of extracting the plain text searches for your strings to replace and returns their positions. There are numerous questions and answers here that show you how to determine coordinates of strings in text stripper sub classes, a recent one being this one.
  • Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resources, obviously), replacing the strings by equally long strings of spaces, might work even if it is a dirty hack.
  • Finally add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources.
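
The "equally long strings of spaces" step of the second bullet can be sketched at the byte level as follows. This assumes a single-byte, ASCII-like font encoding in which code 0x20 is a space, and it assumes the searched string is not split across several strings of a TJ array; neither is guaranteed for arbitrary PDFs. The class name ByteBlanker is hypothetical, and the resulting array is what would be handed to COSString.setValue in code like that of the question.

```java
import java.util.Arrays;

// Illustration only: length-preserving blanking of a byte pattern.
final class ByteBlanker {

    // Returns a copy of data in which every occurrence of needle is overwritten with 0x20 bytes.
    static byte[] blankOut(byte[] data, byte[] needle) {
        byte[] result = data.clone();
        if (needle.length == 0) {
            return result;
        }
        for (int i = 0; i + needle.length <= data.length; ) {
            if (matchesAt(data, needle, i)) {
                Arrays.fill(result, i, i + needle.length, (byte) 0x20);
                i += needle.length;
            } else {
                i++;
            }
        }
        return result;
    }

    private static boolean matchesAt(byte[] data, byte[] needle, int pos) {
        for (int j = 0; j < needle.length; j++) {
            if (data[pos + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Keeping the length unchanged is only a convenience for this hack; the blanked text still occupies its original width, which is what leaves room for the replacement drawn afterwards.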

Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

Question:

I am creating PDF documents from user inputs that are UTF-8.

Beyond displaying the PDFs, the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.

Most answers here point to "using a font with better UTF-8 support", but as I have no control over user inputs, this UTF-8 support is never going to be good enough and I need a bullet proof solution (as in print something rather than error out).

The answer Using PDFBox to write unicode strings to a PDF suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All examples seem to point at code that has since been removed (font.setToUnicode or some method in the encoding to convert characters one at a time).

So in a nutshell: I have a string, and I want a bulletproof method to write most of it to a PDFBox document (obviously, characters missing from the font will be replaced or not printed).

Many thanks, JM


Answer:

I ended up doing a character-by-character sanitization.

Here is what my sanitization function looks like.

To avoid reprocessing characters, I am caching the availability of each character for each given font.

When a code point is not available in a font I am trying the "standard" replacement character and if it is not available I am replacing with a question mark.

It is indeed inefficient, but I have not found another more efficient way to do this bearing in mind that I have no control and no advance knowledge of what is being printed.

There might be a lot of things to improve but this works for my use case.

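// Availability cache keyed by a hash of font name and code point.
// Assumption: the original snippet used this field without showing its declaration
// (requires java.util.Map and java.util.HashMap).
private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();

private String getPrintableString(String string, PDFont font) {

    StringBuilder sb = new StringBuilder();

    // Advance by whole code points so that surrogate pairs are not split.
    for (int i = 0; i < string.length(); ) {

        int codePoint = string.codePointAt(i);
        i += Character.charCount(codePoint);

        // Line feeds are passed through unchanged.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            continue;
        }

        String fontName = font.getName();
        int cpKey = fontName.hashCode();
        cpKey = 31 * cpKey + codePoint;

        if (codePointAvailCache.get(cpKey) == null) {

            try {
                // Encode the complete code point, not just its first char.
                font.encode(new String(Character.toChars(codePoint)));
                codePointAvailCache.put(cpKey, true);
            } catch (Exception e) {
                codePointAvailCache.put(cpKey, false);
            }
        }

        if (!codePointAvailCache.get(cpKey)) {

            // Need to make sure our font has a replacement character
            try {
                codePoint = 0xFFFD;
                font.encode(new String(new int[] { codePoint }, 0, 1));
            } catch (Exception e) {
                codePoint = 0x003F;
            }
        }

        sb.appendCodePoint(codePoint);
    }

    return sb.toString();
}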
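
A variant of the same approach that can be exercised without a PDF at hand takes the encodability check as a parameter. This is a sketch under assumptions, not PDFBox API: ReplacingSanitizer and canEncode are hypothetical names, one instance is meant to be used per font (so the cache needs no font name in its key), and in real use the predicate would wrap font.encode(...) in a try/catch as in the function above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Illustration only: replaces code points the font cannot encode, with memoized checks.
final class ReplacingSanitizer {

    private final IntPredicate canEncode; // assumed to answer "can the font encode this code point?"
    private final Map<Integer, Boolean> cache = new HashMap<>();
    private final int fallback;

    ReplacingSanitizer(IntPredicate canEncode) {
        this.canEncode = canEncode;
        // Decide once whether U+FFFD is usable; otherwise fall back to a question mark.
        this.fallback = canEncode.test(0xFFFD) ? 0xFFFD : 0x003F;
    }

    String sanitize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            // Line feeds pass through unchanged; everything else is checked once and cached.
            boolean ok = cp == 0x000A || cache.computeIfAbsent(cp, key -> canEncode.test(key));
            sb.appendCodePoint(ok ? cp : fallback);
        });
        return sb.toString();
    }
}
```

With a predicate that accepts only code points below 0x80, sanitize("a\u039Bb") returns a?b: the capital lambda from the error message in the question is replaced instead of raising an exception.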

Question:

For example, a pdf file contains this.

Name: John Smith
Birth Date: December 21, 1990

Using Java with pdfbox, can anyone give me a simple code to put 'John Smith' on a variable name 'name' and 'December 21, 1990' to 'bdate'?


Answer:

As you have not shared a specific PDF, it is difficult to supply specific code. In general, though:

Text extraction

You can extract the text of a document like this:

PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(resource);
String text = stripper.getText(document);

Now you can analyze the text like any other String.
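
For the two fields in the question, that analysis could be a regular expression per label. This is a minimal sketch that assumes the extracted text really contains each label and its value on one line, which depends on the PDF; the class name FieldExtractor and the method valueOf are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only; works on text that has already been extracted.
final class FieldExtractor {

    // Returns the trimmed remainder of the first line of the form "<label>: value", or null if absent.
    static String valueOf(String text, String label) {
        Pattern p = Pattern.compile(
                "^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.*?)[ \\t]*$",
                Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Usage would then be String name = FieldExtractor.valueOf(text, "Name"); and String bdate = FieldExtractor.valueOf(text, "Birth Date");.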

Text extraction limitations

PDF is a format that is not primarily meant for automatic content processing; it was originally meant to be displayed identically to a human on different output devices. Thus, making the content available in an intelligible format to a program is not required, and numerous PDFs do not include the information required for text extraction short of OCR.