Hot questions for using PDFBox in Groovy

Question:

I have a PostScript sample that illustrates creating a form. If I convert the PostScript to PDF, I can enumerate the FormXObject quite easily, but how do I get access to its content? For example:

/SForm <<     
    /FormType 1              % all forms are FormType 1
    /Matrix [ 1 0 0 1 0 0]   % no scaling or translating
    /BBox [ 0 -10 100 100 ]  % hack - should really calculate the width of the string 
                             % and the height of the font allowing for descenders etc
    /PaintProc {
        pop
        0 0 moveto           % assume that the translate has set the current point
        (XObject String) show
        0 24 moveto
        (Line Two) show        
    } bind   
>> def

This translates to:

7 0 obj
<</Type/XObject/Subtype/Form/FormType 1/BBox[0 -10 100 100]/Resources 6 0 R/Matrix[1 0 0 1 0 0]/Length 98>>
stream
/GS1 gs
BT
/F1 1 Tf
11 0 0 11 0 0 Tm
0 g
0 Tc
0 Tw
(XObject String)Tj
0 2.1818 TD
(Line Two)Tj
ET

endstream
endobj

How can I obtain the information between stream and endstream? I had assumed that this would be a relatively simple operation, but I've not managed to retrieve the content. If I use something like the following (in my Groovy code), I get the information between << >> (the dictionary) but not the PDF operators that do the actual marking (from the PostScript PaintProc).

 // Fetch the page resources once so the loop and the lookup use the same object
 def pdResources = pdDoc.getPage(pageNum).getResources()
 for (COSName name : pdResources.getXObjectNames()) {
    def xObject = pdResources.getXObject(name)
    if (xObject instanceof PDFormXObject) {
       println xObject.getContentStream().dump()
    }
 }

Actually, it would suit my purpose to get the content between the BT and ET operators. The main focus is to find the "definition" of the FormXObject along with its content and not really to explore where the FormXObject is used in the page content.

Obviously I have overlooked something but what? Thanks in advance.


Answer:

xObject.getContents() returns an InputStream from which you can read the stream contents.
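
As a rough illustration of reading that stream, the sketch below drains an InputStream and pulls out whatever sits between BT and ET. It is plain Java (which Groovy should also accept) and uses only the JDK, so a ByteArrayInputStream holding the content stream from the question stands in for the stream returned by xObject.getContents(). The class and method names (TextObjectExtractor, readAll, extractTextObjects) are made up for this example, and the regular expression is a naive assumption: it would be fooled by a literal string that happens to contain BT or ET as a separate word.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextObjectExtractor {

    // Naive assumption: BT and ET only ever appear as operators, never inside a string.
    private static final Pattern TEXT_OBJECT = Pattern.compile("(?s)\\bBT\\b(.*?)\\bET\\b");

    // Drain the stream. ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
    public static String readAll(InputStream stream) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Return the operators found between each BT/ET pair, trimmed of surrounding whitespace.
    public static List<String> extractTextObjects(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = TEXT_OBJECT.matcher(content);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for xObject.getContents(): the content stream shown in the question.
        String sample = "/GS1 gs\nBT\n/F1 1 Tf\n11 0 0 11 0 0 Tm\n0 g\n0 Tc\n0 Tw\n"
                + "(XObject String)Tj\n0 2.1818 TD\n(Line Two)Tj\nET\n";
        InputStream stream = new ByteArrayInputStream(sample.getBytes(StandardCharsets.ISO_8859_1));
        for (String textObject : extractTextObjects(readAll(stream))) {
            System.out.println(textObject);
        }
    }
}
```

With a real document, the idea would be to pass the stream from xObject.getContents() to readAll instead of the sample bytes. For anything beyond a quick inspection, a proper content stream tokenizer is likely a safer choice than a regular expression.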

Question:

I'm trying to do something fairly simple: read an I-9 PDF form from an incoming FlowFile, parse the first and last name out of it into JSON, then write the JSON to the outgoing FlowFile.

I found no official documentation on how to do this, but someone has written up several cookbooks on scripting in NiFi in various languages. It seems pretty straightforward, and I'm fairly sure I'm following what is written there, but I'm not even sure the PDF is being read at all. The script simply passes the PDF unmodified to REL_SUCCESS every time.

import java.nio.charset.StandardCharsets
import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        //Load Flowfile contents
        PDDocument document = PDDocument.load(inputStream)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        //Get the first page
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
        PDPage page = allPages.get(0)

        //Define the areas to search and add them as search regions
        stripper = new PDFTextStripperByArea()
        Rectangle lname = new Rectangle(25, 226, 240, 15)
        stripper.addRegion("lname", lname)
        Rectangle fname = new Rectangle(276, 226, 240, 15)
        stripper.addRegion("fname", fname)
        //Load the results into a JSON
        def boxMap = [:]
        stripper.setSortByPosition(true)
        stripper.extractRegions(page)
        regions = stripper.getRegions()
        for (String region : regions) {
            String box = stripper.getTextForRegion(region)
            boxMap.put(region, box)
        }
        Gson gson = new Gson()
        //Remove random noise from the output
        json = gson.toJson(boxMap, LinkedHashMap.class)
        json = json.replace('\\n', '')
        json = json.replace('\\r', '')
        json = json.replace(',"', ',\n"')
        //Overwrite flowfile contents with JSON
        outputStream.write(json.getBytes(StandardCharsets.UTF_8))
    } catch (Exception e){
        System.out.println(e.getMessage())
        session.transfer(flowFile, REL_FAILURE)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)

EDIT: Was able to confirm that the flowFile object is being read properly by subbing a txt file in. So the problem seems to be that the inputStream is never being handed off to the PDDocument or something is happening when it does. I edited the code to try reading it into a File object first but that resulted in an error:

FlowFileHandlingException: null is not known in this session

EDIT 2: Solved by moving my try/catch. I don't fully understand why that works, but the code above has been edited and now works properly.


Answer:

session.get() can return null, so add the line if (!flowFile) return right after it. Also move the try/catch outside the session.write call. That way you can put session.transfer(flowFile, REL_SUCCESS) after the session.write (inside the try), and the catch can transfer to REL_FAILURE.

Also I can't tell from the code how the PDFTextStripperByArea works to get the info from the incoming document. It looks like all the document stuff is inside the try, so wouldn't be available to the PDFTextStripper (and isn't passed in).

None of these things explain why you're getting the original flow file on the success relationship, but maybe there's something I'm not seeing that would be magically fixed by the changes above :)

Also, if you use log.info() or log.error() rather than System.out.println, you will see the output in the NiFi logs. For log.error(), NiFi also posts a bulletin to the processor, and you can see the message by hovering over the processor's top right corner (a red square appears there when a bulletin is present).
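
The restructuring suggested at the top of this answer can be sketched as follows. This is plain Java rather than a NiFi script, so it runs on its own: FakeSession is a hypothetical stand-in for NiFi's session object, flow files are modelled as plain strings, and the relationships are just the strings "success" and "failure". Only the shape of the control flow is meant to carry over: return early on null, wrap session.write in the try, transfer to success inside the try, and transfer to failure in the catch.

```java
import java.util.ArrayList;
import java.util.List;

public class TransferFlowSketch {

    // Hypothetical stand-in for the NiFi session; it only records which transfers happened.
    public static class FakeSession {
        private final String queued;   // null models an empty input queue
        public final List<String> transfers = new ArrayList<>();

        public FakeSession(String queued) {
            this.queued = queued;
        }

        public String get() {
            return queued;
        }

        // Models session.write: the callback either rewrites the content or throws.
        public String write(String flowFile, boolean parsingFails) {
            if (parsingFails) {
                throw new IllegalStateException("could not parse PDF");
            }
            return flowFile + " -> JSON";
        }

        public void transfer(String flowFile, String relationship) {
            transfers.add(relationship);
        }
    }

    public static void onTrigger(FakeSession session, boolean parsingFails) {
        String flowFile = session.get();
        if (flowFile == null) {
            return;                                   // nothing queued, nothing to transfer
        }
        try {
            flowFile = session.write(flowFile, parsingFails);
            session.transfer(flowFile, "success");    // reached only if write succeeded
        } catch (Exception e) {
            session.transfer(flowFile, "failure");    // the only transfer on the error path
        }
    }

    public static void main(String[] args) {
        FakeSession ok = new FakeSession("i9.pdf");
        onTrigger(ok, false);
        System.out.println(ok.transfers);             // [success]

        FakeSession bad = new FakeSession("i9.pdf");
        onTrigger(bad, true);
        System.out.println(bad.transfers);            // [failure]
    }
}
```

The point of this layout is that each flow file is transferred exactly once, whichever path is taken. In the script from the question, the catch transfers to REL_FAILURE inside the callback and the last line then transfers to REL_SUCCESS regardless.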

Question:

I'm trying to use PDFBox and compile with Groovy, but I don't know how.

This is the command I use:

groovyc main.groovy

It doesn't work. Please help. Here is my code:

this.class.classLoader.rootLoader.addURL(
   new URL("/usr/share/groovy/lib/pdfbox-2.0.11.jar"))



import org.apache.pdfbox.util.Splitter
import org.apache.pdfbox.pdmodel.PDDocument

class Main {
    static void main(String[] args) {
        File pdfFile = new File(args[0])
        PDDocument doc = new PDDocument().load(pdfFile)

        Splitter splitter = new Splitter()
        def count = 0
        splitter.split(doc).eachWithIndex { v, i ->
            v.save(pdfFile.path[0..-5] + '_' + i.toString().padLeft(3, '0') + '.pdf')
            v.close()
        }
    }
}

Answer:

Change

import org.apache.pdfbox.util.Splitter

to

import org.apache.pdfbox.multipdf.Splitter

(see javadoc)

Also make sure to include the needed dependencies, i.e. fontbox and commons-logging, and possibly more.
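
Separately from the import, the jar location in the question is passed to new URL(...) without a scheme, and java.net.URL rejects such a string with a MalformedURLException. A minimal sketch of one way around that, using only the JDK, is to build the URL from a File. The class and method names (JarUrlSketch, toJarUrl, parsesAsUrl) are made up for illustration. The sketch only shows how to form the URL; whether adding it to the root loader at run time is enough for groovyc to resolve the imports is a separate matter, and putting the jars on the compiler's classpath is probably the more reliable route.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public class JarUrlSketch {

    // Build a file: URL from a filesystem path instead of passing the bare path to new URL(...).
    public static URL toJarUrl(String path) {
        try {
            return new File(path).toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True when new URL(spec) accepts the string exactly as written.
    public static boolean parsesAsUrl(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String path = "/usr/share/groovy/lib/pdfbox-2.0.11.jar";
        System.out.println(parsesAsUrl(path));   // false: the string has no scheme
        System.out.println(toJarUrl(path));      // a file: URL pointing at the same jar
    }
}
```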