Hot questions for using PDFBox on Android

Question:

I am drawing an image on one of the pages of a PDF. When I create the content stream with PDPageContentStream stream = new PDPageContentStream(doc, page); and draw the image, everything works fine.

But when I create it with the constructor PDPageContentStream(doc, page, true, true); and draw the image, the newly added image appears upside down.

I don't understand what is going wrong here.

PS: I am using the PdfBox-Android library.


Answer:

Use the constructor that takes a fifth parameter, resetContext, and pass true to reset the graphics context:

public PDPageContentStream(PDDocument document, PDPage sourcePage, boolean appendContent, 
                            boolean compress, boolean resetContext) throws IOException

Alternatively, save and restore the graphics state in the first content stream by calling:

saveGraphicsState();
// ...
restoreGraphicsState();
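
To illustrate why resetting the context helps, here is a toy model of a graphics state stack. It is not PDFBox code: the class GraphicsStateDemo and its methods are hypothetical, and it assumes the existing page content leaves a vertical flip in the current transformation matrix, which the appended stream then inherits.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a PDF graphics state stack; hypothetical, not PDFBox code. */
final class GraphicsStateDemo {
    // Only the vertical scale of the current transformation matrix is modeled.
    private float yScale = 1f;
    private final Deque<Float> saved = new ArrayDeque<>();

    void saveGraphicsState() { saved.push(yScale); }       // like the PDF "q" operator
    void restoreGraphicsState() { yScale = saved.pop(); }  // like the PDF "Q" operator
    void flipVertically() { yScale = -yScale; }            // like "1 0 0 -1 0 0 cm"

    /** True if an image drawn in the current state would appear upside down. */
    boolean drawsUpsideDown() { return yScale < 0; }

    /** Simulates existing content that flips the axes, then reports the state an appended stream sees. */
    static boolean appendedImageIsFlipped(boolean wrapExistingContent) {
        GraphicsStateDemo gs = new GraphicsStateDemo();
        if (wrapExistingContent) gs.saveGraphicsState();
        gs.flipVertically(); // assumption: the existing page content changes the transform
        if (wrapExistingContent) gs.restoreGraphicsState();
        return gs.drawsUpsideDown();
    }

    public static void main(String[] args) {
        System.out.println(appendedImageIsFlipped(false)); // prints "true"
        System.out.println(appendedImageIsFlipped(true));  // prints "false"
    }
}
```

In this model, wrapping the existing content in a save/restore pair plays the same role as passing true for resetContext in the five-argument constructor shown above: the appended drawing starts from an unmodified state.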

Question:

I am trying to extract only the highlighted text from a PDF document. It works on a PC, but it fails on Android. PDFBox doesn't work directly on Android, so I am using Birdbrain2/PdfBox-Android.

Here is the PC code that works:

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class ExtractHighlights {
    public static void main(String args[]) {        
        System.out.println(extractHighlights("sample.pdf"));
    }

    public static String extractHighlights(String fileName){
        String extractedText = "";
        try {
            PDDocument pddDocument = PDDocument.load(new File(fileName));
            List allPages = pddDocument.getDocumentCatalog().getAllPages();         
            for (int i = 0; i < allPages.size(); i++) {             
                PDPage page = (PDPage) allPages.get(i);
                List<PDAnnotation> la = page.getAnnotations();
                if (la.size() < 1) {
                    continue;
                }

                for (PDAnnotation pda : la) {
                    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                    stripper.setSortByPosition(true);

                    PDRectangle rect = pda.getRectangle();
                    float x = rect.getLowerLeftX();
                    float y = rect.getUpperRightY();
                    float width = rect.getWidth();
                    float height = rect.getHeight();
                    int rotation = page.findRotation();
                    if (rotation == 0) {
                        PDRectangle pageSize = page.findMediaBox();
                        y = pageSize.getHeight() - y;
                    }

                    Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y,
                            width, height);
                    stripper.addRegion("0", awtRect);
                    stripper.extractRegions(page);  
                    String highlight = stripper.getTextForRegion("0").trim();
                    if(highlight.length() == 0) continue;
                    extractedText += highlight.substring(0,highlight.length()-2)+" ";
                }               
            }
            pddDocument.close();
            //System.out.println(extractedText);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return extractedText;
    }
}

Here is the Android code that doesn't work:

@Override
        protected String doInBackground(String... strings) {
            String extractedText = "";

            try {
                Log.i("ExtractHighlights","Started");
                PDDocument pddDocument = PDDocument.load(new File(strings[0]));
                PDPageTree allPages = pddDocument.getDocumentCatalog().getPages();
                int totalPages = allPages.getCount();
                int pageNumber = 0;
                for (PDPage page:allPages) {
                    publishProgress(pageNumber++,totalPages);
                    Log.i("ExtractHighlights", "Reading page");
                    List<PDAnnotation> la = page.getAnnotations();
                    if (la.size() < 1) {
                        continue;
                    }

                    for (PDAnnotation pda : la) {
                        Log.i("ExtractHighlights","Annotation found");
                        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                        stripper.setSortByPosition(true);

                        Log.i("ExtractHighlights","Getting rectangle");
                        PDRectangle rect = pda.getRectangle();
                        float x = rect.getLowerLeftX();
                        float y = rect.getUpperRightY();
                        float width = rect.getWidth();
                        float height = rect.getHeight();

                        RectF region = new RectF(x,y,width,height);
                        stripper.addRegion("0",region);
                        Log.i("ExtractHighlights","Extracting regions");
                        stripper.extractRegions(page);
                        Log.i("ExtractHighlights","Getting text from region");
                        String highlight = stripper.getTextForRegion("0").trim();
                        Log.i("ExtractHighlights",highlight);
                        if(highlight.length() == 0) continue;
                        extractedText += highlight.substring(0,highlight.length()-2)+" ";
                    }
                    Log.i("ExtractHighlights","Page done");
                }
                pddDocument.close();
                Log.i("ExtractHighlights","Document closed");
            } catch (Exception ex) {
                ex.printStackTrace();
            }

            return extractedText;
        }

It also takes a very long time on Android, and Android assumes that the program has crashed.

I could convert the entire PDF to text, but then how would I know which text is highlighted?


Answer:

As @mkl pointed out, it is a known bug. So I used this guy's PDFBox port. It worked!
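
Separately from the port bug, the two snippets in the question build the region differently. The PC version flips y against the page height before building the rectangle, and Android's RectF constructor takes (left, top, right, bottom) rather than (x, y, width, height). The sketch below shows the conversion as a plain function. RegionConverter and toTopLeftRegion are hypothetical names, and it is an assumption that the port's addRegion expects a top-left origin like the desktop version; rotation and media boxes that do not start at the origin are ignored.

```java
/** Hypothetical helper; not part of PDFBox or PdfBox-Android. */
final class RegionConverter {
    /**
     * Converts an annotation rectangle given in PDF coordinates (origin at the
     * bottom-left of the page) into {left, top, right, bottom} with the origin
     * at the top-left of the page.
     */
    static float[] toTopLeftRegion(float lowerLeftX, float upperRightY,
                                   float width, float height, float pageHeight) {
        float left = lowerLeftX;
        float top = pageHeight - upperRightY; // same flip as in the PC code
        return new float[] { left, top, left + width, top + height };
    }

    public static void main(String[] args) {
        // A 200 x 20 highlight whose upper edge is at y = 700 on a 792-point-high page.
        float[] r = toTopLeftRegion(100f, 700f, 200f, 20f, 792f);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
        // prints "100.0, 92.0, 300.0, 112.0"
    }
}
```

On Android, the four values could then be passed as new RectF(r[0], r[1], r[2], r[3]).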

Question:

Using the Android PDFBox API, I am rendering page thumbnails to show in a PDF page picker component.

public Bitmap getPdfPageThumb (int pageIndex) {
    try {
        PDRectangle pageBox = pdfDoc.getPage(pageIndex).getBBox();
        float targetDpi = Math.max(
                targetWidth * 72f / pageBox.getWidth(),
                targetHeight * 72f / pageBox.getHeight());
        return renderer.renderImageWithDPI(pageIndex, targetDpi);
    }
    catch (Exception e) {
        return null;
    }
}

The resulting thumbnails have ugly artefacts that look like misplaced shape points. Is there a way to avoid this?

Thanks


Answer:

As mentioned in the comments, this is a PDFBox bug on Android. The solution is to use the PdfiumAndroid library, which renders pages much more cleanly.
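
Whichever renderer is used, the DPI calculation from the question can be checked on its own. The sketch below is unrelated to the rendering bug and uses hypothetical names (ThumbnailDpi, dpiToCover, dpiToFit). It relies only on the fact that PDF page sizes are measured in points, with 72 points per inch. Taking the maximum of the two ratios, as the question does, gives a bitmap that covers the target box; taking the minimum gives one that fits inside it.

```java
/** Hypothetical helper for choosing a render DPI; not part of any PDF library. */
final class ThumbnailDpi {
    private static final float POINTS_PER_INCH = 72f;

    /** DPI at which the rendered page is at least targetWidth x targetHeight pixels. */
    static float dpiToCover(float pageWidthPt, float pageHeightPt,
                            int targetWidth, int targetHeight) {
        return Math.max(targetWidth * POINTS_PER_INCH / pageWidthPt,
                        targetHeight * POINTS_PER_INCH / pageHeightPt);
    }

    /** DPI at which the rendered page is at most targetWidth x targetHeight pixels. */
    static float dpiToFit(float pageWidthPt, float pageHeightPt,
                          int targetWidth, int targetHeight) {
        return Math.min(targetWidth * POINTS_PER_INCH / pageWidthPt,
                        targetHeight * POINTS_PER_INCH / pageHeightPt);
    }

    public static void main(String[] args) {
        // A 600 x 800 point page rendered for a 300 x 300 pixel thumbnail slot.
        System.out.println(dpiToCover(600f, 800f, 300, 300)); // prints "36.0"
        System.out.println(dpiToFit(600f, 800f, 300, 300));   // prints "27.0"
    }
}
```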