How can I check whether a PDF page is an image (scanned) using PDFBox or Xpdf?

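PDFBox problem when extracting images. Hi, how can I check whether a PDF page is an image and extract it with the PDFBox library? There is a method for getting images, but if the PDF page itself is an image, it is not returned. Can someone help me solve this problem?

Xpdf problem when extracting images. I tried to extract the images with the Xpdf library, but I get a strange flip when the page is an image. If the PDF contains a small image object, the image I get is OK, but if the page is scanned, it comes out flipped.

I want to extract the images from a PDF: if a page is scanned, extract it as an image; if a page contains plain text and images, extract only the images from that page.

My point is to extract every image from the PDF. If a page is itself an image, it should be extracted as an image and not skipped, which is what I think PDFBox is doing.

Xpdf does this, but there is a problem: the image is flipped (top, right) when a scanned page is exported.

How can I solve this problem? Thanks.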

An example file for testing is available for download.

    PDDocument document = PDDocument.load(new File("/home/dru/ideaprojects2/pdfextractor/test/t1.pdf"));
    PDPageTree list = document.getPages();

    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        System.out.println(pdResources.getResourceCache());

        // Only looks at XObjects directly in the page resources
        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);

            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("/home/dru/ideaprojects2/pdfextractor/test/out/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject)o).getImage(), "png", file);
            }
        }
    }

Extract images properly

As the updated PDF makes clear, the problem is that it does not have the images immediately on the page. Instead, the page has form XObjects drawn onto it, and these in turn contain the images. Thus, the image search has to recurse into the form XObjects.

And that is not all: the pages in the updated PDF share the same resources dictionary; they merely pick a different one of the form XObjects to display. Thus, one has to parse the respective page content stream to determine which XObject (with which images) is present on a given page.
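To illustrate why the content stream matters, here is a toy sketch using only the Java standard library. In a content stream, the `Do` operator paints the XObject whose name precedes it, so two pages with identical resources can still show different XObjects. The class name `DoOperatorScan`, the sample streams, and the names `Fm0` and `Fm1` are made up for illustration. A regular expression is not a robust parser: real streams are usually compressed and need a proper tokenizer, which is what the PDFBox stream engine provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; assumes an already decoded, plain-text content stream.
class DoOperatorScan {
    // Matches "/Name Do": a PDF name followed by the Do (paint XObject) operator.
    private static final Pattern DO_OPERATOR =
            Pattern.compile("/([^\\s/\\[\\]<>(){}%]+)\\s+Do(?=\\s|$)");

    /** Returns the names of the XObjects painted by the given content stream. */
    static List<String> paintedXObjects(String contentStream) {
        List<String> names = new ArrayList<>();
        Matcher m = DO_OPERATOR.matcher(contentStream);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical pages that share one resources dictionary (Fm0, Fm1, ...).
        String page1 = "q 1 0 0 1 0 0 cm /Fm0 Do Q";
        String page2 = "q 1 0 0 1 0 0 cm /Fm1 Do Q";
        System.out.println(paintedXObjects(page1)); // [Fm0]
        System.out.println(paintedXObjects(page2)); // [Fm1]
    }
}
```

Looking only at the resources dictionary would report every form XObject for every page; the content stream is what ties one of them to a specific page.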

Actually, this is what the PDFBox tool ExtractImages does. Unfortunately, though, it does not show the page on which it found the image in question, cf. the ExtractImages.java test method testExtractPageImagesTool10948New.

But we can borrow the technique used by that tool:

    PDDocument document = PDDocument.load(resource);
    int page = 1;
    for (final PDPage pdPage : document.getPages())
    {
        final int currentPage = page;
        PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
        {
            int index = 0;

            // Called for every image painted on the page, including images inside form XObjects
            @Override
            public void drawImage(PDImage pdImage) throws IOException
            {
                if (pdImage instanceof PDImageXObject)
                {
                    PDImageXObject image = (PDImageXObject)pdImage;
                    File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
                    // try-with-resources closes the output stream after writing
                    try (FileOutputStream out = new FileOutputStream(file))
                    {
                        ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
                    }
                    index++;
                }
            }

            // The remaining callbacks are not needed for image extraction and stay empty
            @Override
            public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

            @Override
            public void clip(int windingRule) throws IOException { }

            @Override
            public void moveTo(float x, float y) throws IOException { }

            @Override
            public void lineTo(float x, float y) throws IOException { }

            @Override
            public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }

            @Override
            public Point2D getCurrentPoint() throws IOException { return null; }

            @Override
            public void closePath() throws IOException { }

            @Override
            public void endPath() throws IOException { }

            @Override
            public void strokePath() throws IOException { }

            @Override
            public void fillPath(int windingRule) throws IOException { }

            @Override
            public void fillAndStrokePath(int windingRule) throws IOException { }

            @Override
            public void shadingFill(COSName shadingName) throws IOException { }
        };
        pdfGraphicsStreamEngine.processPage(pdPage);
        page++;
    }

(ExtractImages.java test method testExtractPageImages10948New)

This code outputs images with the file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.

PS: Please remember to include com.github.jai-imageio:jai-imageio-core in your classpath; it is required for TIFF output.

Flipped images

Another issue of the OP is that the images appear flipped upside-down, e.g. in the case of the newest sample file "t1_edited.pdf". The reason is that those images are indeed stored upside-down as image resources in the PDF.

When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors the drawn image vertically and so creates the expected appearance.
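The mirroring can be reproduced with plain `java.awt.geom.AffineTransform`, whose six-argument constructor takes the values in the same a, b, c, d, e, f order as a PDF matrix. The sketch below is only an illustration: the class name `CtmFlipDemo` and the matrix values are assumptions, not taken from the sample file. An image is painted into the unit square, so a negative vertical scale sends its top edge (y = 1) to the bottom of the target area.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class CtmFlipDemo {
    /** Returns "h", "v", "hv" or "" depending on the sign of the scale factors. */
    static String flips(AffineTransform ctm) {
        String flips = "";
        if (ctm.getScaleX() < 0) flips += "h";
        if (ctm.getScaleY() < 0) flips += "v";
        return flips;
    }

    public static void main(String[] args) {
        // Hypothetical CTM: a 200x100 target area with a negative vertical scale.
        AffineTransform ctm = new AffineTransform(200, 0, 0, -100, 0, 100);

        // Image space is the unit square: y = 1 is the top edge, y = 0 the bottom edge.
        Point2D top = ctm.transform(new Point2D.Double(0, 1), null);
        Point2D bottom = ctm.transform(new Point2D.Double(0, 0), null);

        System.out.println("top edge lands at y = " + top.getY());
        System.out.println("bottom edge lands at y = " + bottom.getY());
        System.out.println("flips: " + flips(ctm));
    }
}
```

Note that a sign check on the scale factors only covers plain mirroring. For a matrix that rotates by 90 degrees, both scale entries are zero, so such cases would need the shear entries as well.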

By enhancing the drawImage implementation in the code above, one can include indicators of such flips in the names of the exported images:

    public void drawImage(PDImage pdImage) throws IOException
    {
        if (pdImage instanceof PDImageXObject)
        {
            // The CTM in effect when the image is painted tells whether it is mirrored
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            String flips = "";
            if (ctm.getScaleX() < 0)
                flips += "h";
            if (ctm.getScaleY() < 0)
                flips += "v";
            if (flips.length() > 0)
                flips = "-" + flips;
            PDImageXObject image = (PDImageXObject)pdImage;
            File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
            // try-with-resources closes the output stream after writing
            try (FileOutputStream out = new FileOutputStream(file))
            {
                ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
            }
            index++;
        }
    }

Now vertically or horizontally flipped images are marked accordingly.
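If the goal is an upright image rather than just a marker in the file name, the flags can be applied to the `BufferedImage` before writing it. The following is a minimal sketch under that assumption; `FlipCorrector` and `unflip` are hypothetical names, not part of PDFBox, and the flag string is the same "h"/"v" combination built above (without the leading dash).

```java
import java.awt.image.BufferedImage;

class FlipCorrector {
    /** Returns a copy of src mirrored according to a flag string such as "h", "v" or "hv". */
    static BufferedImage unflip(BufferedImage src, String flips) {
        boolean horizontal = flips.contains("h");
        boolean vertical = flips.contains("v");
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Read from the mirrored position when the corresponding flag is set.
                int sx = horizontal ? width - 1 - x : x;
                int sy = vertical ? height - 1 - y : y;
                out.setRGB(x, y, src.getRGB(sx, sy));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(1, 2, BufferedImage.TYPE_INT_ARGB);
        img.setRGB(0, 0, 0xFFFF0000); // red on top
        img.setRGB(0, 1, 0xFF0000FF); // blue at the bottom
        BufferedImage fixed = unflip(img, "v");
        System.out.println(Integer.toHexString(fixed.getRGB(0, 0)));
    }
}
```

In the drawImage method this could be used as `unflip(image.getImage(), flips)` before the leading dash is added. Copying pixel by pixel keeps the sketch simple but is slow for large scans, and the ARGB output may not suit every target format.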
