Read PDF Text in ColdFusion with PDFBox

Friday 24 June 2011 05:28 PM

Last week I had to extract the text content from around 1 TB of PDF files and then match each PDF with database records. First I tried the CFPDF method to strip text from the PDFs and, as usual, it worked wonderfully. But it was a bit too slow for my massive load of files: an average of 2 PDFs per second. I needed something faster. After trying out all sorts of methods, I came across an excellent open source Java PDF library called PDFBox. With that, I was able to read text from more than 15 PDF files per second.

PDFBox requires another library called FontBox to work, so here I have combined both libraries into a single jar file. I also use JavaLoader.cfc to load the jar file into ColdFusion. You can download everything at the bottom of the page. Just unzip and run.
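If you would rather not merge the jars, JavaLoader also accepts an array of paths, so the two libraries can probably be loaded side by side. A minimal sketch, assuming JavaLoader.cfc is reachable as "JavaLoader" and that the jar file names (hypothetical here) match the files you downloaded:

```cfml
<!--- Hypothetical file names; use the PDFBox and FontBox jars you actually have --->
<cfset paths = ArrayNew(1)>
<cfset paths[1] = ExpandPath("./pdfbox.jar")>
<cfset paths[2] = ExpandPath("./fontbox.jar")>

<!--- JavaLoader builds one class loader over every jar in the array --->
<cfset loader = createObject("component", "JavaLoader").init(paths)>

<!--- Same class name that the pdftotext function uses --->
<cfset Stripper = loader.create("org.pdfbox.util.PDFTextStripper")>
```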

 <cffunction name="pdftotext" access="public" returntype="string">
 <cfargument name="file"    type="string" required="yes">
 <cfargument name="StartPage"  type="string" default="">
 <cfargument name="EndPage"    type="string" default="">
 <cfargument name="loaderpath"  hint="JavaLoader.cfc path" default="JavaLoader">
 
 <cfif not StructKeyExists(Application,'Reader')>
 <cfset in  = ArrayNew(1)>
 <cfset in[1] = "#ExpandPath('./')#pdfbox.jar">
 <cfset loader    = createObject("component", loaderpath).init(in)>
 
 <cfset Application.Reader    = loader.create("org.pdfbox.pdmodel.PDDocument")>
 <cfset Application.Stripper  = loader.create("org.pdfbox.util.PDFTextStripper")>
 </cfif>
 
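 <!--- The Stripper is shared in the Application scope, so set the page range on every call; --->
 <!--- otherwise a range left by an earlier call still applies. 1 and 2147483647 (Integer.MAX_VALUE) are assumed to be the stripper's defaults. --->
 <cfif val(arguments.StartPage)>
 <cfset Application.Stripper.setStartPage(javaCast("int", val(arguments.StartPage)))>
 <cfelse>
 <cfset Application.Stripper.setStartPage(javaCast("int", 1))>
 </cfif>
 <cfif val(arguments.EndPage)>
 <cfset Application.Stripper.setEndPage(javaCast("int", val(arguments.EndPage)))>
 <cfelse>
 <cfset Application.Stripper.setEndPage(javaCast("int", 2147483647))>
 </cfif>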
 <cfset mypdf    = Application.Reader.load(arguments.file)>
 <cfset text    = Application.Stripper.getText(mypdf)>
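 <!--- Close the document that load() returned. Application.Reader is not that document, so closing it never released mypdf. --->
 <cfset mypdf.close()>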
 <cfreturn text>
 </cffunction>


Updated (June 24, 2011): The Reader and Stripper variables have moved to the Application scope. That addresses the memory issue I had with Railo.
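
Calling the function might look like the sketch below. It assumes pdftotext() is defined in, or included into, the same template, that pdfbox.jar and JavaLoader.cfc sit where the function expects them, and that sample.pdf is a hypothetical file next to the template.

```cfml
<!--- sample.pdf is a hypothetical file name --->
<cfset pdfPath = ExpandPath("./sample.pdf")>

<!--- Whole document as one string --->
<cfset allText = pdftotext(file=pdfPath)>

<!--- Only pages 1 and 2 --->
<cfset firstPages = pdftotext(file=pdfPath, StartPage=1, EndPage=2)>

<cfoutput>
<p>#Len(allText)# characters extracted in total.</p>
<pre>#HTMLEditFormat(firstPages)#</pre>
</cfoutput>
```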

Demo

Download (pdftotext.zip)

Download (CredDB2.CEF)

Posted by Saman W Jayasekara at Thursday 24 December 2009 11:34 PM . ColdFusion


5 Comments :

James Moberg

Friday 10 May 2013 12:09 PM

I have a question. (I know it's been 2 years since this has been posted.)

Is the output similar to (or better than) what you can get with CFPDF & processddx (using PDF Utils)? Does PDFBox return an array with multiple pages of text or just a single text string? (I'm guessing a single text string.)

Is the formatting/flow of the resulting text better than, or different from, the PDF Utils method?

Thanks.

Sam

Monday 13 May 2013 10:39 AM

This does not break pages into array positions. The main reason I use it is that it is fast. I feel the text string has more spaces/tabs than the PDF Utils output, and that helped me a bit when I had to fetch a specific value out of the text blob using listfind(). But that behavior could depend on the source file too.
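
To illustrate the listfind() idea: treating whitespace as list delimiters turns the text blob into a list of tokens, so the value after a known label is simply the next list element. A small sketch, with a made-up string standing in for the extracted text (the label and values are hypothetical):

```cfml
<!--- Made-up sample; real input would come from pdftotext() --->
<cfset blob = "Invoice No:#chr(9)#12345  Total:#chr(9)#99.50">

<!--- Space, tab, line feed and carriage return all act as delimiters --->
<cfset delims = " " & chr(9) & chr(10) & chr(13)>

<!--- Find the label, then take the token that follows it --->
<cfset pos = ListFindNoCase(blob, "No:", delims)>
<cfset invoiceNo = "">
<cfif pos gt 0 and pos lt ListLen(blob, delims)>
 <cfset invoiceNo = ListGetAt(blob, pos + 1, delims)>
</cfif>

<cfoutput>#invoiceNo#</cfoutput>
```

Here the label "No:" is the second token, so invoiceNo ends up holding the third token, 12345.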

Al Musella

Monday 16 May 2011 11:23 AM

Hi
I figured it out with fusionreactor. It was just using up too much memory. Some of the pdfs were large. Once it hit a certain level of memory usage, cf just froze.
I changed the design to use the PDFBOX command line utility http://pdfbox.apache.org/commandlineutilities/ExtractText.html
and just created a batch file executed via cfexecute. Worked very fast and didn't mess up the coldfusion memory.
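
For reference, the command-line route could be sketched roughly like this, calling the ExtractText utility directly rather than through a batch file. The jar path, file names and timeout are assumptions, and java is assumed to be on the server's path:

```cfml
<!--- Hypothetical paths; point these at your own jar and files --->
<cfset jarPath = ExpandPath("./pdfbox-app.jar")>
<cfset pdfPath = ExpandPath("./sample.pdf")>
<cfset txtPath = ExpandPath("./sample.txt")>

<!--- Runs in a separate JVM, so large PDFs do not use up ColdFusion's own memory --->
<cfexecute name="java"
 arguments='-jar "#jarPath#" ExtractText "#pdfPath#" "#txtPath#"'
 timeout="120"></cfexecute>

<!--- ExtractText writes the text to the output file given as its last argument --->
<cffile action="read" file="#txtPath#" variable="pdfText">
```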

Al Musella

Monday 16 May 2011 09:16 AM

Hi
Love this script but I have a strange problem. Running cfmx7.. I use this to index pdfs. I can't index them directly as I need to add stuff. So I use your script to get the text from a pdf then manipulate it.. then add it to my verity collection..
the problem is after it runs a few times, my server crashes. Usually after about 5-10 conversions. Did you see this? Any work-around?

sam

Monday 16 May 2011 11:18 AM

Not with this, but javaLoader has occasionally given me problems with other things. Try loading the jar file in the cfadmin and avoiding javaLoader.cfc. To do that, you also need to remove lines 9 and 10 above and
replace lines 12 and 13 with
<cfset Reader = CreateObject("java", "org.pdfbox.pdmodel.PDDocument")>
<cfset Stripper = CreateObject("java", "org.pdfbox.util.PDFTextStripper")>
