Read PDF Text in ColdFusion with PDFBox

Last week I had to extract the text from around 1 TB of PDF files and then match each PDF with database records. I first tried CFPDF to strip the text from the PDFs, and as usual, it worked wonderfully. But it was a bit too slow for my massive load of files: about 2 PDFs per second on average. I needed something faster. After trying out all sorts of methods, I came across an excellent open source Java PDF library called PDFBox. With it, I was able to read text from more than 15 PDF files per second.

PDFBox requires another library called FontBox to work, so I have combined both libraries into one jar file. I use JavaLoader.cfc to load the jar file into ColdFusion. You can download everything at the bottom of the page. Just unzip and run.

   1: <cffunction name="pdftotext" access="public" returntype="string">
   2:  <cfargument name="file"    type="string" required="yes">
   3:  <cfargument name="StartPage" type="string" default="">
   4:  <cfargument name="EndPage"    type="string" default="">
   5:  <cfargument name="loaderpath" hint="JavaLoader.cfc path" default="JavaLoader">
   6: 
   7:  <cfif not StructKeyExists(Application,'Reader')>
   8:  <cfset in = ArrayNew(1)>
   9:  <cfset in[1] = "#ExpandPath('./')#pdfbox.jar">
  10:  <cfset loader    = createObject("component", loaderpath).init(in)>
  11: 
  12:  <cfset Application.Reader    = loader.create("org.pdfbox.pdmodel.PDDocument")>
  13:  <cfset Application.Stripper = loader.create("org.pdfbox.util.PDFTextStripper")>
  14:  </cfif>
  15: 
  16:  <cfif val(arguments.StartPage)>
  17:  <cfset Application.Stripper.setStartPage(arguments.StartPage)>
  18:  </cfif>
  19:  <cfif val(arguments.EndPage)>
  20:  <cfset Application.Stripper.setEndPage(arguments.EndPage)>
  21:  </cfif>
  22:  <cfset mypdf    = Application.Reader.load(arguments.file)>
  23:  <cfset text    = Application.Stripper.getText(mypdf)>
  24:  <cfset mypdf.close()>
  25:  <cfreturn text>
  26: </cffunction>
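
A minimal usage sketch follows, assuming the pdftotext function above is defined or included in the same template and that pdfbox.jar and JavaLoader.cfc sit where the function expects them. The file name sample.pdf and the variable names are hypothetical placeholders, not part of the download.

```cfml
<!--- Hypothetical file name: point this at a PDF that exists on your server --->
<cfset pdfPath = ExpandPath("./sample.pdf")>

<!--- Extract the text of the whole document --->
<cfset fullText = pdftotext(file = pdfPath)>

<!--- Extract only pages 2 through 4 --->
<cfset partText = pdftotext(file = pdfPath, StartPage = 2, EndPage = 4)>

<!--- Show the first 500 characters, escaped for HTML output --->
<cfoutput>#HTMLEditFormat(Left(fullText, 500))#</cfoutput>
```

Because the stripper object is shared in the Application scope, a page range set in one call appears to stay in effect for later calls that pass no range, so it may be safer to pass StartPage and EndPage explicitly whenever ranges are used at all.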

Updated (June 24, 2011): The Reader and Stripper variables have moved to the Application scope. That addresses the memory issue I had with Railo.

Download (pdftotext.zip)
5 Comments :
James Moberg
Friday 10 May 2013 12:09 PM
I have a question. (I know it's been 2 years since this has been posted.)

Is the output similar to (or better than) what you can get with CFPDF & processddx (using PDF Utils)? Does PDFBox return an array with multiple pages of text or just a single text string? (I'm guessing a single text string.)

Is the formatting/flow of the resulting text better than or different from the PDF Utils method?

Thanks.
Monday 13 May 2013 10:39 AM
This does not break pages into array positions. The main reason I use it is that it is fast. I feel the text string has more spaces/tabs than the PDF Utils output, and that helped me a bit when I had to fetch a specific value from the text blob using listfind(). But that behavior could depend on the source file too.
Monday 16 May 2011 11:23 AM
Hi
I figured it out with fusionreactor. It was just using up too much memory. Some of the pdfs were large. Once it hit a certain level of memory usage, cf just froze.
I changed the design to use the PDFBOX command line utility http://pdfbox.apache.org/commandlineutilities/ExtractText.html
and just created a batch file executed via cfexecute. Worked very fast and didn't mess up the coldfusion memory.
Monday 16 May 2011 09:16 AM
Hi
Love this script but I have a strange problem. Running cfmx7.. I use this to index pdfs. I can't index them directly as I need to add stuff. So I use your script to get the text from a pdf then manipulate it.. then add it to my verity collection..
the problem is after it runs a few times, my server crashes. Usually after about 5-10 conversions. Did you see this? Any work-around?
Monday 16 May 2011 11:18 AM
Not with this, but JavaLoader has occasionally given me problems with other things. Try loading the jar file through the CF Administrator and avoiding JavaLoader.cfc. To do that, you also need to remove lines 9 & 10 above and replace lines 12 and 13 with:
<cfset Reader = CreateObject("java", "org.pdfbox.pdmodel.PDDocument")>
<cfset Stripper = CreateObject("java", "org.pdfbox.util.PDFTextStripper")>