Read PDF Text in ColdFusion with PDFBox
Friday 24 June 2011 05:28 PM
Last week I had to extract text content from around 1 TB worth of PDF files and then match each PDF with database records. First I tried the CFPDF method to strip text from the PDFs, and as usual, it worked wonderfully. But it was a bit too slow for my massive load of files: about 2 PDFs per second on average. I needed something faster. After trying out all sorts of methods, I came across an excellent open source Java PDF library called PDFBox. With that, I was able to read text from more than 15 PDF files per second.
PDFBox requires another library called FontBox to work, so I combined both libraries into a single jar file. I also use JavaLoader.cfc to load the jar file into ColdFusion. You can download everything at the bottom of the page. Just unzip and run.
1: <cffunction name="pdftotext" access="public" returntype="string">
2:     <cfargument name="file" type="string" required="yes">
3:     <cfargument name="StartPage" type="string" default="">
4:     <cfargument name="EndPage" type="string" default="">
5:     <cfargument name="loaderpath" hint="JavaLoader.cfc path" default="JavaLoader">
6:
7:     <cfif not StructKeyExists(Application,'Reader')>
8:         <cfset in = ArrayNew(1)>
9:         <cfset in[1] = "#ExpandPath('./')#pdfbox.jar"> <!--- JavaLoader expects an array of jar paths --->
10:         <cfset loader = createObject("component", loaderpath).init(in)>
11:
12:         <cfset Application.Reader = loader.create("org.pdfbox.pdmodel.PDDocument")>
13:         <cfset Application.Stripper = loader.create("org.pdfbox.util.PDFTextStripper")>
14:     </cfif>
15:
16:     <cfif val(arguments.StartPage)>
17:         <cfset Application.Stripper.setStartPage(arguments.StartPage)>
18:     </cfif>
19:     <cfif val(arguments.EndPage)>
20:         <cfset Application.Stripper.setEndPage(arguments.EndPage)>
21:     </cfif>
22:     <cfset mypdf = Application.Reader.load(arguments.file)>
23:     <cfset text = Application.Stripper.getText(mypdf)>
24:     <cfset mypdf.close()> <!--- close the document that load() returned --->
25:     <cfreturn text>
26: </cffunction>
Updated: (June 24, 2011) The Reader and Stripper variables moved to Application scope. That addresses the memory issue I had with Railo.
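As a rough usage sketch, calling the function might look like the snippet below. It assumes pdftotext is available on the calling page (for example through an include) and that pdfbox.jar sits in the same folder. The file name sample.pdf is hypothetical.

```cfml
<!--- Hypothetical path: point this at a real PDF on your server --->
<cfset pdfPath = ExpandPath("./sample.pdf")>

<!--- Extract the text of pages 1 to 3 only --->
<cfset pdfText = pdftotext(file=pdfPath, StartPage=1, EndPage=3)>

<!--- Show the first 500 characters, escaped for HTML output --->
<cfoutput>#HTMLEditFormat(Left(pdfText, 500))#</cfoutput>
```

Because a single Stripper is kept in Application scope, page limits set in one call will probably carry over to later calls that leave StartPage and EndPage empty, so it seems safer to pass them every time.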
Posted by Saman W Jayasekara at Thursday 24 December 2009 11:34 PM
Friday 10 May 2013 12:09 PM
I have a question. (I know it's been 2 years since this has been posted.)
Is the output similar to (or better than) what you can get with CFPDF & processddx (using PDF Utils)? Does PDFBox return an array with multiple pages of text or just a single text string? (I'm guessing a single text string.)
Are the formatting and flow of the text better than, or different from, the PDF Utils method?
Monday 13 May 2013 10:39 AM
This does not break pages into array positions. The main reason I use it is that it is fast. I feel the text string has more spaces/tabs than the PDF Utils output, and that helped me a bit when I had to fetch a specific value out of the text blob using listfind(). But that behavior could depend on the source file too.
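To illustrate the listfind() idea, here is a minimal sketch that treats the extracted text as a whitespace-delimited list and picks the item that follows a label. The sample text and the "Total:" label are made up for illustration. Real PDF output will differ from file to file.

```cfml
<!--- Made-up sample standing in for text returned by pdftotext() --->
<cfset text = "Invoice No:" & chr(9) & "INV-1001" & chr(10) & "Total:" & chr(9) & "250.00">

<!--- Treat spaces, tabs, and line breaks as list delimiters --->
<cfset delims = " " & chr(9) & chr(10) & chr(13)>

<!--- Find the label, then read the list item that follows it --->
<cfset pos = ListFindNoCase(text, "Total:", delims)>
<cfif pos gt 0 and pos lt ListLen(text, delims)>
    <cfset value = ListGetAt(text, pos + 1, delims)>
    <cfoutput>#value#</cfoutput>
</cfif>
```

This only works when the label and its value come out as separate, adjacent tokens, which is why the extra spaces and tabs in the PDFBox output can help.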
Monday 16 May 2011 11:23 AM
I figured it out with FusionReactor. It was just using up too much memory. Some of the PDFs were large. Once it hit a certain level of memory usage, CF just froze.
I changed the design to use the PDFBox command line utility (http://pdfbox.apache.org/commandlineutilities/ExtractText.html)
and just created a batch file executed via cfexecute. It worked very fast and didn't mess up the ColdFusion memory.
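A sketch of what that cfexecute approach could look like is below. The batch file name, its location, and the PDF and text file paths are all hypothetical. The batch file is assumed to wrap the PDFBox ExtractText command line tool and to take the input PDF and the output text file as its two arguments.

```cfml
<!--- Hypothetical paths: adjust to your own server layout --->
<cfset batchFile = "C:\scripts\extracttext.bat">
<cfset pdfPath = "C:\pdfs\sample.pdf">
<cfset txtPath = "C:\pdfs\sample.txt">

<!--- Run the extraction outside the ColdFusion JVM and wait up to 60 seconds --->
<cfexecute name="#batchFile#"
           arguments='"#pdfPath#" "#txtPath#"'
           timeout="60"
           variable="cmdOutput">
</cfexecute>

<!--- Read the text file the batch job is assumed to have written --->
<cfif FileExists(txtPath)>
    <cffile action="read" file="#txtPath#" variable="pdfText">
</cfif>
```

Running the extraction in a separate process means the memory used for large PDFs is released when that process exits, which matches the behavior described above.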
Monday 16 May 2011 09:16 AM
Love this script, but I have a strange problem. I'm running CFMX 7 and use this to index PDFs. I can't index them directly as I need to add stuff, so I use your script to get the text from a PDF, manipulate it, and then add it to my Verity collection.
The problem is that after it runs a few times, my server crashes, usually after about 5-10 conversions. Did you see this? Any workaround?
Monday 16 May 2011 11:18 AM
Not with this, but JavaLoader has occasionally given me problems with other things. Try loading the jar file through the CF Administrator and avoiding JavaLoader.cfc. To do that, you also need to remove lines 9 and 10 above and
replace lines 12 and 13 with:
<cfset Reader = CreateObject("java", "org.pdfbox.pdmodel.PDDocument")>
<cfset Stripper = CreateObject("java", "org.pdfbox.util.PDFTextStripper")>