Extract Links from an HTML Page

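Here is a simple ColdFusion function that extracts URLs and link titles from a given HTML string. It returns an array of structs, one per link, each holding the URL (key a) and the link title (key title). Only anchors that contain an absolute http, https, or ftp URL are included.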

<cfhttp
    url       = "http://www.google.com/search?hl=en&q=paris+hilton&aq=f&oq="
    userAgent = "#cgi.HTTP_USER_AGENT#">
</cfhttp>
<cfdump var="#getLinks(cfhttp.fileContent)#">

<cffunction name="getLinks" access="public" returntype="array" output="no" hint="Separates links from a given HTML string and returns them as an array">
    <cfargument name="html" type="string" required="yes" hint="HTML string with links">
    <cfset local.startpos = 1>
    <cfset local.list = ArrayNew(1)>

    <!--- Scan forward one anchor element per pass; the cfbreak below ends the loop --->
    <cfloop condition="local.startpos GREATER THAN 0">
        <cfset local.linkpos = reFindNoCase('<a\b[^>]*>(.*?)</a>', arguments.html, local.startpos, true)>

        <cfif val(local.linkpos.len[1])>
            <!--- Resume the next search just past this anchor --->
            <cfset local.startpos = local.linkpos.pos[1] + local.linkpos.len[1]>
            <cfset local.string = mid(arguments.html, local.linkpos.pos[1], local.linkpos.len[1])>
            <!--- Look for an absolute http, https, or ftp URL inside the anchor --->
            <cfset local.hrefpos = reFindNoCase('(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+##]*[\w\-\@?^=%&/~\+##])?', local.string, 1, true)>
            <cfif val(local.hrefpos.pos[1])>
                <cfset local.this = StructNew()>
                <cfset local.this.a = mid(local.string, local.hrefpos.pos[1], local.hrefpos.len[1])>
                <!--- The title is the anchor with its opening and closing tags removed --->
                <cfset local.this.title = reReplaceNoCase(local.string, '<a\b[^>]*>', "")>
                <cfset local.this.title = reReplaceNoCase(local.this.title, '</a>', "")>
                <cfset ArrayAppend(local.list, local.this)>
                <cfset StructDelete(local, 'this')>
            </cfif>
        <cfelse>
            <!--- No more anchors in the string --->
            <cfbreak>
        </cfif>
    </cfloop>

    <cfreturn local.list>
</cffunction>
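
To try the function without making an HTTP request, something like the following sketch may help. It assumes getLinks is defined in (or included into) the same template. The sampleHtml string, the example.com and example.org URLs, and the links and link variable names are made up for illustration. encodeForHTML is assumed to be available (ColdFusion 10 or later, or Lucee); on older servers HTMLEditFormat could be used instead.

```cfm
<!--- Hypothetical sample markup: two absolute links and one relative link --->
<cfset sampleHtml = '<p><a href="http://example.com/a">First</a>, <a href="/relative">Skipped</a>, <a href="https://example.org/b?x=1">Second</a></p>'>

<!--- Each array element is a struct with the keys "a" (URL) and "title" --->
<cfset links = getLinks(sampleHtml)>

<cfoutput>
<cfloop array="#links#" index="link">
    #encodeForHTML(link.a)# : #encodeForHTML(link.title)#<br>
</cfloop>
</cfoutput>
```

The relative link should not appear in the result, because the function only keeps anchors in which it finds an absolute http, https, or ftp URL.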