How to get or extract content from PDF Files

I need to get or extract content from PDFs. I was wondering if it could be done in the same way we do by using plugins.http.getPageData(URL). Has anybody written a plugin or figured out how to do this before? Any ideas?

BTW, once I figure this out, I’ll post the entire recipe in the Forum.

jcarlos

Apache PDFBox would be the library to use in a plugin to do what you want…
@see http://pdfbox.apache.org/
You can choose to extract plain text or even HTML.

I don’t know of any plugin that does it but encapsulating PDFBox functions should be quite easy.

There are also free utilities out there that do this. For example, see:

http://www.a-pdf.com/text/index.htm

Notice they also have a command line version.

Dean Westover
Choices Software, Inc.

Westy:
Free PDF Text Extractor: Convert PDF to text file. [A-PDF.com]

Yes, but this is Windows only, and you have to install it yourself on the client…

Thank you very much Dean and Patrick for the tips.

What I am trying to do is create a plug-in or “encapsulated PDFBox functions” that will do what the http.getPageData(URL) does but over PDF files. In the same way, it should be able to capture the data from a specific location (URL), but instead of getting the page (HTML) data, it should get the text embedded in the PDF.
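A rough sketch of what such a wrapper could look like in plain Java, in case it helps frame the plugin. Everything named here (`PdfPageData`, `TextExtractor`, `getPdfData`) is hypothetical and not part of Servoy or PDFBox; the actual text extraction is left as a pluggable hook, on the assumption that a library such as PDFBox would be wired in behind it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: a getPageData-style call for PDFs. Not an existing plugin API. */
class PdfPageData {

    /** Hypothetical hook that turns raw PDF bytes into text (e.g. backed by a PDF library). */
    interface TextExtractor {
        String extract(byte[] pdfBytes) throws IOException;
    }

    // PDF files begin with the marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Strict check: true only if the data starts with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads everything behind the URL into memory. */
    static byte[] download(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }

    /** Rejects non-PDF data, then hands the bytes to the supplied extractor. */
    static String getPdfText(byte[] pdfBytes, TextExtractor extractor) throws IOException {
        if (!looksLikePdf(pdfBytes)) {
            throw new IOException("Not a PDF (missing %PDF- header)");
        }
        return extractor.extract(pdfBytes);
    }

    /** The getPageData-like entry point: fetch the URL, return the extracted text. */
    static String getPdfData(String url, TextExtractor extractor) throws IOException {
        return getPdfText(download(url), extractor);
    }
}
```

The header check is there so that an HTML error page returned by the server fails fast instead of being fed to the PDF parser. With PDFBox on the classpath, the extractor could presumably wrap its PDFTextStripper class; that wiring is left out because the loading calls differ between PDFBox versions.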

I will definitely use the ‘A-PDF Text Extractor’ on my laptop. I will also get the ‘A-PDF Restrictions Remover’ for $9.99. However, I don’t think that the ‘A-PDF Text Extractor’ will work in our application because of the limitation that Patrick pointed out.

We actually have a server application that is also Windows based, and it also does OCR at an incredible speed. I actually recommend this application for high volume PDF processing (Adlib Express).

The thing is that we don’t need it any longer (e.g. the OCR capability is no longer an issue since we now deal with PDF files that were originally formatted to contain text). Because of this, I can now streamline the process by simply accessing the PDF and extracting its data. This will be part of an entire method that gets the PDF, puts it into two different locations and then processes it.

I might find a workaround for the whole thing. Whatever solution I build, I will share with the Forum.

Again, thank you very much!

jcarlos

Hi jcarlos

Try Tika from the Apache Project.

Best regards. Roberto.

Tika is using PDFBox to extract PDF text content anyway…

You’re right ptalbot :oops:

Best Regards. Roberto.

iText may also do what you want. That’s what I use in the PDF Pro plugin. http://sourceforge.net/projects/itext/
If you want to sponsor the development, I could give you a quote on what it would take to add it to my PDF Pro plugin: http://www.servoyguy.com/servoy_compone … pro_plugin