How to get or extract content from PDF Files

jcarlos · April 30, 2010, 4:51pm

I need to get or extract content from PDFs. I was wondering whether it could be done the same way we fetch pages with plugins.http.getPageData(URL). Has anybody written a plugin or figured out how to do this before? Any ideas?

BTW, once I figure this out, I’ll post the entire recipe in the Forum.

jcarlos

ptalbot · April 30, 2010, 5:28pm

Apache PDFBox would be the library to wrap in a plugin to do what you want…
@see http://pdfbox.apache.org/
You can choose to extract plain text or even html.

I don’t know of any plugin that does it but encapsulating PDFBox functions should be quite easy.

Westy · April 30, 2010, 10:55pm

There are also free utilities out there that do this. For example, see:

http://www.a-pdf.com/text/index.htm

Notice they also have a command line version.

Dean Westover
Choices Software, Inc.

ptalbot · April 30, 2010, 11:04pm

Westy:
Free PDF Text Extractor: Convert PDF to text file. [A-PDF.com]

Yes, but this is Windows-only, and you have to install it yourself on the client…

jcarlos · May 1, 2010, 1:17am

Thank you very much, Dean and Patrick, for the tips.

What I am trying to do is create a plug-in or “encapsulated PDFBox functions” that will do what the http.getPageData(URL) does but over PDF files. In the same way, it should be able to capture the data from a specific location (URL), but instead of getting the page (HTML) data, it should get the text embedded in the PDF.
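A minimal sketch of that idea in plain Java, in case it helps: it downloads the bytes behind a URL, checks that they start with the "%PDF-" marker, and hands them to a text extractor. The names PdfPageData, TextExtractor, fetch, looksLikePdf and getPdfData are made up for illustration and are not part of Servoy or PDFBox. The extraction step itself is left pluggable; the comment shows how PDFBox might fill it in, assuming PDFBox 2.x is on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, named for illustration only. */
final class PdfPageData {

    /**
     * Turns raw PDF bytes into text. The real work belongs to a PDF library.
     * Assumption: with PDFBox 2.x available, an extractor could look like
     *   bytes -> { try (PDDocument doc = PDDocument.load(bytes)) {
     *                  return new PDFTextStripper().getText(doc); } }
     */
    interface TextExtractor {
        String extract(byte[] pdfBytes) throws IOException;
    }

    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads everything the URL returns into memory. */
    static byte[] fetch(URL url) throws IOException {
        try (InputStream in = url.openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /**
     * Strict check for the "%PDF-" marker at offset 0. Some real-world
     * files carry a few junk bytes before the header and would be rejected.
     */
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Fetches the URL and returns the text the extractor finds in it. */
    static String getPdfData(URL url, TextExtractor extractor) throws IOException {
        byte[] data = fetch(url);
        if (!looksLikePdf(data)) {
            throw new IOException("Not a PDF: " + url);
        }
        return extractor.extract(data);
    }
}
```

Keeping the extractor as a parameter means the download and validation code does not depend on any particular PDF library, so PDFBox could be swapped for another one without touching the rest.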

I will definitely use the ‘A-PDF Text Extractor’ on my laptop. I will also get the ‘A-PDF Restrictions Remover’ for $9.99. However, I don’t think that the ‘A-PDF Text Extractor’ will work in our application because of the limitation that Patrick pointed out.

We actually have a server application, Adlib Express, that is also Windows-based and does OCR at an incredible speed. I recommend it for high-volume PDF processing.

The thing is that we don’t need it any longer: OCR is no longer an issue, since we now deal with PDF files that were originally formatted to contain text. Because of this, I can now streamline the process by simply accessing the PDF and extracting its data. This will be part of a larger method that gets the PDF, puts it into two different locations, and then processes it.
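The "get the PDF and put it into two locations" step could be sketched roughly like this, using only the Java standard library. The class name PdfDistributor, the method storeInTwoPlaces and its parameters are hypothetical; the actual locations and the processing that follows are not specified here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper, named for illustration only. */
final class PdfDistributor {

    /**
     * Downloads the PDF once into firstDir, then copies that file into
     * secondDir, so the source is only read a single time.
     * Returns the two resulting paths, first location first.
     */
    static Path[] storeInTwoPlaces(URL source, String fileName,
                                   Path firstDir, Path secondDir) throws IOException {
        // Create the target folders if they do not exist yet.
        Files.createDirectories(firstDir);
        Files.createDirectories(secondDir);

        Path first = firstDir.resolve(fileName);
        Path second = secondDir.resolve(fileName);

        // Stream the download straight to disk, overwriting any old copy.
        try (InputStream in = source.openStream()) {
            Files.copy(in, first, StandardCopyOption.REPLACE_EXISTING);
        }
        // Duplicate the local file instead of downloading a second time.
        Files.copy(first, second, StandardCopyOption.REPLACE_EXISTING);

        return new Path[] { first, second };
    }
}
```

Either returned path can then be passed on to whatever does the text extraction.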

I might find a workaround for the whole thing. Whatever solution I build, I will share with the Forum.

Again, thank you very much!

jcarlos

Roberto_Blasco · May 2, 2010, 6:54pm

Hi jcarlos

Try Tika from the Apache project.

Best regards. Roberto.

ptalbot · May 2, 2010, 7:26pm

Tika is using PDFBox to extract PDF text content anyway…

Roberto_Blasco · May 3, 2010, 8:02am

You’re right ptalbot :oops:

Best Regards. Roberto.

sbutler · May 3, 2010, 9:52pm

iText may also do what you want. That’s what I use in the PDF Pro plugin. http://sourceforge.net/projects/itext/
If you want to sponsor the development, I could give you a quote on what it would take to add it to my PDF Pro plugin: http://www.servoyguy.com/servoy_compone … pro_plugin
