I need to extract content from PDFs. I was wondering whether it could be done the same way we fetch pages with plugins.http.getPageData(URL). Has anybody written a plugin or figured out how to do this before? Any ideas?
BTW, once I figure this out, I’ll post the entire recipe in the Forum.
Apache PDFBox would be the library to wrap in a plugin to do what you want; see http://pdfbox.apache.org/
You can choose to extract plain text or even HTML.
I don’t know of any plugin that does it, but encapsulating the PDFBox functions should be quite easy.
Thank you very much, Dean and Patrick, for the tips.
What I am trying to do is create a plug-in (or “encapsulated PDFBox functions”) that does what http.getPageData(URL) does, but for PDF files. In the same way, it should fetch the data from a specific location (URL), but instead of returning the page (HTML) data, it should return the text embedded in the PDF.
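A minimal sketch of such a getPageData-style helper for PDFs, in plain Java. The names `PdfPageData`, `TextExtractor`, `fetch`, `looksLikePdf` and `getPdfData` are hypothetical and not part of any existing plugin. The PDFBox call is only indicated in a comment and assumes PDFBox 2.x; it has not been checked against any particular installation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class PdfPageData {

    /**
     * Hypothetical hook that turns raw PDF bytes into text.
     * Assumption: with Apache PDFBox 2.x on the classpath it could be
     *   bytes -> { try (PDDocument doc = PDDocument.load(bytes)) {
     *                  return new PDFTextStripper().getText(doc); } }
     * (PDFBox 3.x loads documents through Loader.loadPDF instead.)
     */
    public interface TextExtractor {
        String extract(byte[] pdfBytes) throws IOException;
    }

    /** Returns true when the data starts with the "%PDF-" file signature. */
    public static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Downloads the raw bytes behind a URL (http, https or file). Needs Java 9+. */
    public static byte[] fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    /** PDF counterpart of getPageData: fetch the file, then hand it to the extractor. */
    public static String getPdfData(String url, TextExtractor extractor) throws IOException {
        byte[] data = fetch(url);
        if (!looksLikePdf(data)) {
            throw new IOException("Not a PDF: " + url);
        }
        return extractor.extract(data);
    }
}
```

Keeping the extractor behind a small interface means the download step can be tried without PDFBox, and the library call can be swapped in later.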
I will definitely use the ‘A-PDF Text Extractor’ on my laptop. I will also get the ‘A-PDF Restrictions Remover’ for $9.99. However, I don’t think that the ‘A-PDF Text Extractor’ will work in our application because of the limitation that Patrick pointed out.
We actually have a server application that is also Windows-based, and it also does OCR at an incredible speed. I actually recommend this application (Adlib Express) for high-volume PDF processing.
The thing is that we don’t need it any longer (i.e. the OCR capability is no longer an issue, since we now deal with PDF files that were originally formatted to contain text). Because of this, I can now streamline the process by simply accessing the PDF and extracting its data. This will be part of an entire method that gets the PDF, puts it into two different locations, and then processes it.
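The "store it in two locations" step could look roughly like the sketch below, using only the Java standard library. The class name `PdfIntake`, the method `storeInTwoPlaces` and the directory parameters are hypothetical; the real locations and the processing that follows are not specified here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PdfIntake {

    /**
     * Copies one PDF into two target directories (created if missing)
     * and returns the two resulting paths, first dirA, then dirB.
     */
    public static Path[] storeInTwoPlaces(Path source, Path dirA, Path dirB) throws IOException {
        Files.createDirectories(dirA);
        Files.createDirectories(dirB);
        Path a = dirA.resolve(source.getFileName());
        Path b = dirB.resolve(source.getFileName());
        // Overwrite any earlier copy with the same file name.
        Files.copy(source, a, StandardCopyOption.REPLACE_EXISTING);
        Files.copy(source, b, StandardCopyOption.REPLACE_EXISTING);
        return new Path[] { a, b };
    }
}
```

Either returned path can then be passed to whatever text-extraction step comes next.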
I might find a workaround for the whole thing. Whatever solution I build, I will share it with the Forum.