PDFs are intended to be an uneditable format that displays exactly as printed. This isn't necessarily a problem for extracting text, as you can still copy text to the clipboard. The problem is that PDF files often don't contain any actual "text" data, but instead are really just photographs of a hard-copy document. This makes it impossible to extract any text without copying it out manually or using separate OCR software. Moreover, the user has no way of knowing, before opening the file and trying to make a selection, whether they're getting real text or just a photo of text, as both have the same .pdf extension. This can be really inconvenient for the average user, and is a major problem for those who need assistive technology such as screen readers.
Below are a few examples of what I mean. No single PDF SDK will understand everything that you need. If possible, I suggest finding a PDF SDK that lets you read the data out of the PDF, and then parsing that data yourself. That way, if it does not parse, you know what not to do and can look for another PDF SDK. Below is an example of a PDF file I found that contains XML-formatted data. If you come across one like it, try reading it this way.