Parsers
Extracting text from PDF documents can be challenging, and the Python ecosystem offers several different options. spacypdfreader makes it easy to extract text from PDF documents. At this time spacypdfreader has built-in support for two parsers: pdfminer and pytesseract.
You can also bring your own custom PDF to text parser to use in spacypdfreader.
Tip
💁‍♂️ Would you like to see another parser added? Please submit an issue on GitHub and the maintainer will look into adding support.
Tip
Parsing big PDFs can be slow. For example, parsing a 166-page PDF document on an M1 Mac took 166 seconds. If you are working with larger documents, try breaking them into smaller documents and using multiprocessing.
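The tip above can be sketched roughly as follows. This is only an illustration of splitting the work into page ranges and farming them out to worker processes, not something spacypdfreader provides: `page_batches` and `run_batches` are hypothetical helper names, and the `worker` you pass in is assumed to be your own top-level function that extracts the text for one page range.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Tuple


def page_batches(total_pages: int, batch_size: int) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


def run_batches(
    worker: Callable[[Tuple[int, int]], str],
    total_pages: int,
    batch_size: int,
    max_workers: int = 4,
) -> List[str]:
    """Apply `worker` to every batch in separate processes, keeping page order.

    `worker` is a hypothetical function of your own; it must be defined at
    module level so that it can be pickled and sent to the worker processes.
    """
    batches = page_batches(total_pages, batch_size)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `batches`.
        return list(pool.map(worker, batches))
```

When using this from a script, call `run_batches` inside an `if __name__ == "__main__":` block so that worker processes can be started safely on every platform.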
Comparison of built in parsers
All PDF to text parsers have their tradeoffs. The table below summarizes the pros and cons of the built-in parsers.
| | pdfminer | pytesseract |
---|---|---|
When to use | ⚡️ When speed is more important than accuracy. | 🎓 When accuracy is more important than speed. |
Accuracy | 👌 Medium: in my experience pdfminer struggles with documents where the text is in one or more columns. | 👍 High: very good. Performs well on messy documents (e.g. handwritten text, PDFs with multiple columns of text on a single page). |
Speed | 👌 Medium: the text extraction is not instant, but it does not take forever. | 👎 Slow: the text extraction is very slow and will take hours on hundreds of pages. |
Installation | 👍 Easy: pure Python, if you have installed spacypdfreader you already have everything you need. | 👎 Complicated: relies on additional non-Python dependencies that can be complicated for beginners to install. |
How it works | Text is extracted directly from the PDF using only Python. | Each PDF page is converted into an image. Optical character recognition is then run on each image. |
pdfminer
A pure Python library for extracting text from PDFs.
Installation
No action required: pdfminer is installed automatically when you install spacypdfreader.
Usage
pdfminer is the default PDF to text extraction parser for spacypdfreader:
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
You can also be more explicit and pass in additional parameters. For a list of available parameters please refer to the pdfminer documentation for the extract_text function.
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers.pdfminer import PdfminerParser
nlp = spacy.load("en_core_web_sm")
params = {"caching": False}
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PdfminerParser, **params)
pytesseract
A PDF to text extraction parser that uses Google's Tesseract OCR engine.
Installation
You can install most of the dependencies by pip installing spacypdfreader with some optional dependencies:
Unfortunately this will not always install all of the dependencies, because some of them are not Python packages. I find that installing pytesseract can be a little tricky for beginners. Please refer to https://github.com/madmaze/pytesseract#installation for details on how to install pytesseract if the above does not work.
Usage
To use pytesseract you must pass the pytesseract parser into the pdf_parser argument. For a list of available parameters you can pass in, refer to the documentation for the image_to_string function from pytesseract.
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers.pytesseract import PytesseractParser
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PytesseractParser)
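To make the "convert each page to an image, then run OCR" approach more concrete, here is a minimal sketch of that idea for a single page. It is not spacypdfreader's own implementation: `ocr_page` is a hypothetical name, and the sketch assumes that the pdf2image and pytesseract packages (plus their non-Python dependencies, Poppler and Tesseract) are installed.

```python
def ocr_page(pdf_path: str, page_number: int, **ocr_kwargs) -> str:
    """Return the OCR text of one page; `page_number` uses 1-based indexing.

    Hypothetical helper for illustration only. Extra keyword arguments
    (for example lang="eng") are forwarded to pytesseract.image_to_string.
    """
    # Imported lazily so the module can be loaded without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested page to a PIL image.
    images = convert_from_path(
        pdf_path, first_page=page_number, last_page=page_number
    )
    if not images:
        # Nothing was rendered, e.g. the page number is past the last page.
        return ""
    # Run optical character recognition on the rendered page image.
    return pytesseract.image_to_string(images[0], **ocr_kwargs)
```

Rendering and recognizing every page this way is what makes OCR-based extraction slow compared with reading the text directly from the PDF.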
Bring your own parser
spacypdfreader allows you to bring your own custom PDF parser. For examples of how to implement your own parser refer to:
- https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pdfminer.py, or
- https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pytesseract.py.
To work with spacypdfreader a parser must be a function that:
- Has an argument named pdf_path.
- Has an argument named page_number. This argument should use 1-based indexing, e.g. the value 1 refers to the first page of the PDF.
- Returns the text for a single page of the PDF only. This allows spacypdfreader to execute faster with multiprocessing.
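As a rough illustration of these requirements, a custom parser could look something like the sketch below. The names `to_zero_based` and `my_parser` are hypothetical, and the sketch assumes pdfminer.six is installed; it uses pdfminer's `extract_text` function, whose `page_numbers` argument is zero-indexed, so the 1-based `page_number` has to be converted first.

```python
def to_zero_based(page_number: int) -> int:
    """Convert a 1-based page number into a zero-based page index."""
    if page_number < 1:
        raise ValueError("page_number uses 1-based indexing and must be >= 1")
    return page_number - 1


def my_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Return the text of a single page of the PDF (hypothetical example).

    Extra keyword arguments are forwarded to pdfminer's extract_text.
    """
    # Imported lazily so the module can be loaded without pdfminer installed.
    from pdfminer.high_level import extract_text

    # extract_text expects zero-indexed page numbers.
    return extract_text(
        pdf_path, page_numbers=[to_zero_based(page_number)], **kwargs
    )
```

Assuming a custom function is passed in the same way as the built-in parsers, it would go into the pdf_parser argument, for example `pdf_reader("tests/data/test_pdf_01.pdf", nlp, my_parser)`.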
Warning
Version 0.3.0 changed how parsers are implemented. If you have created a custom parser that works with an older version of spacypdfreader, it will need to be reimplemented.