spacypdfreader.parsers
PdfminerParser
parser(pdf_path, page_number, **kwargs)
Convert a single PDF page to text using pdfminer.
The pdfminer library is a "pure Python" library for converting PDFs into text. pdfminer is relatively fast, but has lower accuracy than other parsers such as pytesseract.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_path | str | Path to a PDF file. | required |
page_number | int | The page number of the PDF to convert to text. Uses one-based indexing (e.g. the first page of the PDF is page 1, not page 0). | required |
**kwargs | | | {} |
Returns:
Type | Description |
---|---|
str | The PDF page as a string. |
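The one-based page_number matters because pdfminer.six's high-level extract_text function takes zero-based page_numbers, so a wrapper has to subtract one somewhere. The following is a minimal sketch of that conversion, not the spacypdfreader implementation; the helper names to_zero_based and extract_page are hypothetical.

```python
def to_zero_based(page_number: int) -> int:
    """Convert a one-based page number to a zero-based page index."""
    if isinstance(page_number, bool) or not isinstance(page_number, int):
        raise TypeError("page_number must be an int")
    if page_number < 1:
        raise ValueError("page_number uses one-based indexing and must be >= 1")
    return page_number - 1


def extract_page(pdf_path: str, page_number: int, **kwargs) -> str:
    """Hypothetical wrapper: extract one page of text with pdfminer.six."""
    # Imported lazily so to_zero_based works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # extract_text expects a list of zero-based page indexes.
    index = to_zero_based(page_number)
    return extract_text(pdf_path, page_numbers=[index], **kwargs)
```

Validating before subtracting keeps a page_number of 0 from silently turning into index -1.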
Examples:
pdfminer is the default PDF to text parser and will be automatically used unless otherwise specified.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
To be more explicit, import the parser and pass it into the pdf_reader function.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)
For finer tuning, you can pass additional parameters through to pdfminer.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"caching": False}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)
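The pattern in the last example (a parser passed as a callable, with extra keyword arguments forwarded to it) can be illustrated without any PDF libraries. This is a sketch under stated assumptions: read_pages and fake_parser are hypothetical names, and this is not how pdf_reader is implemented internally.

```python
from typing import Callable, List

# Hypothetical alias: any callable shaped like parser(pdf_path, page_number, **kwargs).
PageParser = Callable[..., str]


def read_pages(
    pdf_path: str, page_numbers: List[int], parser: PageParser, **kwargs
) -> List[str]:
    """Call the parser once per page, forwarding extra keyword arguments unchanged."""
    return [parser(pdf_path, page_number, **kwargs) for page_number in page_numbers]


def fake_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Stand-in parser that reports what it received instead of reading a file."""
    options = ", ".join(f"{key}={value!r}" for key, value in sorted(kwargs.items()))
    return f"{pdf_path} page {page_number} [{options}]"


params = {"caching": False}
texts = read_pages("example.pdf", [1, 2], fake_parser, **params)
print(texts)
```

Because the caller never inspects the keyword arguments, each parser is free to accept whatever options its underlying library supports.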
Info
See the pdfminer section in the docs for details on how spacypdfreader uses pdfminer. For details on the pdfminer library itself, refer to the pdfminer docs.
Source code in spacypdfreader/parsers/pdfminer.py
PytesseractParser
parser(pdf_path, page_number, **kwargs)
Convert a single PDF page to text using pytesseract.
The pytesseract library has the highest accuracy of all the PDF to text parsers included in spacypdfreader. It takes a different approach than other parsers. It first converts the PDF to an image, then runs an OCR engine on the image to extract the text. pytesseract results in the best quality but can be very slow compared to other parsers.
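The two-step approach described above (render the page to an image, then run OCR on the image) can be sketched as follows. This is an illustration under assumptions rather than the library's actual code: it assumes pdf2image for rendering, and the function name ocr_page and its render and ocr arguments are hypothetical, added so the two steps can be swapped out or tested without Tesseract installed.

```python
from typing import Any, Callable, Optional


def ocr_page(
    pdf_path: str,
    page_number: int,
    render: Optional[Callable[..., list]] = None,
    ocr: Optional[Callable[..., str]] = None,
    **kwargs: Any,
) -> str:
    """Render one page (one-based) to an image, then run OCR on that image."""
    if render is None:
        # Assumption: pdf2image does the rendering (it needs poppler installed).
        from pdf2image import convert_from_path

        render = convert_from_path
    if ocr is None:
        # pytesseract needs the Tesseract binary installed.
        import pytesseract

        ocr = pytesseract.image_to_string

    # Step 1: render only the requested page; first_page/last_page are one-based.
    images = render(pdf_path, first_page=page_number, last_page=page_number)
    # Step 2: OCR the image; extra keyword arguments go straight to the OCR call.
    return ocr(images[0], **kwargs)
```

Rendering only the requested page limits the cost of each call, which matters because the OCR step is the slow part.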
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_path | str | Path to a PDF file. | required |
page_number | int | The page number of the PDF to convert to text. Uses one-based indexing (e.g. the first page of the PDF is page 1, not page 0). | required |
**kwargs | | | {} |
Returns:
Type | Description |
---|---|
str | The PDF page as a string. |
Examples:
To use pytesseract, explicitly import its parser and pass it into the pdf_reader function.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)
For finer tuning, you can pass additional parameters through to pytesseract.
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"nice": 1}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)
Info
See the pytesseract section in the docs for details on how spacypdfreader uses pytesseract. For details on the pytesseract library itself, see the pytesseract docs.