
spacypdfreader.parsers

spacypdfreader.parsers.pdfminer

PdfminerParser

This class has been included for backwards compatibility. Do not use.

Source code in spacypdfreader/parsers/pdfminer.py
class PdfminerParser:
    """This class has been included for backwards compatibility. Do not use."""

    def __init__(self):
        return None

parser(pdf_path: str, page_number: int, **kwargs)

Convert PDFs to text using pdfminer.

The pdfminer library is a "pure Python" library for converting PDFs into text. pdfminer is relatively fast, but has lower accuracy than other parsers such as pytesseract.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pdf_path` | `str` | Path to a PDF file. | required |
| `page_number` | `int` | The page number of the PDF to convert to text. Uses one-based indexing (e.g. the first page of the PDF is page 1, not page 0). | required |
| `**kwargs` | | `**kwargs` will be passed to pdfminer.high_level.extract_text. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| | `str` | The PDF page as a string. |

Examples:

pdfminer is the default PDF-to-text parser and is used automatically unless another parser is specified.

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

To be more explicit, import the parser and pass it to the pdf_reader function.

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)

For finer control, you can pass additional parameters through to pdfminer.

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pdfminer import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"caching": False}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)

Info

See the pdfminer section in the docs for details on how spacypdfreader uses pdfminer. For details on pdfminer itself, refer to the pdfminer docs.

Source code in spacypdfreader/parsers/pdfminer.py
def parser(pdf_path: str, page_number: int, **kwargs):
    """Convert PDFs to text using pdfminer.

    The pdfminer library is a "pure Python" library for converting PDFs into
    text. pdfminer is relatively fast, but has lower accuracy than other parsers
    such as [pytesseract](/parsers/#pytesseract).

    Parameters:
        pdf_path: Path to a PDF file.
        page_number: The page number of the PDF to convert to text. Uses
            one-based indexing (e.g. the first page of the PDF is page 1, not
            page 0).
        **kwargs: `**kwargs` will be passed to
            [`pdfminer.high_level.extract_text`](https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-text).

    Returns:
        str: The PDF page as a string.

    Examples:
        pdfminer is the default PDF-to-text parser and is used automatically
        unless another parser is specified.

        >>> import spacy
        >>> from spacypdfreader import pdf_reader
        >>>
        >>> nlp = spacy.load("en_core_web_sm")
        >>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

        To be more explicit, import the parser and pass it to the
        `pdf_reader` function.

        >>> import spacy
        >>> from spacypdfreader import pdf_reader
        >>> from spacypdfreader.parsers.pdfminer import parser
        >>>
        >>> nlp = spacy.load("en_core_web_sm")
        >>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)

        For finer control, you can pass additional parameters through to pdfminer.

        >>> import spacy
        >>> from spacypdfreader import pdf_reader
        >>> from spacypdfreader.parsers.pdfminer import parser
        >>>
        >>> nlp = spacy.load("en_core_web_sm")
        >>> params = {"caching": False}
        >>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)

    Info:
        See the [pdfminer section](/parsers/#pdfminer) in the docs for details
        on how spacypdfreader uses pdfminer. For details on pdfminer itself,
        refer to the [pdfminer docs](https://pdfminersix.readthedocs.io/en/latest/).
    """
    # Check whether the user has provided the `page_numbers` kwarg. It is not
    # valid here, so raise an error. See: https://github.com/SamEdwardes/spacypdfreader/issues/16
    if "page_numbers" in kwargs:
        raise ValueError(
            "The `page_numbers` kwarg is not valid when using the pdfminer parser. "
            "Please use `page_range` instead."
        )

    # pdfminer uses zero-indexed page numbers, so subtract 1 from the
    # one-based page number.
    page_number -= 1
    text = extract_text(pdf_path, page_numbers=[page_number], **kwargs)
    return text
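
The two steps in the function body above, rejecting `page_numbers` and shifting from one-based to zero-based indexing, can be sketched in isolation without pdfminer installed. `build_extract_kwargs` is a hypothetical helper used only for illustration and is not part of spacypdfreader; the `page_number < 1` check is an added assumption, not something the source does.

```python
def build_extract_kwargs(page_number: int, **kwargs) -> dict:
    """Return the keyword arguments the parser would hand to extract_text.

    Hypothetical helper for illustration; not part of spacypdfreader.
    """
    # Mirror the guard in the source: the caller must not supply page_numbers.
    if "page_numbers" in kwargs:
        raise ValueError("The `page_numbers` kwarg is not valid here.")
    # Assumption (not in the source): reject page numbers below 1 early.
    if page_number < 1:
        raise ValueError("page_number uses one-based indexing and must be >= 1.")
    # pdfminer counts pages from zero, so page 1 becomes index 0.
    return {"page_numbers": [page_number - 1], **kwargs}


print(build_extract_kwargs(1, caching=False))
# → {'page_numbers': [0], 'caching': False}
```

Any extra keyword arguments, such as `caching` here, pass through unchanged alongside the converted page index.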

spacypdfreader.parsers.pytesseract

PytesseractParser

This class has been included for backwards compatibility. Do not use.

Source code in spacypdfreader/parsers/pytesseract.py
class PytesseractParser:
    """This class has been included for backwards compatibility. Do not use."""

    def __init__(self):
        return None

parser(pdf_path: str, page_number: int, **kwargs)

Convert a single PDF page to text using pytesseract.

The pytesseract library has the highest accuracy of the PDF-to-text parsers included in spacypdfreader. Unlike the other parsers, it first converts the PDF page to an image and then runs an OCR engine on the image to extract the text. pytesseract gives the best quality but can be very slow compared to other parsers.
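
One way to act on this speed and accuracy trade-off is to try a fast parser first and fall back to OCR only when a page yields no text, as can happen with a scanned page. The sketch below relies only on the `parser(pdf_path, page_number, **kwargs)` signature shown in this reference. `with_fallback` and the two stand-in parsers are hypothetical and not part of spacypdfreader.

```python
from typing import Callable

# Signature shared by the parsers documented here:
# (pdf_path, page_number, **kwargs) -> str
PageParser = Callable[..., str]


def with_fallback(primary: PageParser, secondary: PageParser) -> PageParser:
    """Build a parser that uses `secondary` only when `primary` finds no text."""

    def parser(pdf_path: str, page_number: int, **kwargs) -> str:
        text = primary(pdf_path, page_number)
        if text.strip():
            return text
        # Keyword arguments are forwarded to the fallback parser only,
        # an arbitrary choice made for this sketch.
        return secondary(pdf_path, page_number, **kwargs)

    return parser


# Stand-ins so the sketch runs without pdfminer or pytesseract installed.
def fast_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    return "" if page_number == 2 else f"text layer of page {page_number}"


def ocr_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    return f"OCR text of page {page_number}"


combined = with_fallback(fast_parser, ocr_parser)
print(combined("example.pdf", 1))  # → text layer of page 1
print(combined("example.pdf", 2))  # → OCR text of page 2
```

In real use the stand-ins would be replaced by the `parser` functions from `spacypdfreader.parsers.pdfminer` and `spacypdfreader.parsers.pytesseract`. Whether `pdf_reader` accepts such a wrapper is an assumption here, based on the examples that pass a parser function to it.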

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pdf_path` | `str` | Path to a PDF file. | required |
| `page_number` | `int` | The page number of the PDF to convert to text. Uses one-based indexing (e.g. the first page of the PDF is page 1, not page 0). | required |
| `**kwargs` | | `**kwargs` will be passed to pytesseract.image_to_string. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| | `str` | The PDF page as a string. |

Examples:

To use pytesseract, import its parser explicitly and pass it to the pdf_reader function.

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)

For finer control, you can pass additional parameters through to pytesseract.

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>> from spacypdfreader.parsers.pytesseract import parser
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> params = {"nice": 1}
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)

Info

See the pytesseract section in the docs for details on how spacypdfreader uses pytesseract. For details on pytesseract itself, see the pytesseract docs.

Source code in spacypdfreader/parsers/pytesseract.py
def parser(pdf_path: str, page_number: int, **kwargs):
    """Convert a single PDF page to text using pytesseract.

    The pytesseract library has the highest accuracy of the PDF-to-text
    parsers included in spacypdfreader. Unlike the other parsers, it first
    converts the PDF page to an image and then runs an OCR engine on the image
    to extract the text. pytesseract gives the best quality but can be very
    slow compared to other parsers.

    Parameters:
        pdf_path: Path to a PDF file.
        page_number: The page number of the PDF to convert to text. Uses
            one-based indexing (e.g. the first page of the PDF is page 1, not
            page 0).
        **kwargs: `**kwargs` will be passed to
            [`pytesseract.image_to_string`](https://github.com/madmaze/pytesseract/blob/8fe7cd1faf4abc0946cb69813d535198772dbb6c/pytesseract/pytesseract.py#L409-L426).

    Returns:
        str: The PDF page as a string.

    Examples:
        To use pytesseract, import its parser explicitly and pass it to the
        `pdf_reader` function.

        >>> import spacy
        >>> from spacypdfreader import pdf_reader
        >>> from spacypdfreader.parsers.pytesseract import parser
        >>>
        >>> nlp = spacy.load("en_core_web_sm")
        >>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser)

        For finer control, you can pass additional parameters through to
        pytesseract.

        >>> import spacy
        >>> from spacypdfreader import pdf_reader
        >>> from spacypdfreader.parsers.pytesseract import parser
        >>>
        >>> nlp = spacy.load("en_core_web_sm")
        >>> params = {"nice": 1}
        >>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, parser, **params)

    Info:
        See the [pytesseract section](/parsers/#pytesseract) in the docs for
        details on how spacypdfreader uses pytesseract. For details on
        pytesseract itself, see the [pytesseract docs](https://github.com/madmaze/pytesseract).
    """

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Convert pdf page to image.
        file_name = convert_from_path(
            pdf_path=pdf_path,
            output_folder=tmp_dir,
            paths_only=True,
            first_page=page_number,
            last_page=page_number,  # inclusive bound, so only this page is rendered
        )[0]

        # Convert images to text.
        file_path = Path(tmp_dir, str(file_name))
        text = str(image_to_string(Image.open(file_path), **kwargs))

    return text
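
Because each call to a parser handles exactly one page, extracting several pages means one call per page. The sketch below shows that loop over an inclusive, one-based page range. `extract_pages` and `fake_parser` are hypothetical names used for illustration only; they are not part of spacypdfreader, and the stand-in parser exists so the example runs without pytesseract installed.

```python
from typing import Callable, Dict


def extract_pages(
    parser: Callable[..., str],
    pdf_path: str,
    first_page: int,
    last_page: int,
    **kwargs,
) -> Dict[int, str]:
    """Call a single-page parser for every page in an inclusive, one-based range."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("Expected 1 <= first_page <= last_page.")
    # Map each one-based page number to the text returned for that page.
    return {
        number: parser(pdf_path, number, **kwargs)
        for number in range(first_page, last_page + 1)
    }


# Stand-in parser with the same signature as the parsers documented here.
def fake_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    return f"page {page_number} of {pdf_path}"


print(extract_pages(fake_parser, "example.pdf", 1, 3))
# → {1: 'page 1 of example.pdf', 2: 'page 2 of example.pdf', 3: 'page 3 of example.pdf'}
```

With OCR each page is rendered and recognised separately, so the total time of such a loop grows with the number of pages requested.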