spaCy custom extensions

When using spacypdfreader.spacypdfreader.pdf_reader, custom attributes and methods are added to spaCy objects.

spacy.Doc

Extension attributes

| Extension | Type | Description |
| --- | --- | --- |
| doc._.pdf_file_name | str | The file name of the PDF document. |
| doc._.first_page | int | The first page number of the PDF. |
| doc._.last_page | int | The last page number of the PDF. |
| doc._.page_range | (int, int) | The range of pages from the PDF. |
| doc._.page(int) | spacy.Span | Returns the span of text from the given PDF page. |
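
As a rough sketch of how these attributes might be read together, the helper below collects them into a dictionary. The names `summarize_pdf_doc` and `read_pdf` are hypothetical and exist only for illustration, and the call shape `pdf_reader(pdf_path, nlp)` is an assumption based on the module path above. The stand-in object at the end only mimics the attribute layout of a Doc so the helper can be shown without a real PDF.

```python
from types import SimpleNamespace


def summarize_pdf_doc(doc):
    """Collect the PDF-related extension attributes of a Doc into a dict."""
    ext = doc._
    return {
        "pdf_file_name": ext.pdf_file_name,
        "first_page": ext.first_page,
        "last_page": ext.last_page,
        "page_range": ext.page_range,
    }


def read_pdf(pdf_path):
    """Load a PDF into a Doc (assumed call shape: pdf_reader(pdf_path, nlp))."""
    # Imported lazily so summarize_pdf_doc works without these packages installed.
    import spacy
    from spacypdfreader.spacypdfreader import pdf_reader

    nlp = spacy.blank("en")  # tokenizer-only English pipeline
    return pdf_reader(pdf_path, nlp)


# Stand-in with the same attribute layout as a Doc; the values are made up.
fake_doc = SimpleNamespace(
    _=SimpleNamespace(
        pdf_file_name="example.pdf",
        first_page=1,
        last_page=4,
        page_range=(1, 4),
    )
)
summary = summarize_pdf_doc(fake_doc)
```

With a real file, `summarize_pdf_doc(read_pdf("path/to/file.pdf"))` would be the equivalent call, where the path is a placeholder.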

Extension methods

Doc._.page

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| page_number | int | The PDF page number of the doc to filter on. | required |

Returns:

| Type | Description |
| --- | --- |
| spacy.Span | The span of text from the corresponding PDF page number. |
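
The relationship between a page number and the returned span can be modeled in plain Python. This is a minimal sketch, not the library's implementation: `page_bounds` is a hypothetical helper, the token list and page numbers are made-up data, and it assumes the tokens of a page are contiguous in the document.

```python
def page_bounds(page_numbers, page_number):
    """Return (start, end) token indices for one page, with end exclusive."""
    indices = [i for i, p in enumerate(page_numbers) if p == page_number]
    if not indices:
        raise ValueError(f"page {page_number} not found")
    return indices[0], indices[-1] + 1


# Made-up tokens and the page each one came from (pages start at 1).
tokens = ["Hello", "world", ".", "Second", "page", "text"]
pages = [1, 1, 1, 2, 2, 2]

start, end = page_bounds(pages, 2)
page_two = tokens[start:end]  # the slice plays the role of the returned span
```

In this model, slicing the token list by the computed bounds corresponds to what `doc._.page(2)` returns as a span.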

spacy.Token

Extension attributes

| Extension | Type | Description |
| --- | --- | --- |
| token._.page_number | int | The PDF page number from which the token was extracted. The first page is 1. |
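
One possible use of this attribute is counting tokens per page. The sketch below assumes only that each token exposes `._.page_number` as described above. `tokens_per_page` is a hypothetical helper, and the stand-in tokens are built by hand so the example runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def tokens_per_page(tokens):
    """Count how many tokens were extracted from each PDF page."""
    return Counter(token._.page_number for token in tokens)


def make_token(text, page_number):
    """Build a stand-in token exposing only the attributes used here."""
    return SimpleNamespace(text=text, _=SimpleNamespace(page_number=page_number))


# Made-up tokens: three from page 1, two from page 2.
sample_tokens = [
    make_token("Hello", 1),
    make_token("world", 1),
    make_token(".", 1),
    make_token("Next", 2),
    make_token("page", 2),
]
counts = tokens_per_page(sample_tokens)
```

Because iterating over a Doc yields its tokens, `tokens_per_page(doc)` should work the same way on a Doc produced by `pdf_reader`.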