API Reference#

ocrpy.io#

The ocrpy.io module provides utilities for reading and writing data to and from various types of storage. It lets you read and write image (.png and .jpg) or PDF files from various cloud storage providers such as Amazon S3, Google Cloud Storage and Azure Blob Storage, or from the local file system.

These functionalities are primarily exposed through the ocrpy.io.reader.DocumentReader and ocrpy.io.writer.StorageWriter classes, and they are intended to be used along with the various parsers we support, which are exposed through the ocrpy.parsers.text.text_parser.TextParser class.

class ocrpy.io.reader.DocumentReader(file: str, credentials: Optional[str] = None)#

Reads an image or a pdf file from a local or remote location.

Note: Currently supports Google Cloud Storage and Amazon S3 remote files.

file : str

The path to the file to be read.

credentials : Optional[str]

The path to the credentials file.

Note: If the remote storage is AWS S3, the credentials file must be in .env format. If the remote storage is Google Cloud Storage, the credentials file must be in .json format.

DocumentReader.read() → Union[Generator, ByteString]#

Reads the file from a local or remote location and returns the data as a byte string for an image or as a generator of byte strings for a PDF.

data : Union[bytes, Generator]

The data as a byte string for an image or as a generator of byte strings for a PDF.
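For example, a reader might be created for a local image and for a PDF on S3 like this (the file names, bucket and credential file below are placeholders):

```python
from ocrpy.io.reader import DocumentReader

# Local image: read() returns the file contents as a single byte string.
image_reader = DocumentReader(file="invoice.png")
image_bytes = image_reader.read()

# PDF on S3: credentials point to a .env file, and read() returns a
# generator yielding one byte string per page.
pdf_reader = DocumentReader(
    file="s3://my-bucket/reports/annual.pdf",
    credentials="aws-credentials.env",
)
for page_bytes in pdf_reader.read():
    ...  # process each page
```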

class ocrpy.io.writer.StorageWriter(credentials: Optional[str] = None)#

Writes parser output to a given location (supports writes to local storage, Amazon S3 and Google Cloud Storage).

credentials : Optional[str]

default: None

The credentials to use for the selected storage location.

Note: If the storage location is AWS S3, the credentials file must be in .env format. If the storage location is Google Cloud Storage, the credentials file must be in .json format.

StorageWriter.write(data: Dict, file: str) → None#

Writes the parser output to a given location (supports writes to local storage, Amazon S3 and Google Cloud Storage).

data : Dict

The data to be written.

file : str

Filename or path of the file to be written.

None

ocrpy.parsers#

The ocrpy.parsers module provides a high-level interface for parsing text and tables from various types of documents using the backends we support. Currently, we support the following parsers:

class ocrpy.parsers.text.text_parser.TextParser(credentials: Optional[str] = None, backend: str = 'pytesseract')#

High-level interface for multiple text OCR backends. Note: Currently only supports Pytesseract, Google Cloud Vision and Amazon Textract.

backend : str

The name of the backend to use. default: “pytesseract” options: “pytesseract”, “aws-textract”, “google-cloud-vision”

credentials : Optional[str]

The credentials to use for the selected backend. default: None

TextParser.parse(reader: DocumentReader) → Dict#

Parse the data from a given reader. The reader should be an instance of ocrpy.io.reader.DocumentReader.

reader : DocumentReader

The reader to parse the data from.

data : Dict

The parsed data.
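Putting the pieces together, a minimal read-parse-write sketch might look like this (the file names are placeholders, and the pytesseract backend is assumed to be installed):

```python
from ocrpy.io.reader import DocumentReader
from ocrpy.io.writer import StorageWriter
from ocrpy.parsers.text.text_parser import TextParser

reader = DocumentReader(file="scanned-letter.jpg")
parser = TextParser(backend="pytesseract")  # no credentials needed for pytesseract
writer = StorageWriter()

ocr_result = parser.parse(reader)  # dictionary of parsed text
writer.write(ocr_result, "output/scanned-letter.json")
```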

class ocrpy.parsers.table.table_parser.TableParser(credentials: Optional[str] = None, backend: str = 'aws-textract')#

High-level interface for multiple table OCR backends. Note: Currently only supports Amazon Textract.

backend : str

The name of the backend to use. default: “aws-textract”

credentials : Optional[str]

The credentials to use for the selected backend.

TableParser.parse(reader: DocumentReader, attempt_csv_conversion: bool = False) → Union[List, Dict]#

Parse the document and extract the tables.

reader : DocumentReader

The reader to parse the data from.

attempt_csv_conversion : bool

If True, attempt to convert the tables to CSV. default: False

tables : Union[List, Dict]

Note: Returns a list of lists if attempt_csv_conversion is False; otherwise returns a dictionary of pandas DataFrames, where each item represents an individual table in the document.
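For table extraction with the aws-textract backend, a usage sketch might look like this (the document and credential paths are placeholders):

```python
from ocrpy.io.reader import DocumentReader
from ocrpy.parsers.table.table_parser import TableParser

reader = DocumentReader(file="financial-report.pdf", credentials="aws-credentials.env")
parser = TableParser(credentials="aws-credentials.env")

# Raw output: a list of tables, each a list of rows.
tables = parser.parse(reader)

# With CSV conversion: a dictionary mapping table index to a pandas DataFrame.
frames = parser.parse(reader, attempt_csv_conversion=True)
```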

ocrpy.parsers.text#

class ocrpy.parsers.text.aws_text.AwsTextOCR(reader: Any, credentials: str)#

AWS Textract OCR Engine

reader: Any

Reader object that can be used to read the document.

credentials : str

Path to the credentials file. Note: The credentials file must be in .env format.

AwsTextOCR.parse() → Dict[int, Dict]#

Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.

parsed_data : dict

Dictionary of pages.

class ocrpy.parsers.text.gcp_text.GcpTextOCR(reader: Any, credentials: str)#

Google Cloud Vision OCR Engine

reader : Any

Reader object that can be used to read the document.

credentials : str

Path to the credentials file. Note: The credentials file must be in .json format.

GcpTextOCR.parse()#

Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.

parsed_data : dict

Dictionary of pages.

class ocrpy.parsers.text.tesseract_text.TesseractTextOCR(reader: Any, credentials=None)#

PyTesseract OCR Engine

reader: Any

Reader object that can be used to read the document.

credentials : None

Credentials for the OCR engine (if any).

TesseractTextOCR.parse()#

Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.

parsed_data : dict

Dictionary of pages.

ocrpy.parsers.table#

class ocrpy.parsers.table.aws_table.AwsTableOCR(reader: Any, credentials: Optional[str] = None)#

AWS Table Parser - This parser uses AWS Textract to analyze the document and extract the tables.

credentials : Optional[str]

Path to the credentials file. Note: The credentials file must be in .env format.

AwsTableOCR.parse() → List[List]#

Parse the document and extract the tables.

tables : List[List]

List of tables, each represented as a list of rows.

ocrpy.parsers.table.aws_table.table_to_csv(table_data: List[List]) → Dict[int, DataFrame]#

Convert the extracted table data to pandas DataFrames (one per table).

table_data : List[List]

Table data extracted from the document using the parser.

table_data : Dict[int, pd.DataFrame]

Table data as a dictionary of pandas DataFrames, keyed by table index.
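To illustrate the shape of this conversion, here is a standalone sketch (not ocrpy's actual implementation) that treats each table's first row as its header:

```python
from typing import Dict, List

import pandas as pd


def tables_to_frames(table_data: List[List]) -> Dict[int, pd.DataFrame]:
    """Convert a list of tables (each a list of rows) into DataFrames."""
    frames = {}
    for index, table in enumerate(table_data):
        header, *rows = table  # assume the first row holds the column names
        frames[index] = pd.DataFrame(rows, columns=header)
    return frames
```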

ocrpy.pipelines#

The ocrpy.pipelines module provides a set of high-level classes that wrap the different readers, writers and parser backends, letting you run OCR on collections of documents in remote or local storage and write the results to remote or local storage of your choice.

Alternatively, it also lets you run OCR on document collections and index the results to a database or search backend of your choice.

class ocrpy.pipelines.config.PipelineConfig(config_path)#

OCR Pipeline Configuration container.

config_path : str

Path to the ocrpy config file.

A sample config file looks like this:

storage_config:
    source_dir: /path/to/source/dir
    destination_dir: /path/to/destination/dir
parser_config:
    parser_backend: pytesseract
cloud_credentials:
    aws: /path/to/aws-credentials.env
    gcp: /path/to/gcp-credentials.json

class ocrpy.pipelines.text_pipeline.TextOcrPipeline(source_dir: str, destination_dir: str, parser_backend: str = 'pytesseract', credentials_config=None)#

TextOcrPipeline provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend.

Note: Supported parser backends are aws-textract, google-cloud-vision and pytesseract.

source_dir : str

Path to the directory containing the documents to be processed.

destination_dir : str

Path to the directory where the processed documents will be stored.

parser_backend : str

The parser backend to be used for processing. default: “pytesseract” options: “aws-textract”, “google-cloud-vision”, “pytesseract”

credentials_config : dict

A dictionary containing the credentials for the parser or cloud storage backends. default: None example: {“AWS”: “aws-credentials.env”, “GCP”: “gcp-credentials.json”}

classmethod TextOcrPipeline.from_config(config_path: str) → TextOcrPipeline#

Allows you to create a TextOcrPipeline from a config file (config file format: YAML).

config_path : str

Path to the config file.

TextOcrPipeline

A TextOcrPipeline object.

A sample config file looks like this:

storage_config:
    source_dir: /path/to/source/dir
    destination_dir: /path/to/destination/dir
parser_config:
    parser_backend: pytesseract
cloud_credentials:
    aws: /path/to/aws-credentials.env
    gcp: /path/to/gcp-credentials.json

property TextOcrPipeline.pipeline_config#
TextOcrPipeline.process() → None#

Runs the pipeline with the given configuration and writes the output to the destination directory.

Returns:

None
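For example, a pipeline can be constructed directly or loaded from a YAML config file (the directories and config path below are placeholders):

```python
from ocrpy.pipelines.text_pipeline import TextOcrPipeline

# Direct construction with local directories and the default backend.
pipeline = TextOcrPipeline(
    source_dir="documents/",
    destination_dir="ocr-output/",
    parser_backend="pytesseract",
)
pipeline.process()

# The same pipeline built from a YAML config file.
pipeline = TextOcrPipeline.from_config("ocrpy_config.yaml")
pipeline.process()
```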

class ocrpy.pipelines.index_pipeline.TextOcrIndexPipeline(source_dir: str, destination_dir: str, parser_backend: str = 'pytesseract', credentials_config=None, database_backend: str = 'sql', database_config: Optional[Dict] = None)#

TextOcrIndexPipeline provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend, and then index the results to a database backend of your choice.

Note: Supported parser backends are aws-textract, google-cloud-vision and pytesseract; the supported database/search backends are sql, opensearch and elasticsearch.

database_backend : str

The database backend to be used for indexing. default: “sql” options: “sql”, “opensearch”, “elasticsearch”

database_config : dict

A dictionary containing the credentials and other params for the database backend. default: None example: {“sql”: {“db_url”: “sqlite:///test.db”, “db_table”: “test”}}

classmethod TextOcrIndexPipeline.from_config(config_path: str) → TextOcrIndexPipeline#

Allows you to create a TextOcrIndexPipeline from a config file (config file format: YAML).

config_path : str

Path to the config file.

TextOcrIndexPipeline

A TextOcrIndexPipeline object.

A sample config file looks like this:

storage_config:
    source_dir: /path/to/source/dir
    destination_dir: /path/to/destination/dir
parser_config:
    parser_backend: pytesseract
cloud_credentials:
    aws: /path/to/aws-credentials.env
    gcp: /path/to/gcp-credentials.json

property TextOcrIndexPipeline.pipeline_config#
TextOcrIndexPipeline.process()#

Runs the pipeline with the given configuration and writes the output to the destination directory.

Returns:

None

ocrpy.experimental#

The ocrpy.experimental module contains experimental features that are not yet stable and may change or be removed in future releases.

Currently it exposes the ocrpy.experimental.document_classifier.DocumentClassifier class, which can be used to classify documents into various categories, and the ocrpy.experimental.layout_parser.DocumentLayoutParser class, which can be used to identify different components of a document such as text, titles, tables and figures.

These can be used alongside the OCR pipelines as preprocessing utilities, to identify different types of documents and their layouts and launch the appropriate OCR pipelines for custom processing.

class ocrpy.experimental.document_classifier.DocumentClassifier(model_name: str = 'microsoft/dit-base-finetuned-rvlcdip')#

DocumentClassifier utilises a fine-tuned Document Image Transformer (DiT) model to predict the document type. The default model is a dit-base model trained on the RVL-CDIP dataset.

You can also choose alternate models available on the Hugging Face Model Hub at https://huggingface.co/models.

model_name : str

The name of the model to use. Should be a valid model name from the Hugging Face Model Hub.

default: “microsoft/dit-base-finetuned-rvlcdip”

  • The model is trained on the RVL-CDIP dataset and can identify the following document types:

letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, memo

DocumentClassifier.predict(reader: DocumentReader) → List#

Predict the document type of the document in the reader.

reader : DocumentReader

The reader containing the document to predict.

predicted_labels: List

A list of predicted document types.

Each predicted type is one of the following: letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, memo.

class ocrpy.experimental.layout_parser.DocumentLayoutParser(model_name: str = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config')#

DocumentLayoutParser utilises a fine-tuned PaddleDetection model to detect the layout of the document. The default model is a PP-YOLOv2-based model trained on the PubLayNet dataset.

You can also choose alternate models available in the Layout Parser model zoo at https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html

model_name : str

The name of the model to use. Should be a valid model name from the Layout Parser model zoo.

default: “lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config”

  • The model is trained on the PubLayNet dataset and can detect the following blocks in a document:
    • text, title, list, table, figure

  • For more information on the dataset, please refer to this paper: https://arxiv.org/abs/1908.07836

DocumentLayoutParser.parse(reader: DocumentReader, ocr: TextParser) → List#

Parses the layout of the document in the reader.

reader : DocumentReader

The reader containing the document to parse.

ocr : TextParser

The parser to use for OCR.

ocr_result : dict

The OCR result, with each block updated with layout information.

Each block is one of the following types: text, title, list, table, figure.

ocrpy.utils#

The ocrpy.utils module provides various helper functions used by the other modules in the package.

ocrpy.utils.utils.guess_extension(file_path: str) → str#

Guesses the file extension of the file.

file_path : str

Path to the file.

extension_type : str

File extension.

ocrpy.utils.utils.guess_storage(file_path: str) → str#

Guesses the storage type of the file.

file_path : str

Path to the file.

storage_type : str

Storage type.
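As a rough illustration of what these helpers do, here is a standalone re-implementation sketch (not ocrpy's actual code, and the exact return values are assumptions): the extension is taken from the path suffix and the storage type from the URI scheme:

```python
import os


def guess_extension(file_path: str) -> str:
    # "docs/report.PDF" -> "pdf"
    return os.path.splitext(file_path)[1].lstrip(".").lower()


def guess_storage(file_path: str) -> str:
    # Classify by URI scheme; bare paths are treated as local files.
    if file_path.startswith("s3://"):
        return "s3"
    if file_path.startswith("gs://"):
        return "gs"
    return "local"
```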