Overview

At its core, ocrpy is a Python library that reads PDF and image documents from cloud storage services or a local file system, performs a set of operations on them (identifying the document type, parsing the document layout, and extracting text and/or tables), and then writes the results to cloud storage, the local file system, or a database.

Ocrpy Library Overview

Ocrpy provides five core APIs for performing these operations:

1. Readers - Read data from cloud storage or the local file system and return a document object.
2. Parsers - Parse the document layout, extract text and/or tables from the document, and identify the document type.
3. Classifiers - Classify documents into types such as invoice or receipt, and identify the layout components within those documents.
4. Writers - Write the results to cloud storage, the local file system, or a database.
5. Pipelines - A high-level abstraction that combines readers, parsers and writers to perform these operations on a collection of documents in a single call.

Readers

Readers let you read data from cloud storage or the local file system and return a standard document object. We currently support the following readers:

Reader	Description	Supported File Formats	Storage Type
S3Reader	Reads data from Amazon S3 storage	PDF, JPG and PNG	AWS S3
GcsReader	Reads data from Google Cloud Storage	PDF, JPG and PNG	Google Cloud Storage
AzureBlobReader	Reads data from Azure Blob Storage	PDF, JPG and PNG	Azure Blob Storage
LocalFileReader	Reads data from local file system	PDF, JPG and PNG	Local File System
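
To illustrate how a location string might map onto the readers above, here is a minimal sketch. The helpers `pick_reader` and `is_supported`, and the URI conventions they rely on (`s3://`, `gs://`, hosts ending in `.blob.core.windows.net`), are assumptions made for this example and are not part of the ocrpy API. Only the reader names and file formats come from the table.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File formats listed in the readers table.
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".png"}


def pick_reader(location: str) -> str:
    """Return the name of the reader that would handle `location` (hypothetical rule set)."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        return "S3Reader"
    if parsed.scheme == "gs":
        return "GcsReader"
    if parsed.scheme in ("http", "https") and parsed.netloc.endswith(".blob.core.windows.net"):
        return "AzureBlobReader"
    # Anything else is treated as a path on the local file system.
    return "LocalFileReader"


def is_supported(location: str) -> bool:
    """Check the file extension against the formats the readers accept."""
    suffix = PurePosixPath(urlparse(location).path).suffix.lower()
    return suffix in SUPPORTED_SUFFIXES
```

A caller could run `is_supported` first to reject unsupported files early, then use `pick_reader` to decide which reader to construct.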

Parsers

Parsers extract text and/or tables from PDFs and images and return them in a standardised format. We currently support the following parsers:

Parser	Description	Extracts	Implementation
AwsTextOcr	Extracts text from PDFs and images using the Amazon Textract OCR service	Text	Supported
GcpTextOcr	Extracts text from PDFs and images using the Google Cloud Vision OCR service	Text	Supported
AwsTableOcr	Extracts tables from PDFs and images using the Amazon Textract OCR service	Table	Supported
TesseractTextOcr	Extracts text from PDFs and images using the Tesseract OCR library	Text	Supported
AzureTextOcr	Extracts text from PDFs and images using an Azure OCR service	Text	WIP
AzureTableOcr	Extracts tables from PDFs and images using an Azure OCR service	Table	WIP
GcpTableOcr	Extracts tables from PDFs and images using the Google Cloud Vision OCR service	Table	WIP
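
The table above can be read as a lookup from a provider and an extraction target to a parser, where parsers marked WIP are not yet usable. The sketch below encodes that reading. The `PARSERS` mapping and the `pick_parser` function are hypothetical names introduced for this example, not ocrpy's own backend selection mechanism.

```python
# (provider, what to extract) -> (parser name, implementation status), copied from the table.
PARSERS = {
    ("aws", "text"): ("AwsTextOcr", "Supported"),
    ("gcp", "text"): ("GcpTextOcr", "Supported"),
    ("aws", "table"): ("AwsTableOcr", "Supported"),
    ("tesseract", "text"): ("TesseractTextOcr", "Supported"),
    ("azure", "text"): ("AzureTextOcr", "WIP"),
    ("azure", "table"): ("AzureTableOcr", "WIP"),
    ("gcp", "table"): ("GcpTableOcr", "WIP"),
}


def pick_parser(provider: str, extracts: str) -> str:
    """Return the parser name for a provider/target pair, refusing WIP parsers."""
    key = (provider.lower(), extracts.lower())
    if key not in PARSERS:
        raise KeyError(f"no parser listed for {key}")
    name, status = PARSERS[key]
    if status != "Supported":
        raise NotImplementedError(f"{name} is marked {status}")
    return name
```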

Classifiers

Classifiers identify the type of a document and additionally identify the different layout components within a document or page. They integrate with Hugging Face's transformers library to perform these operations. We currently support the following classifiers:

Classifier	Description	Backbone Model
DocumentClassifier	Classifies a given document into 16 different categories (e.g. invoice, research_paper, …)	microsoft/dit-base-finetuned-rvlcdip
LayoutClassifier	Detects and classifies the layout of a given document (e.g. header, footer, table, image, …)	PubLayNet/ppyolov2_r50vd_dcn_365e
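
To give a feel for what document classification with the backbone model above looks like, here is a minimal sketch that calls the Hugging Face transformers `pipeline` API directly. It does not use ocrpy's `DocumentClassifier`, whose interface is not shown here. The function names `classify_document` and `best_label` are hypothetical. Running `classify_document` assumes that transformers, PyTorch and Pillow are installed and that the model can be downloaded.

```python
from typing import Dict, List


def classify_document(image_path: str, top_k: int = 3) -> List[Dict]:
    """Classify a page image with the DiT model fine-tuned on RVL-CDIP."""
    # Imported lazily so that best_label() below works without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "image-classification",
        model="microsoft/dit-base-finetuned-rvlcdip",
    )
    # Returns a list of {"label": ..., "score": ...} dicts.
    return classifier(image_path, top_k=top_k)


def best_label(predictions: List[Dict]) -> str:
    """Pick the label with the highest score from the pipeline output."""
    return max(predictions, key=lambda p: p["score"])["label"]
```

For example, `best_label(classify_document("scan.png"))` would return the single most likely category for a hypothetical local image `scan.png`.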

Writers

Writers consume the output of parsers and classifiers and write the results to cloud storage, the local file system, or a database. We currently support the following writers:

Writer	Description	Supported File Formats	Storage Type
S3Writer	Writes data to Amazon S3 storage	JSON	AWS S3
GcsWriter	Writes data to Google Cloud Storage	JSON	Google Cloud Storage
AzureBlobWriter	Writes data to Azure Blob Storage	JSON	Azure Blob Storage
LocalFileWriter	Writes data to local file system	JSON	Local File System
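
Since every writer emits JSON, the local case can be pictured with a few lines of standard-library Python. This is a sketch only: the `write_result` function and the shape of the result dictionary are assumptions for illustration, and ocrpy's actual output schema may differ.

```python
import json
from pathlib import Path


def write_result(result: dict, output_path: str) -> Path:
    """Serialise a parsed result to a JSON file, creating parent folders as needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII OCR text readable in the file.
        json.dump(result, handle, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    # Hypothetical result shape: source file name plus extracted text per page.
    example = {"source": "scan.png", "pages": {"0": "Invoice No. 42"}}
    print(write_result(example, "output/scan.json"))
```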

Pipelines

Pipelines combine readers, parsers, classifiers and writers and work on a collection of documents in a single call. We currently support the following default pipelines:

Pipeline	Description
TextOcrPipeline	Provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend.
TextOcrIndexPipeline	Provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend, and then indexes the results in a database backend of your choice.
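
The idea of chaining a reader, a parser and a writer over a collection of documents in one call can be sketched as follows. The `Pipeline` class here is a hypothetical stand-in written for this example, with plain callables in place of real readers, parsers and writers. It is not the implementation of `TextOcrPipeline`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Chains read -> parse -> write for each document location."""

    read: Callable[[str], bytes]        # stand-in for a reader
    parse: Callable[[bytes], str]       # stand-in for a parser backend
    write: Callable[[str, str], None]   # stand-in for a writer

    def process(self, locations: Iterable[str]) -> List[str]:
        """Run every stage on every location and return the processed locations."""
        processed = []
        for location in locations:
            raw = self.read(location)
            text = self.parse(raw)
            self.write(location, text)
            processed.append(location)
        return processed


if __name__ == "__main__":
    # In-memory stand-ins: a dict as "storage", upper-casing as the "OCR" step.
    storage = {"a.png": b"hello", "b.pdf": b"world"}
    results = {}
    pipeline = Pipeline(
        read=storage.__getitem__,
        parse=lambda raw: raw.decode("utf-8").upper(),
        write=results.__setitem__,
    )
    pipeline.process(["a.png", "b.pdf"])
    print(results)
```

Keeping each stage behind a simple callable is what makes the parser backend swappable: replacing `parse` changes the OCR engine without touching the reading or writing code.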