Examples#

Whether you are working with images or PDFs, you can use ocrpy to extract text or tables from them. You can also automatically classify a document as an invoice, research paper or any other document type and process it accordingly, with various types of parsers (cloud/open-source) and classifiers available.

Ocrpy lets you do most of this with just a few lines of code! Here are some examples of using ocrpy at different levels of abstraction.

Text extraction with ocrpy.pipelines#

First, let's use ocrpy at its highest level of abstraction, the ocrpy.pipelines API.

The Pipeline API can be invoked in two ways. The first is to define the config for running the pipeline as a YAML file and then run the pipeline by loading it as follows:

from ocrpy import TextOcrPipeline

PIPELINE_CONFIG_PATH = "ocrpy_config.yaml" # path to the pipeline config file
ocr_pipeline = TextOcrPipeline.from_config(PIPELINE_CONFIG_PATH)
ocr_pipeline.process()

Alternatively, you can run a pipeline by directly instantiating the pipeline class as follows:

from ocrpy import TextOcrPipeline

SOURCE = 's3://document_bucket/' # s3 bucket or local directory or gcs bucket with your documents.
DESTINATION = 'gs://processed_document_bucket/outputs/' # s3 bucket or local directory or gcs bucket to write the processed documents.
PARSER = 'aws-textract' # or 'google-cloud-vision' or 'pytesseract'
CREDENTIALS = {"AWS": "path/to/aws-credentials.env/file",
               "GCP": "path/to/gcp-credentials.json/file"} # optional - if you are using any cloud service.

pipeline = TextOcrPipeline(source_dir=SOURCE, destination_dir=DESTINATION,
                           parser_backend=PARSER, credentials_config=CREDENTIALS)
pipeline.process()

Essentially, the pipeline classes let you process collections of documents in a single step. As shown in the examples above, the pipeline class (TextOcrPipeline here) expects you to define only four parameters - source_dir, destination_dir, parser_backend and credentials_config - and takes care of the rest.

Note

Credentials are optional and need to be provided only if you are using a cloud service such as GCS, S3, Google Cloud Vision or AWS Textract. Otherwise, you can set credentials_config to None, or, if using a config file, leave the cloud_credentials.aws and cloud_credentials.gcp fields empty.
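For example, a fully local run with the open-source pytesseract backend needs no credentials at all. Here is a minimal sketch (the directory paths are placeholders):

from ocrpy import TextOcrPipeline

# A fully local pipeline: both directories are placeholder paths on disk.
pipeline = TextOcrPipeline(source_dir="documents/",
                           destination_dir="outputs/",
                           parser_backend="pytesseract",
                           credentials_config=None)  # no cloud service involved
pipeline.process()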

If you are defining the pipeline config as a YAML file, here is what a sample config file should look like:

storage_config:
    source_dir: "s3://document_bucket/"
    destination_dir: "gs://processed_document_bucket/outputs/"

parser_config:
    parser_backend: "aws-textract"

cloud_credentials:
    aws: ".credentials/aws-credentials.env"
    gcp: ".credentials/gcp-credentials.json"

Text extraction with ocrpy.io.reader, ocrpy.io.writer and ocrpy.parsers#

If you prefer to use ocrpy at its lowest level of abstraction, you can do so via the ocrpy.io.reader, ocrpy.io.writer and ocrpy.parsers APIs.

Let's look at an example of doing so.

from ocrpy import DocumentReader, StorageWriter, TextParser

DOC_PATH = 's3://document_bucket/example_document.pdf' # path to an image or pdf file on s3 bucket, gcs bucket or local directory.
AWS_CREDENTIALS = ".credentials/aws-credentials.env" # path to the aws credentials file.
GCP_CREDENTIALS = ".credentials/gcp-credentials.json" # path to the gcp credentials file.
PARSER_BACKEND = 'pytesseract' # or 'google-cloud-vision' or 'aws-textract'

reader = DocumentReader(file=DOC_PATH, credentials=AWS_CREDENTIALS)
writer = StorageWriter()
text_parser = TextParser(credentials=GCP_CREDENTIALS, backend=PARSER_BACKEND)

parsed_text = text_parser.parse(reader) # parse the document using the selected parser backend.
writer.write(parsed_text, "test_output/s3_output/sample-gcp.json") # write the parsed text to a file on s3 bucket, gcs bucket or local directory.

In the above example, we imported ocrpy's DocumentReader, StorageWriter & TextParser and used them to parse a document stored in an S3 bucket with the 'pytesseract' parser backend, then wrote the parsed text to a file in a local directory.

Similarly, you can read a document from your local directory, a GCS bucket or an S3 bucket, parse it with any of the parser backends we currently support, and write the parsed text to an S3 bucket, GCS bucket or local directory as well. For instance, a fully local variant might look like the sketch below.
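A minimal sketch of reading a local file with the open-source pytesseract backend and writing the result to a local directory. The file paths are placeholders, and it assumes the optional credentials argument can simply be omitted when no cloud service is involved:

from ocrpy import DocumentReader, StorageWriter, TextParser

reader = DocumentReader(file="documents/example_document.pdf")  # placeholder local path
text_parser = TextParser(backend="pytesseract")  # open-source backend; no credentials required
writer = StorageWriter()

parsed_text = text_parser.parse(reader)
writer.write(parsed_text, "outputs/example_document.json")  # placeholder local destination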

Table extraction with ocrpy.parsers.table#

Now let’s look at how to extract tables from a document using the ocrpy.parsers.table API.

from ocrpy import DocumentReader, StorageWriter, TableParser

DOC_PATH = '../documents/example_document_with_table.pdf' # path to an image or pdf file on s3 bucket, gcs bucket or local directory.
AWS_CREDENTIALS = ".credentials/aws-credentials.env" # path to the aws credentials file.

reader = DocumentReader(file=DOC_PATH)
table_parser = TableParser(credentials=AWS_CREDENTIALS)

parsed_table = table_parser.parse(reader, attempt_csv_conversion=False)
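Since StorageWriter is already imported, the parsed table can be persisted in the same way as the parsed text in the earlier example. A short sketch, assuming the table output can be written like any other parse result (the output path is a placeholder):

writer = StorageWriter()
writer.write(parsed_table, "outputs/parsed_table.json")  # placeholder local destination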

Note

ocrpy currently supports table extraction only with the aws-textract parser backend. Support for other parser backends will be added soon.

Classify documents with ocrpy.experimental.document_classifier#

In this example, let's look at how you can use ocrpy to classify documents using the ocrpy.experimental.document_classifier API.

from ocrpy import DocumentReader
from ocrpy.experimental import DocumentClassifier

DOC_PATH = '../documents/document.img' # path to an image or pdf file on s3 bucket, gcs bucket or local directory.

reader = DocumentReader(file=DOC_PATH)
classifier = DocumentClassifier()

doc_types = classifier.predict(reader)
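The prediction can then be inspected or used to route the document to an appropriate downstream parser. A sketch of what that might look like; the exact return format and label strings depend on the default model, so treat them as illustrative placeholders:

print(doc_types)  # inspect the raw predictions

# Route the document based on its predicted type; "invoice" is an
# illustrative label, not necessarily the model's exact class name.
if "invoice" in str(doc_types):
    pass  # e.g. hand off to an invoice-specific processing step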

Note

ocrpy uses Hugging Face's transformers library in the backend with a pretrained model to perform the classification. As such, please make sure you have the transformers library installed.

When you run this for the first time, it will download the pretrained model weights and store them in a local directory. Alternatively, you can download or use your own pretrained model weights. For more information on this, see the Hugging Face transformers library documentation.

For more information on the default model and the categories it classifies into, please refer to ocrpy.experimental.document_classifier.

Parse layout with ocrpy.experimental.layout_parser#

In this example, let's look at how you can use ocrpy to parse the layout of a document using the ocrpy.experimental.layout_parser API.

from ocrpy import DocumentReader, TextParser
from ocrpy.experimental import LayoutParser

DOC_PATH = '../documents/document.img' # path to an image or pdf file on s3 bucket, gcs bucket or local directory.

reader = DocumentReader(file=DOC_PATH)
text_parser = TextParser()
layout_parser = LayoutParser()

parsed_layout = layout_parser.parse(reader, text_parser)
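As with text and tables, the parsed layout can be persisted with StorageWriter. A short sketch, assuming the layout output can be written like any other parse result (the output path is a placeholder):

from ocrpy import StorageWriter

writer = StorageWriter()
writer.write(parsed_layout, "outputs/parsed_layout.json")  # placeholder local destination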

Note

ocrpy uses the LayoutParser library in the backend to perform the layout parsing. As such, please make sure you have the layoutparser library installed; if not, please install it from LayoutParser.

When you run this for the first time, it will download the pretrained model weights and store them in a local directory. Alternatively, you can download or use your own pretrained model weights; these can be obtained from the LayoutParser Model Catalog.