API Reference#
ocrpy.io#
The ocrpy.io module provides utilities for reading and writing data to and from various types of storage.
It lets you read and write image (.png and .jpg) or PDF files from various storage providers,
such as Amazon S3, Google Cloud Storage, Azure Blob Storage, or the local file system.
These functionalities are primarily exposed through the ocrpy.io.reader.DocumentReader
and ocrpy.io.writer.StorageWriter
classes, and they are intended to be used along with the various parsers we support,
which are exposed through the ocrpy.parsers.text.text_parser.TextParser class.
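As a quick orientation, the classes above compose as follows: a DocumentReader feeds a parser, and a StorageWriter persists the parser's output. The sketch below is illustrative only; the file paths and credentials are placeholders, and it assumes the classes behave exactly as documented in this reference.

```python
# Illustrative read -> parse -> write flow. The helper name and all paths
# are placeholders, not part of the ocrpy API.

def ocr_and_store(source_file: str, destination_file: str,
                  credentials: str = None) -> None:
    # Imports are deferred so this sketch can be inspected without
    # ocrpy (and its cloud dependencies) installed.
    from ocrpy.io.reader import DocumentReader
    from ocrpy.io.writer import StorageWriter
    from ocrpy.parsers.text.text_parser import TextParser

    reader = DocumentReader(file=source_file, credentials=credentials)
    parser = TextParser(backend="pytesseract")  # local backend, no credentials
    result = parser.parse(reader)               # parsed pages as a Dict
    StorageWriter().write(result, destination_file)
```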
- class ocrpy.io.reader.DocumentReader(file: str, credentials: Optional[str] = None)#
Reads an image or a pdf file from a local or remote location.
Note: Currently supports Google Cloud Storage and Amazon S3 remote files.
- file : str
The path to the file to be read.
- credentials : str
The path to the credentials file.
Note: If the remote storage is AWS S3, the credentials file must be in .env format. If the remote storage is Google Cloud Storage, the credentials file must be in .json format.
- DocumentReader.read() → Union[Generator, ByteString]#
Reads the file from a local or remote location and returns the data as a byte-string for an image or as a generator of byte-strings for a pdf.
- data : Union[bytes, List[bytes]]
The data as a byte-string for an image or as a generator of byte-strings for a pdf.
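Because read() yields bytes for a single image but a generator of byte-strings for a multi-page PDF, downstream code often normalizes the two shapes. A minimal, ocrpy-free sketch of that normalization (the helper name is ours, not part of the API):

```python
from typing import Iterable, List, Union

def as_page_list(data: Union[bytes, Iterable[bytes]]) -> List[bytes]:
    """Normalize DocumentReader.read()-style output to a list of per-page bytes."""
    if isinstance(data, (bytes, bytearray)):
        return [bytes(data)]                   # single image -> one "page"
    return [bytes(page) for page in data]      # PDF -> one entry per page

# For example, a generator of two fake "pages":
pages = as_page_list(p for p in (b"page-1", b"page-2"))
```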
- class ocrpy.io.writer.StorageWriter(credentials: Optional[str] = None)#
Writes parser output to a given location (supports writes to local storage, S3, and GS).
- credentials : Optional[str]
default: None
The credentials to use for the selected storage location.
Note: If the storage location is AWS S3, the credentials file must be in .env format. If the storage location is Google Cloud Storage, the credentials file must be in .json format.
- StorageWriter.write(data: Dict, file: str) → None#
Writes the parser output to a given location (supports writes to local storage, S3, and GS).
- data : Dict
The data to be written.
- file : str
Filename/path of the file to be written.
- Returns: None
ocrpy.parsers#
The ocrpy.parsers module provides a high-level interface for parsing text and tables from various types of documents
with the various backends we support. Currently, we support the following parsers:
ocrpy.parsers.text.text_parser.TextParser
- Parses text from various types of documents, such as images and PDFs, using Tesseract, AWS Textract, Azure, or Google Cloud Vision APIs.
ocrpy.parsers.table.table_parser.TableParser
- Parses tables from various types of documents, such as images and PDFs, using AWS Textract. (Table extraction with Google Cloud Vision, Azure, and other APIs is not supported yet; it will be added soon.)
- class ocrpy.parsers.text.text_parser.TextParser(credentials: Optional[str] = None, backend: str = 'pytesseract')#
High-level interface for multiple text OCR backends. Note: Currently only supports Pytesseract, Google Cloud Vision, and Amazon Textract.
- backend : str
The name of the backend to use. default: “pytesseract” options: “pytesseract”, “aws-textract”, “google-cloud-vision”
- credentials : Optional[str]
The credentials to use for the selected backend. default: None
- TextParser.parse(reader: DocumentReader) → Dict#
Parse the data from a given reader. The reader should be an instance of ocrpy.io.reader.DocumentReader.
- reader : DocumentReader
The reader to parse the data from.
- data : Dict
The parsed data.
- class ocrpy.parsers.table.table_parser.TableParser(credentials: Optional[str] = None, backend: str = 'aws-textract')#
High-level interface for multiple table OCR backends. Note: Currently only supports Amazon Textract.
- backend : str
The name of the backend to use. default: “aws-textract”
- credentials : str
The credentials to use for the selected backend.
- TableParser.parse(reader: DocumentReader, attempt_csv_conversion: bool = False) → Union[List, Dict]#
Parse the document and extract the tables.
- reader : DocumentReader
The reader to parse the data from.
- attempt_csv_conversion : bool
If True, attempt to convert the tables to CSV. default: False
- tables : Union[List, Dict]
Note: Returns a list of lists if attempt_csv_conversion is False; otherwise, returns a dictionary of pandas dataframes. (Each item represents an individual table in a pdf document.)
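A hypothetical usage sketch for TableParser, based only on the signature above; the document path and credentials file are placeholders, and the imports are deferred so the snippet can be read without ocrpy installed.

```python
def extract_tables(document: str, credentials: str, as_csv: bool = False):
    # Deferred imports: requires ocrpy and AWS credentials to actually run.
    from ocrpy.io.reader import DocumentReader
    from ocrpy.parsers.table.table_parser import TableParser

    reader = DocumentReader(file=document, credentials=credentials)
    parser = TableParser(credentials=credentials, backend="aws-textract")
    # List of lists when as_csv is False, dict of DataFrames otherwise.
    return parser.parse(reader, attempt_csv_conversion=as_csv)
```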
ocrpy.parsers.text#
- class ocrpy.parsers.text.aws_text.AwsTextOCR(reader: Any, credentials: str)#
AWS Textract OCR engine.
- reader : Any
Reader object that can be used to read the document.
- credentials : str
Path to the credentials file. Note: The credentials file must be in .env format.
- AwsTextOCR.parse() → Dict[int, Dict]#
Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.
- parsed_data : dict
Dictionary of pages.
- class ocrpy.parsers.text.gcp_text.GcpTextOCR(reader: Any, credentials: str)#
Google Cloud Vision OCR engine.
- reader : Any
Reader object that can be used to read the document.
- credentials : str
Path to the credentials file. Note: The credentials file must be in .json format.
- GcpTextOCR.parse()#
Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.
- parsed_data : dict
Dictionary of pages.
- class ocrpy.parsers.text.tesseract_text.TesseractTextOCR(reader: Any, credentials=None)#
PyTesseract OCR engine.
- reader : Any
Reader object that can be used to read the document.
- credentials : None
Credentials for the OCR engine (if any).
- TesseractTextOCR.parse()#
Parses the document and returns the OCR data as a dictionary of pages along with additional metadata.
- parsed_data : dict
Dictionary of pages.
ocrpy.parsers.table#
- class ocrpy.parsers.table.aws_table.AwsTableOCR(reader: Any, credentials: Optional[str] = None)#
AWS table parser. This parser uses AWS Textract to analyze the document and extract the tables.
- credentials : str
Path to the credentials file. Note: The credentials file must be in .env format.
- AwsTableOCR.parse() → List[List]#
Parse the document and extract the tables.
- document : str
Path to the document to be parsed.
- tables : List[List]
List of lists of tabular data.
- ocrpy.parsers.table.aws_table.table_to_csv(table_data: List[List]) → Dict[int, DataFrame]#
Convert the table data to CSV.
- table_data : List[List]
Table data extracted from the document using the parser.
- table_data : Dict[int, pd.DataFrame]
Table data converted to pandas DataFrames, one per table.
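The conversion described above can be pictured with a small, self-contained sketch (our re-implementation, not ocrpy's actual code): each table's first row is treated as the header, and the result is keyed by table index.

```python
from typing import Dict, List

import pandas as pd

def tables_to_frames(table_data: List[List]) -> Dict[int, pd.DataFrame]:
    """Sketch of table_to_csv's documented behavior: one DataFrame per
    table, using the first row of each table as the column header."""
    frames = {}
    for index, rows in enumerate(table_data):
        frames[index] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames

# One table with a header row and two data rows:
frames = tables_to_frames([[["name", "qty"], ["pen", "3"], ["ink", "1"]]])
```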
ocrpy.pipelines#
The ocrpy.pipelines module provides a set of high-level classes that wrap different types of
readers, writers, and parser backends, letting the user run OCR on collections of documents in either remote or local
storage and write the results to remote or local storage of their choice.
Alternatively, it lets users run OCR on document collections and index the results to a database/search backend of their choice.
- class ocrpy.pipelines.config.PipelineConfig(config_path)#
OCR Pipeline Configuration container.
- config_path : str
Path to the ocrpy config file.
A sample config file looks like this:
storage_config:
  source_dir: /path/to/source/dir
  destination_dir: /path/to/destination/dir
parser_config:
  parser_backend: pytesseract
cloud_credentials:
  aws: /path/to/aws-credentials.env
  gcp: /path/to/gcp-credentials.json
- class ocrpy.pipelines.text_pipeline.TextOcrPipeline(source_dir: str, destination_dir: str, parser_backend: str = 'pytesseract', credentials_config=None)#
TextOcrPipeline provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend.
Note: Supported parser backends are aws-textract, google-cloud-vision, and pytesseract.
- source_dir : str
Path to the directory containing the documents to be processed.
- destination_dir : str
Path to the directory where the processed documents will be stored.
- parser_backend : str
The parser backend to be used for processing. default: “pytesseract” options: “aws-textract”, “google-cloud-vision”, “pytesseract”
- credentials_config : dict
A dictionary containing the credentials for the parser or cloud storage backends. default: None example: {“AWS”: “aws-credentials.env”, “GCP”: “gcp-credentials.json”}
- classmethod TextOcrPipeline.from_config(config_path: str) → TextOcrPipeline#
Allows you to create a TextOcrPipeline from a config file (config file format: yaml).
- config_path : str
Path to the config file.
- TextOcrPipeline
A TextOcrPipeline object.
storage_config:
  source_dir: /path/to/source/dir
  destination_dir: /path/to/destination/dir
parser_config:
  parser_backend: pytesseract
cloud_credentials:
  aws: /path/to/aws-credentials.env
  gcp: /path/to/gcp-credentials.json
- property TextOcrPipeline.pipeline_config#
- TextOcrPipeline.process() → None#
Runs the pipeline with the given configuration and writes the output to the destination directory.
- Returns:
None
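Putting the pieces together, a pipeline run typically looks like the sketch below. The directory paths and config file name are placeholders, and the imports are deferred so the snippet can be read without ocrpy installed.

```python
def run_pipeline(source_dir: str, destination_dir: str) -> None:
    # Requires ocrpy (and Tesseract, for the default backend) to actually run.
    from ocrpy.pipelines.text_pipeline import TextOcrPipeline

    pipeline = TextOcrPipeline(source_dir=source_dir,
                               destination_dir=destination_dir,
                               parser_backend="pytesseract")
    pipeline.process()

def run_pipeline_from_config(config_path: str) -> None:
    # Equivalent, driven by a yaml config file as shown above.
    from ocrpy.pipelines.text_pipeline import TextOcrPipeline

    TextOcrPipeline.from_config(config_path).process()
```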
- class ocrpy.pipelines.index_pipeline.TextOcrIndexPipeline(source_dir: str, destination_dir: str, parser_backend: str = 'pytesseract', credentials_config=None, database_backend: str = 'sql', database_config: Optional[Dict] = None)#
TextOcrIndexPipeline provides a high-level interface to run OCR on PDFs, JPGs, and PNGs in either local or cloud storage (AWS S3 or Google Cloud Storage) with a configurable parser backend, and then index the results to a database backend of your choice.
Note: Supported parser backends are aws-textract, google-cloud-vision, and pytesseract; the supported database/search backends are sql, opensearch, and elasticsearch.
- database_backend : str
The database backend to be used for indexing. default: “sql” options: “sql”, “opensearch”, “elasticsearch”
- database_config : dict
A dictionary containing the credentials and other params for the database backend. default: None example: {“sql”: {“db_url”: “sqlite:///test.db”, “db_table”: “test”}}
- classmethod TextOcrIndexPipeline.from_config(config_path: str) → TextOcrIndexPipeline#
Allows you to create a TextOcrIndexPipeline from a config file (config file format: yaml).
- config_path : str
Path to the config file.
- TextOcrIndexPipeline
A TextOcrIndexPipeline object.
storage_config:
  source_dir: /path/to/source/dir
  destination_dir: /path/to/destination/dir
parser_config:
  parser_backend: pytesseract
cloud_credentials:
  aws: /path/to/aws-credentials.env
  gcp: /path/to/gcp-credentials.json
- property TextOcrIndexPipeline.pipeline_config#
- TextOcrIndexPipeline.process()#
Runs the pipeline with the given configuration and writes the output to the destination directory.
- Returns:
None
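A sketch of indexing OCR results to a SQL backend with TextOcrIndexPipeline; the sqlite URL and table name mirror the example above, the directory paths are placeholders, and the import is deferred so the snippet can be read without ocrpy installed.

```python
def index_documents(source_dir: str, destination_dir: str) -> None:
    # Requires ocrpy plus a reachable database backend to actually run.
    from ocrpy.pipelines.index_pipeline import TextOcrIndexPipeline

    pipeline = TextOcrIndexPipeline(
        source_dir=source_dir,
        destination_dir=destination_dir,
        parser_backend="pytesseract",
        database_backend="sql",
        database_config={"sql": {"db_url": "sqlite:///test.db",
                                 "db_table": "test"}},
    )
    pipeline.process()
```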
ocrpy.experimental#
The ocrpy.experimental module contains experimental features that are not yet stable and may change
or be removed in future releases.
Currently it exposes the ocrpy.experimental.document_classifier.DocumentClassifier
class, which can be used to classify documents into various categories, and the ocrpy.experimental.layout_parser.DocumentLayoutParser
class, which can be used to identify different components of a document, such as text, titles, tables, and figures.
These can be used along with the OCR pipelines as preprocessing utilities to identify different types of documents and their layouts, and to launch appropriate OCR pipelines for custom processing.
- class ocrpy.experimental.document_classifier.DocumentClassifier(model_name: str = 'microsoft/dit-base-finetuned-rvlcdip')#
DocumentClassifier uses a fine-tuned Document Image Transformer (DiT) model to predict the document type. The default model is a dit-base model trained on the RVL-CDIP dataset.
You can also choose alternate models available on the Hugging Face model hub at https://huggingface.co/models.
- model_name : str
The name of the model to use. Should be a valid model name from the Hugging Face model hub.
default: “microsoft/dit-base-finetuned-rvlcdip”
The default model is trained on the RVL-CDIP dataset and can identify the following document types:
letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, memo
For more information on the model, please refer to this paper: https://arxiv.org/abs/2203.02378
For more information on the document types, see this link: https://www.cs.cmu.edu/~aharley/rvl-cdip/
- DocumentClassifier.predict(reader: DocumentReader) → List#
Predict the document type of the document in the reader.
- reader : DocumentReader
The reader containing the document to predict.
- predicted_labels : List
A list of predicted document types.
Document types can be one of the following: letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, memo
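A usage sketch for DocumentClassifier based on the signature above; the file path is a placeholder, and the imports are deferred because instantiating the class downloads model weights on first use.

```python
def classify(file_path: str):
    # Requires ocrpy and a network connection (for the model download)
    # to actually run; file_path is a placeholder.
    from ocrpy.io.reader import DocumentReader
    from ocrpy.experimental.document_classifier import DocumentClassifier

    reader = DocumentReader(file=file_path)
    classifier = DocumentClassifier()  # microsoft/dit-base-finetuned-rvlcdip
    return classifier.predict(reader)  # list of predicted document types
```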
- class ocrpy.experimental.layout_parser.DocumentLayoutParser(model_name: str = 'lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config')#
DocumentLayoutParser uses a fine-tuned PaddleDetection model to detect the layout of the document. The default model is a PP-YOLOv2-based model trained on the PubLayNet dataset.
You can also choose alternate models available in the layoutparser model zoo at https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html
- model_name : str
The name of the model to use. Should be a valid model path from the layoutparser model zoo.
default: “lp://PubLayNet/ppyolov2_r50vd_dcn_365e/config”
The default model is trained on the PubLayNet dataset and can detect the following blocks in a document:
text, title, list, table, figure
For more information on the dataset, please refer to this paper: https://arxiv.org/abs/1908.07836
- DocumentLayoutParser.parse(reader: DocumentReader, ocr: TextParser) → List#
Detect the layout of the document in the reader and run OCR on each detected block.
- reader : DocumentReader
The reader containing the document to parse.
- ocr : TextParser
The parser to use for OCR.
- ocr_result : dict
The OCR result, with each block updated with layout information.
A block can be one of the following types: text, title, list, table, figure
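A usage sketch for DocumentLayoutParser combined with a TextParser, based on the signatures above; the file path is a placeholder, and the imports are deferred because the class relies on layoutparser and PaddleDetection weights.

```python
def parse_layout(file_path: str):
    # Requires ocrpy, layoutparser, and the default PaddleDetection model
    # to actually run; file_path is a placeholder.
    from ocrpy.io.reader import DocumentReader
    from ocrpy.parsers.text.text_parser import TextParser
    from ocrpy.experimental.layout_parser import DocumentLayoutParser

    reader = DocumentReader(file=file_path)
    ocr = TextParser(backend="pytesseract")
    layout_parser = DocumentLayoutParser()
    # OCR result with blocks labeled text/title/list/table/figure.
    return layout_parser.parse(reader, ocr)
```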
ocrpy.utils#
The ocrpy.utils module provides various helper functions used by the other modules in the package.
- ocrpy.utils.utils.guess_extension(file_path: str) → str#
Guesses the file extension of the file.
- file_path : str
Path to the file.
- extension_type : str
File extension.
- ocrpy.utils.utils.guess_storage(file_path: str) → str#
Guesses the storage type of the file.
- file_path : str
Path to the file.
- storage_type : str
Storage type.
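The documented behavior of these helpers can be pictured with a self-contained sketch (our re-implementation, not ocrpy's actual code): the storage type is inferred from the path's URI scheme, and the extension from the path's suffix.

```python
import os

def guess_storage_sketch(file_path: str) -> str:
    # s3:// paths map to S3, gs:// paths to Google Cloud Storage,
    # and everything else is treated as local storage.
    if file_path.startswith("s3://"):
        return "s3"
    if file_path.startswith("gs://"):
        return "gs"
    return "local"

def guess_extension_sketch(file_path: str) -> str:
    # Return the lowercased extension without the leading dot.
    return os.path.splitext(file_path)[1].lstrip(".").lower()
```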