OCR in Python Service: Now with Tesseract Support

Need to extract text from images or PDF documents as part of your data processing in Integray? You can now do it entirely locally, thanks to the new Tesseract OCR support in our Python service.

:pushpin: What does this mean for you?

With the new Tesseract support, you can:

  • Automatically read text from scanned documents, images, or PDF files.
  • Extract text data without relying on cloud services or external APIs.
  • Keep all data processing fully local.

All this is available via the Python Processor connector, which now includes libraries focused on OCR, image handling, and document processing.


:new_button: Newly Added Libraries (OCR and Document Handling)

  • pytesseract – Python wrapper for the Tesseract OCR engine; extracts text from images.
  • pillow – Image processing (open, edit, save).
  • pdf2image – Convert PDF files to images.
  • opencv-python-headless – OpenCV bindings for server environments (no GUI).
  • tesserocr – Python wrapper that calls the Tesseract OCR API directly.
  • langdetect – Language detection for extracted text.
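
As a rough illustration of how these libraries can fit together, the sketch below renders a PDF to page images with pdf2image and runs pytesseract on each page. The helper names `ocr_pdf` and `join_pages`, the 300 DPI setting, and the `eng` language code are assumptions made for illustration. pdf2image also relies on Poppler and pytesseract on the Tesseract binary; the sketch assumes both are present in the Python service.

```python
from typing import List


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Return the recognized text of each page in a PDF (hypothetical helper)."""
    # Imported inside the function so the rest of the script still loads
    # in environments where the OCR stack is not installed.
    from pdf2image import convert_from_path
    import pytesseract

    # Render every PDF page to a PIL image; a higher DPI usually helps OCR accuracy.
    pages = convert_from_path(pdf_path, dpi=dpi)
    # Run Tesseract on each rendered page and collect the text.
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def join_pages(page_texts: List[str]) -> str:
    """Join per-page text into one string with simple page markers."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)
```

Calling `join_pages(ocr_pdf("invoice.pdf"))` would then yield one string for the whole document, where `invoice.pdf` is only a placeholder path. Keeping the per-page list separate from the joined string makes it easy to process pages individually when needed.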

:white_check_mark: Previously Available Libraries

  • pandas – Powerful structures for data analysis and statistics.
  • numpy – Core library for array computing.
  • PyYAML – YAML parsing and generation.
  • openai – Client for OpenAI API.
  • deepdiff – Compare Python objects deeply.
  • python-jose – JSON Web Token (JWT) implementation.
  • passlib – Secure password hashing.
  • httpx – HTTP client for API calls.
  • matplotlib – Static and animated data visualizations.

:test_tube: Want to try it?

Check out our practical example:
:backhand_index_pointing_right: Tesseract OCR Example

This example demonstrates how to load a PNG image, extract text using the pytesseract library, and automatically detect the language of the recognized text using langdetect. All processing runs locally, with no external services involved.
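
The flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code of the linked example: the helper names `extract_text`, `normalize`, and `detect_language` are hypothetical, and the sketch assumes Pillow, pytesseract, langdetect, and the Tesseract binary are available.

```python
from typing import Optional


def extract_text(image_path: str) -> str:
    """OCR a single image file with pytesseract (hypothetical helper)."""
    # Lazy imports: only needed when OCR is actually run.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def normalize(text: str) -> str:
    """Collapse the runs of whitespace that OCR output often contains."""
    return " ".join(text.split())


def detect_language(text: str) -> Optional[str]:
    """Return a language code such as 'en', or None if detection is not possible."""
    if not text.strip():
        return None  # nothing to analyze
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make langdetect's results deterministic
    try:
        return detect(text)
    except LangDetectException:
        return None  # e.g. text without any usable characters
```

With these helpers, `detect_language(normalize(extract_text("sample.png")))` would return the detected language code for a placeholder image `sample.png`, or `None` when no text was recognized. Very short snippets tend to be detected unreliably, so it can be worth checking the text length before trusting the result.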


:blue_book: Looking for more details?

You can find everything you need about the Python Processor connector here:
:page_facing_up: Python Processor Documentation


:light_bulb: Summary

With OCR libraries now available, Integray lets you extract text from unstructured documents entirely offline. Whether you're processing invoices, scanned contracts, or image-based forms, the tools you need are built directly into the Python Processor connector.

Give it a try and let us know what you’d love to see next! :backhand_index_pointing_down: