OCR in Python Service: Now with Tesseract Support

Need to extract text from images or PDF documents directly within your Integray data processing? You can now do it fully locally, thanks to the new Tesseract OCR support in our Python service.

:pushpin: What does this mean for you?

With the new Tesseract support, you can:

  • Automatically read text from scanned documents, images, or PDF files.
  • Extract text data without relying on cloud services or external APIs.
  • Keep all data processing fully local.

All this is available via the Python Processor connector, which now includes libraries focused on OCR, image handling, and document processing.


:new_button: Newly Added Libraries (OCR and Document Handling)

  • pytesseract – OCR tool to extract text from images.
  • pillow – Image processing (open, edit, save).
  • pdf2image – Convert PDF files to images.
  • opencv-python-headless – OpenCV bindings for server environments (no GUI).
  • tesserocr – Simple wrapper for the Tesseract OCR API.
  • langdetect – Language detection for extracted text.
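
To give a feel for how these libraries fit together, here is a minimal sketch that renders a PDF to images with pdf2image, cleans each page with OpenCV, and runs pytesseract on the result. The function names (`render_pdf_pages`, `binarize`, `recognize`, `pdf_to_text`), the 300 DPI setting, and the Otsu thresholding step are illustrative assumptions, not part of the Integray connector. It also assumes the Poppler utilities that pdf2image depends on are present in the service.

```python
from typing import Callable, List


def render_pdf_pages(pdf_path: str, dpi: int = 300) -> list:
    """Render every page of a PDF to a PIL image (pdf2image relies on Poppler)."""
    # Imported lazily so this module still loads where the library is absent.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def binarize(page):
    """Convert a PIL page to grayscale and apply Otsu thresholding."""
    import cv2
    import numpy as np

    gray = cv2.cvtColor(np.array(page.convert("RGB")), cv2.COLOR_RGB2GRAY)
    # With THRESH_OTSU the threshold value is chosen automatically; the 0 is ignored.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def recognize(image, lang: str = "eng") -> str:
    """Run Tesseract on a PIL image or NumPy array and return the text."""
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang)


def pdf_to_text(
    pdf_path: str,
    render: Callable = render_pdf_pages,
    preprocess: Callable = binarize,
    ocr: Callable = recognize,
) -> List[str]:
    """Return one stripped text string per page of the PDF."""
    return [ocr(preprocess(page)).strip() for page in render(pdf_path)]
```

Calling `pdf_to_text("contract.pdf")` (a placeholder file name) would return a list with one string per page. The three steps are passed in as parameters so that any of them can be swapped, for example to skip thresholding on clean digital PDFs.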

:white_check_mark: Previously Available Libraries

  • pandas – Powerful structures for data analysis and statistics.
  • numpy – Core library for array computing.
  • PyYAML – YAML parsing and generation.
  • openai – Client for OpenAI API.
  • deepdiff – Compare Python objects deeply.
  • python-jose – JSON Web Token (JWT) implementation.
  • passlib – Secure password hashing.
  • httpx – HTTP client for API calls.
  • matplotlib – Static and animated data visualizations.

:test_tube: Want to try it?

Check out our practical example:
:backhand_index_pointing_right: Tesseract OCR Example

This example demonstrates how to load a PNG image, extract text using the pytesseract library, and automatically detect the language of the recognized text using langdetect. All processing is done fully locally, with no external services involved.
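
For readers who want a rough idea of the flow before opening the linked example, the steps described above can be sketched as follows. This is not the linked example itself: the function names (`read_image_text`, `guess_language`, `ocr_with_language`) and the default `lang="eng"` are assumptions made for illustration, and how the image reaches the script inside the Python Processor is not covered here.

```python
from typing import Callable, Dict, Optional


def read_image_text(image_path: str, lang: str = "eng") -> str:
    """Open an image file with Pillow and return the text Tesseract recognizes."""
    # Imported lazily so this module still loads where OCR libraries are absent.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def guess_language(text: str) -> Optional[str]:
    """Return a language code such as 'en', or None if detection is not possible."""
    if not text.strip():
        return None  # langdetect raises on empty input, so stop early
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is randomized; a fixed seed makes runs repeatable
    try:
        return detect(text)
    except LangDetectException:
        return None


def ocr_with_language(
    image_path: str,
    ocr: Callable[[str], str] = read_image_text,
    detect_lang: Callable[[str], Optional[str]] = guess_language,
) -> Dict[str, Optional[str]]:
    """Run OCR on one image and attach the detected language of the result."""
    text = ocr(image_path).strip()
    return {"text": text, "language": detect_lang(text)}
```

Calling `ocr_with_language("scan.png")` (a placeholder path) returns a dictionary with the recognized text and a language code. Detection on very short strings tends to be unreliable, so treat the language as a hint, not a guarantee.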


:blue_book: Looking for more details?

You can find everything you need about the Python Processor connector here:
:page_facing_up: Python Processor Documentation


:light_bulb: Summary

With OCR libraries now available, Integray lets you extract text from unstructured documents entirely offline. Whether you're processing invoices, scanned contracts, or image-based forms, everything you need is now built directly into the Python connector.

Give it a try and let us know what you'd love to see next! :backhand_index_pointing_down: