Class AbstractTesseract4OcrEngine

java.lang.Object
com.itextpdf.pdfocr.tesseract4.AbstractTesseract4OcrEngine
All Implemented Interfaces:
IOcrEngine, IProductAware
Direct Known Subclasses:
Tesseract4ExecutableOcrEngine, Tesseract4LibOcrEngine

public abstract class AbstractTesseract4OcrEngine extends Object implements IOcrEngine, IProductAware
An abstract base implementation of IOcrEngine for Tesseract 4.

This class performs OCR on input image files and returns the recognized text in the requested format. It also exposes features of Tesseract, an optical character recognition engine available for various operating systems.

  • Constructor Details

  • Method Details

    • doTesseractOcr

      public void doTesseractOcr (File inputImage, File outputFile, OutputFormat outputFormat)
      Performs Tesseract OCR on the first (or only) page of the image.
      Parameters:
      inputImage - input image File
      outputFile - output file for the result for the first page
      outputFormat - selected OutputFormat for tesseract
    • doTesseractOcr

      public void doTesseractOcr (File inputImage, File outputFile, OutputFormat outputFormat, OcrProcessContext ocrProcessContext)
      Performs Tesseract OCR on the first (or only) page of the image.
      Parameters:
      inputImage - input image File
      outputFile - output file for the result for the first page
      outputFormat - selected OutputFormat for tesseract
      ocrProcessContext - OCR process context
    • createTxtFile

      public void createTxtFile (List<File> inputImages, File txtFile)
      Performs OCR on the given list of input images using this IOcrEngine and saves the output to a text file at the provided path.
      Specified by:
      createTxtFile in interface IOcrEngine
      Parameters:
      inputImages - List of images to be OCRed
      txtFile - file to be created
    • createTxtFile

      public void createTxtFile (List<File> inputImages, File txtFile, OcrProcessContext ocrProcessContext)
      Performs OCR on the given list of input images using this IOcrEngine and saves the output to a text file at the provided path.
      Specified by:
      createTxtFile in interface IOcrEngine
      Parameters:
      inputImages - List of images to be OCRed
      txtFile - file to be created
      ocrProcessContext - OCR process context
    • getTesseract4OcrEngineProperties

      public final Tesseract4OcrEngineProperties getTesseract4OcrEngineProperties()
      Gets properties for AbstractTesseract4OcrEngine.
      Returns:
      the Tesseract4OcrEngineProperties currently set
    • setTesseract4OcrEngineProperties

      public final void setTesseract4OcrEngineProperties (Tesseract4OcrEngineProperties tesseract4OcrEngineProperties)
      Sets properties for AbstractTesseract4OcrEngine.
      Parameters:
      tesseract4OcrEngineProperties - set of properties Tesseract4OcrEngineProperties for AbstractTesseract4OcrEngine
    • getLanguagesAsString

      public final String getLanguagesAsString()
      Gets the list of languages concatenated with the "+" symbol into a single string, in the format required by Tesseract.
      Returns:
      String of concatenated languages
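Tesseract accepts several languages as one argument joined with "+", for example eng+deu. The following is a minimal sketch of that concatenation using only the JDK. The class and method names are hypothetical and this is not the pdfOCR implementation.

```java
import java.util.List;

// Hypothetical helper, not part of pdfOCR: it only illustrates the "+"-joined format.
final class LanguageStringSketch {

    // Joins language codes such as "eng" and "deu" into "eng+deu".
    static String joinLanguages(List<String> languages) {
        return String.join("+", languages);
    }

    public static void main(String[] args) {
        System.out.println(joinLanguages(List.of("eng", "deu", "fra")));
    }
}
```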
    • doImageOcr

      public final Map<Integer,List<TextInfo>> doImageOcr (File input)
      Reads data from the provided input image file and returns retrieved data in the format described below.
      Specified by:
      doImageOcr in interface IOcrEngine
      Parameters:
      input - input image File
      Returns:
      Map where the key is an Integer representing the page number and the value is a List of TextInfo elements, each containing a word or a line and its 4 coordinates (bbox)
    • doImageOcr

      public final Map<Integer,List<TextInfo>> doImageOcr (File input, OcrProcessContext ocrProcessContext)
      Reads data from the provided input image file and returns retrieved data in the format described below.
      Specified by:
      doImageOcr in interface IOcrEngine
      Parameters:
      input - input image File
      ocrProcessContext - OCR process context
      Returns:
      Map where the key is an Integer representing the page number and the value is a List of TextInfo elements, each containing a word or a line and its 4 coordinates (bbox)
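The page-to-text-items map returned by both doImageOcr overloads above can be walked page by page. The sketch below uses a stand-in TextBox class in place of the library's TextInfo so that it runs with the JDK alone. The stand-in's fields, the helper name textPerPage, and the sample coordinates are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: shows how a Map<Integer, List<...>> OCR result can be consumed.
final class OcrResultSketch {

    // Stand-in for TextInfo: recognized text plus four bounding-box coordinates.
    static final class TextBox {
        final String text;
        final float[] bbox; // coordinate order is an assumption of this sketch

        TextBox(String text, float[] bbox) {
            this.text = text;
            this.bbox = bbox;
        }
    }

    // Joins the recognized text of each page with spaces, pages in ascending order.
    static Map<Integer, String> textPerPage(Map<Integer, List<TextBox>> ocrResult) {
        Map<Integer, String> pages = new TreeMap<>();
        for (Map.Entry<Integer, List<TextBox>> entry : ocrResult.entrySet()) {
            List<String> parts = new ArrayList<>();
            for (TextBox box : entry.getValue()) {
                parts.add(box.text);
            }
            pages.put(entry.getKey(), String.join(" ", parts));
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<Integer, List<TextBox>> result = new HashMap<>();
        result.put(1, List.of(
                new TextBox("Hello", new float[] {10f, 10f, 60f, 25f}),
                new TextBox("world", new float[] {65f, 10f, 120f, 25f})));
        textPerPage(result).forEach((page, text) -> System.out.println(page + ": " + text));
    }
}
```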
    • doImageOcr

      public final String doImageOcr (File input, OutputFormat outputFormat, OcrProcessContext ocrProcessContext)
      Reads data from the provided input image file and returns retrieved data as string.
      Parameters:
      input - input image File
      outputFormat - OutputFormat of the returned result
      ocrProcessContext - OCR process context
      Returns:
      OCR result as a String that is returned after processing the given image
    • doImageOcr

      public final String doImageOcr (File input, OutputFormat outputFormat)
      Reads data from the provided input image file and returns retrieved data as string.
      Parameters:
      input - input image File
      outputFormat - OutputFormat of the returned result
      Returns:
      OCR result as a String that is returned after processing the given image
    • isWindows

      public boolean isWindows()
      Checks whether the current OS is Windows.
      Returns:
      true if the current OS is Windows, false otherwise
    • identifyOsType

      public String identifyOsType()
      Identifies the type of the current OS and returns it (win, linux).
      Returns:
      type of the current OS as a String
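One common way to derive such a label in Java is to inspect the os.name system property. The sketch below is an illustration under that assumption and is not the pdfOCR implementation. The "mac" and "unknown" labels and the class name are hypothetical additions; the page above only mentions win and linux.

```java
import java.util.Locale;

// Hypothetical sketch of OS detection based on the "os.name" system property.
final class OsTypeSketch {

    // Maps an os.name value to a short label. "mac" is checked before "win"
    // because the string "darwin" also contains "win".
    static String identifyOsType(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "mac";
        }
        if (os.contains("win")) {
            return "win";
        }
        if (os.contains("nux") || os.contains("nix") || os.contains("aix")) {
            return "linux";
        }
        return "unknown";
    }

    static boolean isWindows(String osName) {
        return "win".equals(identifyOsType(osName));
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println(identifyOsType(osName) + " (windows: " + isWindows(osName) + ")");
    }
}
```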
    • validateLanguages

      public void validateLanguages (List<String> languagesList) throws PdfOcrTesseract4Exception
      Validates the list of provided languages and checks that they all exist in the given tess data directory.
      Parameters:
      languagesList - List of provided languages
      Throws:
      PdfOcrTesseract4Exception - if tess data wasn't found for one of the languages from the provided list
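Tesseract stores each language model as a file named after the language code with the .traineddata extension, so a check of this kind can be sketched as a file-existence test. The code below is an assumption-based illustration using only the JDK. The class and method names are hypothetical, and IllegalArgumentException stands in for PdfOcrTesseract4Exception.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the pdfOCR implementation.
final class TessDataCheckSketch {

    // Returns the languages for which "<lang>.traineddata" is missing in tessDataDir.
    static List<String> findMissingLanguages(File tessDataDir, List<String> languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages) {
            File trainedData = new File(tessDataDir, language + ".traineddata");
            if (!trainedData.isFile()) {
                missing.add(language);
            }
        }
        return missing;
    }

    // Throws if at least one language has no trained data file.
    static void validateLanguages(File tessDataDir, List<String> languages) {
        List<String> missing = findMissingLanguages(tessDataDir, languages);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "No tess data found for: " + String.join(", ", missing));
        }
    }

    public static void main(String[] args) {
        // The default directory name "tessdata" is only a placeholder.
        File dir = new File(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Missing: " + findMissingLanguages(dir, List.of("eng", "deu")));
    }
}
```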
    • getMetaInfoContainer

      public PdfOcrMetaInfoContainer getMetaInfoContainer()
      Gets the container with meta info.
      Specified by:
      getMetaInfoContainer in interface IProductAware
      Returns:
      the held meta info container
    • getProductData

      public com.itextpdf.commons.actions.data.ProductData getProductData()
      Description copied from interface: IProductAware
      Gets object containing information about the product.
      Specified by:
      getProductData in interface IProductAware
      Returns:
      product data
    • isTaggingSupported

      public boolean isTaggingSupported()
      Description copied from interface: IOcrEngine
      Checks whether tagging is supported by the OCR engine.
      Specified by:
      isTaggingSupported in interface IOcrEngine
      Returns:
      true if tagging is supported by the engine, false otherwise