Class Tesseract4OcrEngineProperties

java.lang.Object
com.itextpdf.pdfocr.OcrEngineProperties
com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties

public class Tesseract4OcrEngineProperties extends OcrEngineProperties
Properties that will be used by the IOcrEngine.
  • Constructor Details

  • Method Details

    • getDefaultLanguage

      public final String getDefaultLanguage()
      Gets the default language for OCR.
      Returns:
      default language - "eng"
    • getDefaultUserWordsSuffix

      public final String getDefaultUserWordsSuffix()
      Gets default user words suffix.
      Returns:
      default suffix for user words files
    • getPathToTessData

      public final File getPathToTessData()
      Gets path to directory with tess data.
      Returns:
      path to directory with tess data
    • setPathToTessData

      public final Tesseract4OcrEngineProperties setPathToTessData (File tessData)
      Sets path to directory with tess data.
      Parameters:
      tessData - path to the directory with tess data (trained data), as File
      Returns:
      the Tesseract4OcrEngineProperties instance
      Throws:
      PdfOcrTesseract4Exception - if the path to the tess data directory is null or empty, or the provided directory does not exist, or it is not a directory
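      The conditions listed under Throws can be sketched as a standalone check. This is a minimal illustration under stated assumptions, not the library's implementation: the class TessDataPathCheck and the method requireTessDataDir are hypothetical names, and IllegalArgumentException stands in for PdfOcrTesseract4Exception so that the example needs only the JDK.

```java
import java.io.File;

// Hypothetical stand-in for the checks described under "Throws".
// The real setter throws PdfOcrTesseract4Exception; this sketch uses
// IllegalArgumentException so it runs without the pdfOCR dependency.
final class TessDataPathCheck {

    static File requireTessDataDir(File tessData) {
        // Null or empty path.
        if (tessData == null || tessData.getPath().isEmpty()) {
            throw new IllegalArgumentException("Path to tess data directory is null or empty");
        }
        // Path that does not exist on disk.
        if (!tessData.exists()) {
            throw new IllegalArgumentException("Tess data directory does not exist: " + tessData);
        }
        // Path that exists but is a regular file.
        if (!tessData.isDirectory()) {
            throw new IllegalArgumentException("Tess data path is not a directory: " + tessData);
        }
        return tessData;
    }

    public static void main(String[] args) {
        // The JVM temp directory is used only because it exists on any machine.
        File existingDir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("accepted: " + requireTessDataDir(existingDir));

        try {
            requireTessDataDir(new File(existingDir, "missing-tessdata-" + System.nanoTime()));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

      Checking the path before it is handed to the setter makes it possible to report a configuration problem in the caller's own terms.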
    • getPageSegMode

      public final Integer getPageSegMode()
      Gets Page Segmentation Mode.
      Returns:
      psm mode as Integer
    • setPageSegMode

      public final Tesseract4OcrEngineProperties setPageSegMode (Integer mode)
      Sets Page Segmentation Mode. See https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options for a more detailed explanation of the PSM modes. Note that the documentation states that the default PSM value is 3. This is true for the tesseract executable, but for the tesseract library the default is -1, which has a negative impact on some documents. That is why the code sets it explicitly to 3.
      Parameters:
      mode - psm mode as Integer
      Returns:
      the Tesseract4OcrEngineProperties instance
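      The explicit default and the chained setter described above can be sketched with a small stand-in class. PsmDefaultSketch and its constant are hypothetical names introduced for illustration; this is not the iText class, only a model of the documented behaviour (a default of 3 instead of the library's -1, and a setter that returns the same instance).

```java
// Hypothetical stand-in for a properties holder; not the iText class itself.
final class PsmDefaultSketch {

    // The text reports -1 as the tesseract library default and 3 as the
    // executable default; the sketch starts from 3 explicitly.
    static final int EXPLICIT_DEFAULT_PSM = 3;

    private Integer pageSegMode = EXPLICIT_DEFAULT_PSM;

    Integer getPageSegMode() {
        return pageSegMode;
    }

    // Returns this instance so that setter calls can be chained.
    PsmDefaultSketch setPageSegMode(Integer mode) {
        this.pageSegMode = mode;
        return this;
    }

    public static void main(String[] args) {
        PsmDefaultSketch properties = new PsmDefaultSketch();
        System.out.println("default psm: " + properties.getPageSegMode());

        // 6 is the Tesseract mode that assumes a single uniform block of text.
        System.out.println("custom psm: " + properties.setPageSegMode(6).getPageSegMode());
    }
}
```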
    • isPreprocessingImages

      public final boolean isPreprocessingImages()
      Checks whether image preprocessing is needed.
      Returns:
      true if images need to be preprocessed, otherwise - false
    • setPreprocessingImages

      public final Tesseract4OcrEngineProperties setPreprocessingImages (boolean preprocess)
      Sets whether image preprocessing is needed.
      Parameters:
      preprocess - true if images need to be preprocessed, otherwise - false
      Returns:
      the Tesseract4OcrEngineProperties instance
    • getTextPositioning

      public final TextPositioning getTextPositioning()
      Gets the way text is retrieved from tesseract output, expressed as a TextPositioning value.
      Returns:
      the way text is retrieved
    • setTextPositioning

      public final Tesseract4OcrEngineProperties setTextPositioning (TextPositioning positioning)
      Defines the way text is retrieved from tesseract output using TextPositioning.
      Parameters:
      positioning - the way text is retrieved
      Returns:
      the Tesseract4OcrEngineProperties instance
    • isUseTxtToImproveHocrParsing

      public final boolean isUseTxtToImproveHocrParsing()
      Gets useTxtToImproveHocrParsing. The flag is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information, see https://github.com/tesseract-ocr/tesseract/issues/2702
      Returns:
      true if txt is used to improve HOCR parsing, otherwise - false
    • setUseTxtToImproveHocrParsing

      public final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing (boolean useTxtToImproveHocrParsing)
      Sets useTxtToImproveHocrParsing. The flag is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information, see https://github.com/tesseract-ocr/tesseract/issues/2702
      Parameters:
      useTxtToImproveHocrParsing - true if txt should be used to improve HOCR parsing, otherwise - false
      Returns:
      the Tesseract4OcrEngineProperties instance
    • getImagePreprocessingOptions

      public final ImagePreprocessingOptions getImagePreprocessingOptions()
      Gets the image preprocessing options.
      Returns:
      ImagePreprocessingOptions
    • setImagePreprocessingOptions

      public final Tesseract4OcrEngineProperties setImagePreprocessingOptions (ImagePreprocessingOptions imagePreprocessingOptions)
      Sets the image preprocessing options.
      Parameters:
      imagePreprocessingOptions - ImagePreprocessingOptions
      Returns:
      the Tesseract4OcrEngineProperties instance
    • getMinimalConfidenceLevel

      public final int getMinimalConfidenceLevel()
      Gets the minimal confidence level for an HOCR line to be considered properly recognized. If the actual confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may range from 0 to 100.
      Returns:
      minimal confidence level
    • setMinimalConfidenceLevel

      public final Tesseract4OcrEngineProperties setMinimalConfidenceLevel (int minimalConfidenceLevel)
      Sets the minimal confidence level for an HOCR line to be considered properly recognized. If the actual confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may range from 0 to 100.
      Parameters:
      minimalConfidenceLevel - minimal confidence level value
      Returns:
      this Tesseract4OcrEngineProperties instance
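      The threshold rule described for the minimal confidence level can be sketched as a simple filter. This is an illustration under stated assumptions rather than the library's code: ConfidenceFilterSketch and keepConfidentLines are hypothetical names, recognized lines are modelled as a map from line text to confidence, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the rule "lines below the minimal confidence
// level are ignored"; not taken from the pdfOCR sources.
final class ConfidenceFilterSketch {

    // Keeps every line whose confidence is at least the given level.
    static List<String> keepConfidentLines(Map<String, Integer> lineConfidences,
                                           int minimalConfidenceLevel) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : lineConfidences.entrySet()) {
            if (entry.getValue() >= minimalConfidenceLevel) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: line text mapped to a confidence in 0-100.
        Map<String, Integer> lines = new LinkedHashMap<>();
        lines.put("Invoice 42", 96);
        lines.put("l1l|l", 12);
        lines.put("Total", 70);

        // Level 0 keeps everything, matching the documented default.
        System.out.println(keepConfidentLines(lines, 0));
        // A higher level drops the low-confidence line.
        System.out.println(keepConfidentLines(lines, 50));
    }
}
```

      With a level of 0 nothing is filtered out, which mirrors the documented default; raising the level trades recall for fewer misrecognized lines.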