java.lang.Object

com.itextpdf.pdfocr.OcrEngineProperties

com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties

public class Tesseract4OcrEngineProperties extends OcrEngineProperties

Properties that will be used by the IOcrEngine.

Constructor Summary

Constructors

Constructor

Description

Tesseract4OcrEngineProperties()

Creates a new Tesseract4OcrEngineProperties instance.

Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other)

Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).

Method Summary

Modifier and Type

Method

Description

final String

getDefaultLanguage()

Gets the default language for OCR.

final String

getDefaultUserWordsSuffix()

Gets default user words suffix.

final ImagePreprocessingOptions

getImagePreprocessingOptions()

Gets imagePreprocessingOptions.

final int

getMinimalConfidenceLevel()

Gets minimal confidence level for HOCR line to be considered as properly recognized.

final Integer

getPageSegMode()

Gets Page Segmentation Mode.

final File

getPathToTessData()

Gets path to directory with tess data.

final TextPositioning

getTextPositioning()

Defines the way text is retrieved from tesseract output using TextPositioning.

final boolean

isPreprocessingImages()

Checks whether image preprocessing is needed.

final boolean

isUseTxtToImproveHocrParsing()

Gets useTxtToImproveHocrParsing.

final Tesseract4OcrEngineProperties

setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)

Sets imagePreprocessingOptions.

final Tesseract4OcrEngineProperties

setMinimalConfidenceLevel(int minimalConfidenceLevel)

Sets minimal confidence level for HOCR line to be considered as properly recognized.

final Tesseract4OcrEngineProperties

setPageSegMode(Integer mode)

Sets Page Segmentation Mode.

final Tesseract4OcrEngineProperties

setPathToTessData(File tessData)

Sets path to directory with tess data.

final Tesseract4OcrEngineProperties

setPreprocessingImages(boolean preprocess)

Sets true if image preprocessing is needed.

final Tesseract4OcrEngineProperties

setTextPositioning(TextPositioning positioning)

Defines the way text is retrieved from tesseract output using TextPositioning.

final Tesseract4OcrEngineProperties

setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)

Sets useTxtToImproveHocrParsing.

Methods inherited from class com.itextpdf.pdfocr.OcrEngineProperties
getLanguages, setLanguages

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- Tesseract4OcrEngineProperties
  
  public Tesseract4OcrEngineProperties()
  
  Creates a new Tesseract4OcrEngineProperties instance.
- Tesseract4OcrEngineProperties
  
  public Tesseract4OcrEngineProperties (Tesseract4OcrEngineProperties other)
  
  Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
  
  Parameters:
  
  other - the other Tesseract4OcrEngineProperties instance
Method Details
- getDefaultLanguage
  
  public final String getDefaultLanguage()
  
  Gets the default language for OCR.
  
  Returns:
  
  default language - "eng"
- getDefaultUserWordsSuffix
  
  public final String getDefaultUserWordsSuffix()
  
  Gets default user words suffix.
  
  Returns:
  
  default suffix for user words files
- getPathToTessData
  
  public final File getPathToTessData()
  
  Gets path to directory with tess data.
  
  Returns:
  
  path to directory with tess data
- setPathToTessData
  
  public final Tesseract4OcrEngineProperties setPathToTessData (File tessData)
  
  Sets path to directory with tess data.
  
  Parameters:
  
  tessData - path to train directory as File
  
  Returns:
  
  the Tesseract4OcrEngineProperties instance
  
  Throws:
  
  PdfOcrTesseract4Exception - if the path to the tess data directory is null or empty, or the provided directory does not exist, or it is not a directory
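  
  The conditions listed under Throws can be checked up front before calling the setter. The following is a minimal sketch using only java.io.File; the class TessDataPathCheck and its method describeProblem are hypothetical names introduced for illustration and are not part of pdfOCR, and the library's actual validation may differ in order and wording.
  
  ```java
  import java.io.File;
  import java.util.Optional;

  // Hypothetical helper, NOT part of pdfOCR: it mirrors the three
  // documented failure conditions for the tess data path.
  final class TessDataPathCheck {

      // Returns a description of the first problem found, or empty if none.
      static Optional<String> describeProblem(File tessData) {
          if (tessData == null || tessData.getPath().isEmpty()) {
              return Optional.of("path is null or empty");
          }
          if (!tessData.exists()) {
              return Optional.of("directory does not exist");
          }
          if (!tessData.isDirectory()) {
              return Optional.of("path is not a directory");
          }
          return Optional.empty();
      }

      public static void main(String[] args) {
          // The system temp directory is used only as a stand-in for a real tess data directory.
          File candidate = new File(System.getProperty("java.io.tmpdir"));
          System.out.println(describeProblem(candidate).orElse("looks usable"));
      }
  }
  ```
  
  A caller could run such a check first and only then pass the File to setPathToTessData, so that a misconfigured path is reported with an application-specific message.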
- getPageSegMode
  
  public final Integer getPageSegMode()
  
  Gets Page Segmentation Mode.
  
  Returns:
  
  psm mode as Integer
- setPageSegMode
  
  public final Tesseract4OcrEngineProperties setPageSegMode (Integer mode)
  
  Sets Page Segmentation Mode (PSM). A detailed explanation of the PSM modes can be found at https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options. Note that the Tesseract documentation states that the default PSM value is 3. This is true for the Tesseract executable, but for the Tesseract library the default is -1, which has a negative impact on some documents. For that reason the code sets the value explicitly to 3.
  
  Parameters:
  
  mode - psm mode as Integer
  
  Returns:
  
  the Tesseract4OcrEngineProperties instance
- isPreprocessingImages
  
  public final boolean isPreprocessingImages()
  
  Checks whether image preprocessing is needed.
  
  Returns:
  
  true if images need to be preprocessed, otherwise - false
- setPreprocessingImages
  
  public final Tesseract4OcrEngineProperties setPreprocessingImages (boolean preprocess)
  
  Sets true if image preprocessing is needed.
  
  Parameters:
  
  preprocess - true if images need to be preprocessed, otherwise - false
  
  Returns:
  
  the Tesseract4OcrEngineProperties instance
- getTextPositioning
  
  public final TextPositioning getTextPositioning()
  
  Defines the way text is retrieved from tesseract output using TextPositioning.
  
  Returns:
  
  the way text is retrieved
- setTextPositioning
  
  public final Tesseract4OcrEngineProperties setTextPositioning (TextPositioning positioning)
  
  Defines the way text is retrieved from tesseract output using TextPositioning.
  
  Parameters:
  
  positioning - the way text is retrieved
  
  Returns:
  
  the Tesseract4OcrEngineProperties instance
- isUseTxtToImproveHocrParsing
  
  public final boolean isUseTxtToImproveHocrParsing()
  
  Gets useTxtToImproveHocrParsing, which is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information, see https://github.com/tesseract-ocr/tesseract/issues/2702
  
  Returns:
  
  useTxtToImproveHocrParsing
- setUseTxtToImproveHocrParsing
  
  public final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing (boolean useTxtToImproveHocrParsing)
  
  Sets useTxtToImproveHocrParsing, which is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information, see https://github.com/tesseract-ocr/tesseract/issues/2702
  
  Parameters:
  
  useTxtToImproveHocrParsing - useTxtToImproveHocrParsing
  
  Returns:
  
  this Tesseract4OcrEngineProperties instance.
- getImagePreprocessingOptions
  
  public final ImagePreprocessingOptions getImagePreprocessingOptions()
  
  Gets imagePreprocessingOptions.
  
  Returns:
  
  ImagePreprocessingOptions
- setImagePreprocessingOptions
  
  public final Tesseract4OcrEngineProperties setImagePreprocessingOptions (ImagePreprocessingOptions imagePreprocessingOptions)
  
  Sets imagePreprocessingOptions.
  
  Parameters:
  
  imagePreprocessingOptions - ImagePreprocessingOptions
  
  Returns:
  
  the Tesseract4OcrEngineProperties instance
- getMinimalConfidenceLevel
  
  public final int getMinimalConfidenceLevel()
  
  Gets minimal confidence level for HOCR line to be considered as properly recognized. If the actual confidence level is lower, the line is ignored. The default value is 0, which means that every line is considered properly recognized. The value may range from 0 to 100.
  
  Returns:
  
  minimal confidence level
- setMinimalConfidenceLevel
  
  public final Tesseract4OcrEngineProperties setMinimalConfidenceLevel (int minimalConfidenceLevel)
  
  Sets minimal confidence level for HOCR line to be considered as properly recognized. If the actual confidence level is lower, the line is ignored. The default value is 0, which means that every line is considered properly recognized. The value may range from 0 to 100.
  
  Parameters:
  
  minimalConfidenceLevel - minimal confidence level value
  
  Returns:
  
  this Tesseract4OcrEngineProperties instance
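  
  The effect of the minimal confidence level can be illustrated with a small stand-alone sketch. It assumes each recognized line carries a confidence value from 0 to 100 and keeps a line only when its confidence is not lower than the threshold. The class ConfidenceFilterSketch, the method keepRecognized, and the sample lines are hypothetical and are not part of pdfOCR; how the library stores lines and confidences internally is not shown here.
  
  ```java
  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical illustration, NOT the pdfOCR implementation.
  final class ConfidenceFilterSketch {

      // Keeps the lines whose confidence is at least minimalConfidenceLevel;
      // lines with a lower confidence are ignored.
      static List<String> keepRecognized(Map<String, Integer> lineConfidences,
                                         int minimalConfidenceLevel) {
          List<String> kept = new ArrayList<>();
          for (Map.Entry<String, Integer> entry : lineConfidences.entrySet()) {
              if (entry.getValue() >= minimalConfidenceLevel) {
                  kept.add(entry.getKey());
              }
          }
          return kept;
      }

      public static void main(String[] args) {
          // Made-up sample data: line text mapped to its confidence (0-100).
          Map<String, Integer> lines = new LinkedHashMap<>();
          lines.put("Invoice 42", 93);
          lines.put("l1legib1e", 17);
          lines.put("Total: 118.00", 60);

          // With the default threshold of 0 nothing is filtered out.
          System.out.println(keepRecognized(lines, 0));
          // A higher threshold drops the low-confidence line.
          System.out.println(keepRecognized(lines, 60));
      }
  }
  ```
  
  With a threshold of 0 every line passes, which matches the documented default; raising the threshold trades completeness of the recognized text for fewer misrecognized lines.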
