Tesseract4OcrEngineProperties (pdfOCR 2.0.2 API)

java.lang.Object
- com.itextpdf.pdfocr.OcrEngineProperties
- - com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties

public class Tesseract4OcrEngineProperties
extends OcrEngineProperties

Properties that will be used by the IOcrEngine.

Constructor Summary

Constructors
Constructor and Description
`Tesseract4OcrEngineProperties()` Creates a new `Tesseract4OcrEngineProperties` instance.
`Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other)` Creates a new `Tesseract4OcrEngineProperties` instance based on another `Tesseract4OcrEngineProperties` instance (copy constructor).

Method Summary

Modifier and Type	Method and Description
`String`	`getDefaultLanguage()` Gets default language for ocr.
`String`	`getDefaultUserWordsSuffix()` Gets default user words suffix.
`ImagePreprocessingOptions`	`getImagePreprocessingOptions()` Gets `imagePreprocessingOptions`.
`int`	`getMinimalConfidenceLevel()` Gets minimal confidence level for HOCR line to be considered as properly recognized.
`Integer`	`getPageSegMode()` Gets Page Segmentation Mode.
`File`	`getPathToTessData()` Gets path to directory with tess data.
`TextPositioning`	`getTextPositioning()` Defines the way text is retrieved from tesseract output using `TextPositioning`.
`boolean`	`isPreprocessingImages()` Checks whether image preprocessing is needed.
`boolean`	`isUseTxtToImproveHocrParsing()` Gets `useTxtToImproveHocrParsing`.
`Tesseract4OcrEngineProperties`	`setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)` Sets `imagePreprocessingOptions`.
`Tesseract4OcrEngineProperties`	`setMinimalConfidenceLevel(int minimalConfidenceLevel)` Sets minimal confidence level for HOCR line to be considered as properly recognized.
`Tesseract4OcrEngineProperties`	`setPageSegMode(Integer mode)` Sets Page Segmentation Mode.
`Tesseract4OcrEngineProperties`	`setPathToTessData(File tessData)` Sets path to directory with tess data.
`Tesseract4OcrEngineProperties`	`setPreprocessingImages(boolean preprocess)` Sets true if image preprocessing is needed.
`Tesseract4OcrEngineProperties`	`setTextPositioning(TextPositioning positioning)` Defines the way text is retrieved from tesseract output using `TextPositioning`.
`Tesseract4OcrEngineProperties`	`setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)` Sets `useTxtToImproveHocrParsing`.

Methods inherited from class com.itextpdf.pdfocr.OcrEngineProperties
getLanguages, setLanguages

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - Tesseract4OcrEngineProperties
```
public Tesseract4OcrEngineProperties()
```
    Creates a new Tesseract4OcrEngineProperties instance.
  - Tesseract4OcrEngineProperties
```
public Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other)
```
    Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
    
    Parameters:
    
    other - the other Tesseract4OcrEngineProperties instance
- Method Detail
  - getDefaultLanguage
```
public final String getDefaultLanguage()
```
    Gets default language for ocr.
    
    Returns:
    
    default language - "eng"
  - getDefaultUserWordsSuffix
```
public final String getDefaultUserWordsSuffix()
```
    Gets default user words suffix.
    
    Returns:
    
    default suffix for user words files
  - getPathToTessData
```
public final File getPathToTessData()
```
    Gets path to directory with tess data.
    
    Returns:
    
    path to directory with tess data
  - setPathToTessData
```
public final Tesseract4OcrEngineProperties setPathToTessData(File tessData)
```
    Sets path to directory with tess data.
    
    Parameters:
    
    tessData - path to train directory as File
    
    Returns:
    
    the Tesseract4OcrEngineProperties instance
    
    Throws:
    
    PdfOcrTesseract4Exception - if the path to the tess data directory is null or empty, or the provided directory does not exist, or it is not a directory
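
    The three rejection conditions listed above can be sketched without the library. This is a hedged illustration only: `TessDataPathCheckSketch` and `requireTessDataDirectory` are hypothetical names, and `IllegalArgumentException` stands in for `PdfOcrTesseract4Exception`. It does not reproduce the pdfOCR implementation.

```java
import java.io.File;

// Hypothetical stand-in for illustration; not part of pdfOCR.
class TessDataPathCheckSketch {

    // Returns the directory unchanged when it passes the documented checks.
    // IllegalArgumentException is an assumption standing in for PdfOcrTesseract4Exception.
    static File requireTessDataDirectory(File tessData) {
        if (tessData == null || tessData.getPath().isEmpty()) {
            throw new IllegalArgumentException("Path to tess data directory is null or empty");
        }
        if (!tessData.exists()) {
            throw new IllegalArgumentException("Directory does not exist: " + tessData);
        }
        if (!tessData.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + tessData);
        }
        return tessData;
    }

    public static void main(String[] args) {
        // The system temp directory is used only because it is known to exist.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(requireTessDataDirectory(tmp).isDirectory());
        try {
            requireTessDataDirectory(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

    Checking the path before calling the setter is one way to report a clearer error to the user than the exception raised during configuration.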
  - getPageSegMode
```
public final Integer getPageSegMode()
```
    Gets Page Segmentation Mode.
    
    Returns:
    
    psm mode as Integer
  - setPageSegMode
```
public final Tesseract4OcrEngineProperties setPageSegMode(Integer mode)
```
    Sets Page Segmentation Mode. A more detailed explanation of PSM modes can be found at https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options Note that the documentation states that the default PSM value is 3. This is true for the tesseract executable, but for the tesseract library the default is -1, which has a negative impact on some documents. For that reason the code sets it explicitly to 3.
    
    Parameters:
    
    mode - psm mode as Integer
    
    Returns:
    
    the Tesseract4OcrEngineProperties instance
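
    The explicit default of 3 and the setter's return-the-instance shape can be sketched as follows. `PageSegModeSketch` is a hypothetical stand-in that mirrors only what is documented above; it is not the pdfOCR class, and its handling of other values is an assumption.

```java
// Hypothetical stand-in for illustration; not part of pdfOCR.
class PageSegModeSketch {

    // Set explicitly to 3 because, per the description above, the library default is -1.
    static final int DEFAULT_PAGE_SEG_MODE = 3;

    private Integer pageSegMode = DEFAULT_PAGE_SEG_MODE;

    Integer getPageSegMode() {
        return pageSegMode;
    }

    // Returns this instance so that setter calls can be chained.
    PageSegModeSketch setPageSegMode(Integer mode) {
        this.pageSegMode = mode;
        return this;
    }

    public static void main(String[] args) {
        PageSegModeSketch properties = new PageSegModeSketch();
        System.out.println(properties.getPageSegMode());
        // 6 is the Tesseract mode for a single uniform block of text.
        System.out.println(properties.setPageSegMode(6).getPageSegMode());
    }
}
```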
  - isPreprocessingImages
```
public final boolean isPreprocessingImages()
```
    Checks whether image preprocessing is needed.
    
    Returns:
    
    true if images need to be preprocessed, false otherwise
  - setPreprocessingImages
```
public final Tesseract4OcrEngineProperties setPreprocessingImages(boolean preprocess)
```
    Sets true if image preprocessing is needed.
    
    Parameters:
    
    preprocess - true if images need to be preprocessed, false otherwise
    
    Returns:
    
    the Tesseract4OcrEngineProperties instance
  - getTextPositioning
```
public final TextPositioning getTextPositioning()
```
    Defines the way text is retrieved from tesseract output using TextPositioning.
    
    Returns:
    
    the way text is retrieved
  - setTextPositioning
```
public final Tesseract4OcrEngineProperties setTextPositioning(TextPositioning positioning)
```
    Defines the way text is retrieved from tesseract output using TextPositioning.
    
    Parameters:
    
    positioning - the way text is retrieved
    
    Returns:
    
    the Tesseract4OcrEngineProperties instance
  - isUseTxtToImproveHocrParsing
```
public final boolean isUseTxtToImproveHocrParsing()
```
    Gets useTxtToImproveHocrParsing, which is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
    
    Returns:
    
    useTxtToImproveHocrParsing
  - setUseTxtToImproveHocrParsing
```
public final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)
```
    Sets useTxtToImproveHocrParsing, which is used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
    
    Parameters:
    
    useTxtToImproveHocrParsing - useTxtToImproveHocrParsing
    
    Returns:
    
    this Tesseract4OcrEngineProperties instance.
  - getImagePreprocessingOptions
```
public final ImagePreprocessingOptions getImagePreprocessingOptions()
```
    Gets imagePreprocessingOptions.
    
    Returns:
    
    ImagePreprocessingOptions
  - setImagePreprocessingOptions
```
public final Tesseract4OcrEngineProperties setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)
```
    Sets imagePreprocessingOptions.
    
    Parameters:
    
    imagePreprocessingOptions - ImagePreprocessingOptions
    
    Returns:
    
    the Tesseract4OcrEngineProperties instance
  - getMinimalConfidenceLevel
```
public final int getMinimalConfidenceLevel()
```
    Gets the minimal confidence level for an HOCR line to be considered properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may range from 0 to 100.
    
    Returns:
    
    minimal confidence level
  - setMinimalConfidenceLevel
```
public final Tesseract4OcrEngineProperties setMinimalConfidenceLevel(int minimalConfidenceLevel)
```
    Sets the minimal confidence level for an HOCR line to be considered properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may range from 0 to 100.
    
    Parameters:
    
    minimalConfidenceLevel - minimal confidence level value
    
    Returns:
    
    this Tesseract4OcrEngineProperties instance
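
    The threshold rule described above (a line is ignored when its confidence is lower than the minimal level) can be illustrated with a small, library-free sketch. `ConfidenceFilterSketch`, `Line`, and `keepRecognized` are hypothetical names, and the sample confidences are made up for the example. How pdfOCR applies the level internally is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not the pdfOCR implementation.
class ConfidenceFilterSketch {

    // Minimal model of a recognized line: its text and a confidence in the 0-100 range.
    static final class Line {
        final String text;
        final int confidence;

        Line(String text, int confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keeps the lines whose confidence is not lower than the minimal level.
    static List<Line> keepRecognized(List<Line> lines, int minimalConfidenceLevel) {
        List<Line> kept = new ArrayList<>();
        for (Line line : lines) {
            if (line.confidence >= minimalConfidenceLevel) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(new Line("clear text", 95), new Line("smudged text", 40));
        System.out.println(keepRecognized(lines, 0).size());  // level 0 keeps every line
        System.out.println(keepRecognized(lines, 50).size()); // drops the line with confidence 40
    }
}
```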
