Package com.itextpdf.pdfocr.tesseract4
Class Tesseract4OcrEngineProperties
java.lang.Object
com.itextpdf.pdfocr.OcrEngineProperties
com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties
Properties that will be used by the IOcrEngine.
Constructor Summary
Tesseract4OcrEngineProperties()
    Creates a new Tesseract4OcrEngineProperties instance.
Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other)
    Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
Method Summary
final String getDefaultLanguage()
    Gets default language for OCR.
final String getDefaultUserWordsSuffix()
    Gets default user words suffix.
final ImagePreprocessingOptions getImagePreprocessingOptions()
    Gets imagePreprocessingOptions.
final int getMinimalConfidenceLevel()
    Gets minimal confidence level for HOCR line to be considered as properly recognized.
final Integer getPageSegMode()
    Gets Page Segmentation Mode.
final File getPathToTessData()
    Gets path to directory with tess data.
final TextPositioning getTextPositioning()
    Gets the way text is retrieved from tesseract output using TextPositioning.
final boolean isPreprocessingImages()
    Checks whether image preprocessing is needed.
final boolean isUseTxtToImproveHocrParsing()
    Gets useTxtToImproveHocrParsing.
final Tesseract4OcrEngineProperties setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)
    Sets imagePreprocessingOptions.
final Tesseract4OcrEngineProperties setMinimalConfidenceLevel(int minimalConfidenceLevel)
    Sets minimal confidence level for HOCR line to be considered as properly recognized.
final Tesseract4OcrEngineProperties setPageSegMode(Integer mode)
    Sets Page Segmentation Mode.
final Tesseract4OcrEngineProperties setPathToTessData(File tessData)
    Sets path to directory with tess data.
final Tesseract4OcrEngineProperties setPreprocessingImages(boolean preprocess)
    Sets true if image preprocessing is needed.
final Tesseract4OcrEngineProperties setTextPositioning(TextPositioning positioning)
    Defines the way text is retrieved from tesseract output using TextPositioning.
final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)
    Sets useTxtToImproveHocrParsing.

Methods inherited from class com.itextpdf.pdfocr.OcrEngineProperties
    getLanguages, setLanguages

Constructor Details

Tesseract4OcrEngineProperties
public Tesseract4OcrEngineProperties()
Creates a new Tesseract4OcrEngineProperties instance.

Tesseract4OcrEngineProperties
public Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other)
Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
Parameters:
    other - the other Tesseract4OcrEngineProperties instance

Method Details

getDefaultLanguage
public final String getDefaultLanguage()
Gets default language for OCR.
Returns:
    default language - "eng"

getDefaultUserWordsSuffix
public final String getDefaultUserWordsSuffix()
Gets default user words suffix.
Returns:
    default suffix for user words files

getPathToTessData
public final File getPathToTessData()
Gets path to directory with tess data.
Returns:
    path to directory with tess data

setPathToTessData
public final Tesseract4OcrEngineProperties setPathToTessData(File tessData)
Sets path to directory with tess data.
Parameters:
    tessData - path to train directory as File
Returns:
    the Tesseract4OcrEngineProperties instance
Throws:
    PdfOcrTesseract4Exception - if the path to the tess data directory is null or empty, or the provided directory does not exist, or it is not a directory
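
The failure conditions listed under Throws can be pictured as a small pre-check run before the path is handed to the setter. The sketch below is a standalone illustration that uses only java.io.File. The class TessDataPathCheck and its method describeProblem are hypothetical names that are not part of pdfOCR, and the real setter reports these conditions by throwing PdfOcrTesseract4Exception rather than by returning a message.

```java
import java.io.File;

// Hypothetical helper, NOT part of pdfOCR: it only mirrors the documented
// conditions under which setPathToTessData rejects its argument.
final class TessDataPathCheck {
    private TessDataPathCheck() {
    }

    // Returns null when the path looks usable, otherwise a short reason.
    static String describeProblem(File tessData) {
        if (tessData == null || tessData.getPath().isEmpty()) {
            return "path is null or empty";
        }
        if (!tessData.exists()) {
            return "directory does not exist";
        }
        if (!tessData.isDirectory()) {
            return "path is not a directory";
        }
        return null;
    }
}
```

A caller could run such a check first and surface a friendlier message to the user before constructing the properties object.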

getPageSegMode
public final Integer getPageSegMode()
Gets Page Segmentation Mode.
Returns:
    psm mode as Integer

setPageSegMode
public final Tesseract4OcrEngineProperties setPageSegMode(Integer mode)
Sets Page Segmentation Mode. A more detailed explanation of the psm modes can be found at https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options
Note that the documentation states that the default value of PSM is 3. This is true for the tesseract executable, but for the tesseract lib it is -1, which has a negative impact on some documents. That is why the code sets it explicitly to 3.
Parameters:
    mode - psm mode as Integer
Returns:
    the Tesseract4OcrEngineProperties instance
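
To make the executable-side behaviour concrete, here is a minimal sketch of how a PSM value could be turned into the --psm command-line option of the tesseract executable. The class PsmArgs and its method toArgs are hypothetical and not part of pdfOCR. Treating null as "omit the option" and accepting only 0 through 13 are assumptions based on the Tesseract 4 command-line documentation, not on anything stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of pdfOCR: shows how a page segmentation
// mode could map onto the tesseract executable's "--psm" option.
final class PsmArgs {
    private PsmArgs() {
    }

    static List<String> toArgs(Integer mode) {
        List<String> args = new ArrayList<>();
        if (mode == null) {
            // Assumption: with no option given, the executable applies its
            // own documented default.
            return args;
        }
        // Assumption: Tesseract 4 lists modes 0 through 13.
        if (mode < 0 || mode > 13) {
            throw new IllegalArgumentException("Unsupported psm mode: " + mode);
        }
        args.add("--psm");
        args.add(String.valueOf(mode));
        return args;
    }
}
```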

isPreprocessingImages
public final boolean isPreprocessingImages()
Checks whether image preprocessing is needed.
Returns:
    true if images need to be preprocessed, false otherwise

setPreprocessingImages
public final Tesseract4OcrEngineProperties setPreprocessingImages(boolean preprocess)
Sets true if image preprocessing is needed.
Parameters:
    preprocess - true if images need to be preprocessed, false otherwise
Returns:
    the Tesseract4OcrEngineProperties instance

getTextPositioning
public final TextPositioning getTextPositioning()
Gets the way text is retrieved from tesseract output using TextPositioning.
Returns:
    the way text is retrieved

setTextPositioning
public final Tesseract4OcrEngineProperties setTextPositioning(TextPositioning positioning)
Defines the way text is retrieved from tesseract output using TextPositioning.
Parameters:
    positioning - the way text is retrieved
Returns:
    the Tesseract4OcrEngineProperties instance

isUseTxtToImproveHocrParsing
public final boolean isUseTxtToImproveHocrParsing()
Gets useTxtToImproveHocrParsing. Used to make the HOCR recognition result more precise. This is needed for cases of Thai language or some Chinese dialects where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
Returns:
    useTxtToImproveHocrParsing

setUseTxtToImproveHocrParsing
public final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)
Sets useTxtToImproveHocrParsing. Used to make the HOCR recognition result more precise. This is needed for cases of Thai language or some Chinese dialects where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
Parameters:
    useTxtToImproveHocrParsing - useTxtToImproveHocrParsing
Returns:
    this Tesseract4OcrEngineProperties instance

getImagePreprocessingOptions
public final ImagePreprocessingOptions getImagePreprocessingOptions()
Gets imagePreprocessingOptions.
Returns:
    ImagePreprocessingOptions

setImagePreprocessingOptions
public final Tesseract4OcrEngineProperties setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)
Sets imagePreprocessingOptions.
Parameters:
    imagePreprocessingOptions - ImagePreprocessingOptions
Returns:
    the Tesseract4OcrEngineProperties instance

getMinimalConfidenceLevel
public final int getMinimalConfidenceLevel()
Gets minimal confidence level for HOCR line to be considered as properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered as properly recognized. The value may vary in the range 0-100.
Returns:
    minimal confidence level

setMinimalConfidenceLevel
public final Tesseract4OcrEngineProperties setMinimalConfidenceLevel(int minimalConfidenceLevel)
Sets minimal confidence level for HOCR line to be considered as properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered as properly recognized. The value may vary in the range 0-100.
Parameters:
    minimalConfidenceLevel - minimal confidence level value
Returns:
    this Tesseract4OcrEngineProperties instance
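
The threshold rule for the minimal confidence level can be sketched as a simple filter over recognized lines. This is a standalone illustration, not pdfOCR's implementation: the class ConfidenceFilter, the RecognizedLine type, and the use of a double for the per-line confidence are all hypothetical. Only the rule itself comes from the description: a line whose confidence is lower than the minimal level is ignored, and a level of 0 keeps everything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, NOT pdfOCR code: applies the documented
// "ignore lines below the minimal confidence level" rule to a list.
final class ConfidenceFilter {
    private ConfidenceFilter() {
    }

    // Hypothetical stand-in for one recognized HOCR line.
    static final class RecognizedLine {
        final String text;
        final double confidence; // assumed to lie in the range 0-100

        RecognizedLine(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keeps every line whose confidence is not lower than the threshold.
    static List<RecognizedLine> keepConfident(List<RecognizedLine> lines,
                                              int minimalConfidenceLevel) {
        List<RecognizedLine> kept = new ArrayList<>();
        for (RecognizedLine line : lines) {
            if (line.confidence >= minimalConfidenceLevel) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

With the default level of 0 the filter returns every line, which matches the statement that everything is considered properly recognized.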