Package com.itextpdf.pdfocr.tesseract4
Class Tesseract4OcrEngineProperties
java.lang.Object
com.itextpdf.pdfocr.OcrEngineProperties
com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties
Properties that will be used by the IOcrEngine.
-
Constructor Summary
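Constructors
Constructor | Description
Tesseract4OcrEngineProperties() | Creates a new Tesseract4OcrEngineProperties instance.
Tesseract4OcrEngineProperties(Tesseract4OcrEngineProperties other) | Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
-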
Method Summary
Modifier and Type | Method | Description
final String | getDefaultLanguage() | Gets default language for OCR.
final String | getDefaultUserWordsSuffix() | Gets default user words suffix.
final ImagePreprocessingOptions | getImagePreprocessingOptions() | Gets imagePreprocessingOptions.
final int | getMinimalConfidenceLevel() | Gets minimal confidence level for HOCR line to be considered as properly recognized.
final Integer | getPageSegMode() | Gets Page Segmentation Mode.
final File | getPathToTessData() | Gets path to directory with tess data.
final TextPositioning | getTextPositioning() | Defines the way text is retrieved from tesseract output using TextPositioning.
final boolean | isPreprocessingImages() | Checks whether image preprocessing is needed.
final boolean | isUseTxtToImproveHocrParsing() | Gets useTxtToImproveHocrParsing.
final Tesseract4OcrEngineProperties | setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions) | Sets imagePreprocessingOptions.
final Tesseract4OcrEngineProperties | setMinimalConfidenceLevel(int minimalConfidenceLevel) | Sets minimal confidence level for HOCR line to be considered as properly recognized.
final Tesseract4OcrEngineProperties | setPageSegMode(Integer mode) | Sets Page Segmentation Mode.
final Tesseract4OcrEngineProperties | setPathToTessData(File tessData) | Sets path to directory with tess data.
final Tesseract4OcrEngineProperties | setPreprocessingImages(boolean preprocess) | Sets true if image preprocessing is needed.
final Tesseract4OcrEngineProperties | setTextPositioning(TextPositioning positioning) | Defines the way text is retrieved from tesseract output using TextPositioning.
final Tesseract4OcrEngineProperties | setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing) | Sets useTxtToImproveHocrParsing.
Methods inherited from class com.itextpdf.pdfocr.OcrEngineProperties
getLanguages, setLanguages
-
Constructor Details
-
Tesseract4OcrEngineProperties
public Tesseract4OcrEngineProperties()
Creates a new Tesseract4OcrEngineProperties instance.
-
Tesseract4OcrEngineProperties
Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).
Parameters:
other - the other Tesseract4OcrEngineProperties instance
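The copy constructor, together with setters that return the properties instance, allows one configuration to be derived from another. The sketch below illustrates that pattern with a hypothetical stand-in class, `PropertiesSketch`, which has only two fields. It is not the library class, and it is an assumption that the real copy constructor copies each field in the same way.

```java
// Hypothetical stand-in for a properties class; not part of pdfOCR.
final class PropertiesSketch {
    private int pageSegMode = 3;            // default value mentioned in this documentation
    private int minimalConfidenceLevel = 0; // default value mentioned in this documentation

    PropertiesSketch() {
    }

    // Copy constructor: the new instance starts with the other instance's values.
    PropertiesSketch(PropertiesSketch other) {
        this.pageSegMode = other.pageSegMode;
        this.minimalConfidenceLevel = other.minimalConfidenceLevel;
    }

    // Each setter returns this instance so that calls can be chained.
    PropertiesSketch setPageSegMode(int mode) {
        this.pageSegMode = mode;
        return this;
    }

    PropertiesSketch setMinimalConfidenceLevel(int level) {
        this.minimalConfidenceLevel = level;
        return this;
    }

    int getPageSegMode() {
        return pageSegMode;
    }

    int getMinimalConfidenceLevel() {
        return minimalConfidenceLevel;
    }

    public static void main(String[] args) {
        PropertiesSketch base = new PropertiesSketch()
                .setPageSegMode(6)
                .setMinimalConfidenceLevel(40);
        // Changing the copy leaves the original untouched.
        PropertiesSketch copy = new PropertiesSketch(base).setMinimalConfidenceLevel(80);
        System.out.println(base.getMinimalConfidenceLevel()); // prints 40
        System.out.println(copy.getMinimalConfidenceLevel()); // prints 80
        System.out.println(copy.getPageSegMode());            // prints 6
    }
}
```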
-
-
Method Details
-
getDefaultLanguage
Gets default language for OCR.
Returns:
default language, "eng"
-
getDefaultUserWordsSuffix
Gets default user words suffix.
Returns:
default suffix for user words files
-
getPathToTessData
Gets path to directory with tess data.
Returns:
path to directory with tess data
-
setPathToTessData
Sets path to directory with tess data.
Parameters:
tessData - path to train directory as File
Returns:
the Tesseract4OcrEngineProperties instance
Throws:
PdfOcrTesseract4Exception - if the path to the tess data directory is null or empty, or if the provided directory does not exist or is not a directory
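The conditions listed under Throws can be read as a sequence of checks on the supplied File. The following is a minimal sketch of such checks using only the JDK. The class `TessDataPathCheck` and its method are hypothetical, and `IllegalArgumentException` stands in for PdfOcrTesseract4Exception; the order of the checks and the exact messages are assumptions.

```java
import java.io.File;

// Hypothetical helper mirroring the documented conditions; not the library code.
final class TessDataPathCheck {

    // Returns the given directory unchanged if it passes all checks.
    // IllegalArgumentException stands in for PdfOcrTesseract4Exception.
    static File requireTessDataDirectory(File tessData) {
        if (tessData == null || tessData.getPath().isEmpty()) {
            throw new IllegalArgumentException("Path to tess data is null or empty");
        }
        if (!tessData.exists()) {
            throw new IllegalArgumentException("Directory does not exist: " + tessData);
        }
        if (!tessData.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + tessData);
        }
        return tessData;
    }

    public static void main(String[] args) {
        // The system temp directory is used only as a directory that is known to exist.
        File existing = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(requireTessDataDirectory(existing).isDirectory());
        try {
            requireTessDataDirectory(new File(""));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```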
-
getPageSegMode
Gets Page Segmentation Mode.
Returns:
PSM mode as Integer
-
setPageSegMode
Sets Page Segmentation Mode. A more detailed explanation of PSM modes can be found here: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options
Note that the Tesseract documentation states that the default PSM value is 3. This is true for the tesseract executable, but for the tesseract library the default is -1, which has a negative impact on some documents. That is why the code sets it explicitly to 3.
Parameters:
mode - PSM mode as Integer
Returns:
the Tesseract4OcrEngineProperties instance
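For orientation, the numeric modes can be listed next to their meanings. The sketch below is a small lookup helper, not part of pdfOCR: the class `PageSegModes` is hypothetical, the descriptions are paraphrased from the Tesseract manual page linked above, and treating a null mode as the default of 3 is an assumption of this helper rather than documented behaviour of setPageSegMode.

```java
// Hypothetical lookup helper; descriptions paraphrase the Tesseract manual page.
final class PageSegModes {
    // Default value that this documentation says is set explicitly.
    static final int DEFAULT_PSM = 3;

    // Index in the array equals the PSM number.
    private static final String[] DESCRIPTIONS = {
        "Orientation and script detection (OSD) only",
        "Automatic page segmentation with OSD",
        "Automatic page segmentation, but no OSD or OCR",
        "Fully automatic page segmentation, but no OSD",
        "Assume a single column of text of variable sizes",
        "Assume a single uniform block of vertically aligned text",
        "Assume a single uniform block of text",
        "Treat the image as a single text line",
        "Treat the image as a single word",
        "Treat the image as a single word in a circle",
        "Treat the image as a single character",
        "Sparse text, find as much text as possible in no particular order",
        "Sparse text with OSD",
        "Raw line, treat the image as a single text line"
    };

    // Assumption of this sketch: a null mode falls back to DEFAULT_PSM.
    static String describe(Integer mode) {
        int effective = (mode == null) ? DEFAULT_PSM : mode;
        if (effective < 0 || effective >= DESCRIPTIONS.length) {
            throw new IllegalArgumentException("Unknown page segmentation mode: " + effective);
        }
        return effective + ": " + DESCRIPTIONS[effective];
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(6));
    }
}
```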
-
isPreprocessingImages
public final boolean isPreprocessingImages()
Checks whether image preprocessing is needed.
Returns:
true if images need to be preprocessed, otherwise false
-
setPreprocessingImages
Sets true if image preprocessing is needed.
Parameters:
preprocess - true if images need to be preprocessed, otherwise false
Returns:
the Tesseract4OcrEngineProperties instance
-
getTextPositioning
Defines the way text is retrieved from tesseract output using TextPositioning.
Returns:
the way text is retrieved
-
setTextPositioning
Defines the way text is retrieved from tesseract output using TextPositioning.
Parameters:
positioning - the way text is retrieved
Returns:
the Tesseract4OcrEngineProperties instance
-
isUseTxtToImproveHocrParsing
public final boolean isUseTxtToImproveHocrParsing()
Gets useTxtToImproveHocrParsing. Used to make the HOCR recognition result more precise. This is needed for Thai and some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
Returns:
useTxtToImproveHocrParsing
-
setUseTxtToImproveHocrParsing
public final Tesseract4OcrEngineProperties setUseTxtToImproveHocrParsing(boolean useTxtToImproveHocrParsing)
Sets useTxtToImproveHocrParsing. Used to make the HOCR recognition result more precise. This is needed for Thai and some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702
Parameters:
useTxtToImproveHocrParsing - useTxtToImproveHocrParsing
Returns:
this Tesseract4OcrEngineProperties instance
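The idea behind this flag can be illustrated with a small sketch: when the HOCR output reports every character as its own word, the word boundaries found in the plain-text output can be used to regroup those fragments. How pdfOCR actually combines the two outputs is not described here, so the merging rule below, the class `HocrRegroupSketch`, and its fallback behaviour are all assumptions. ASCII letters stand in for Thai or Chinese characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of regrouping HOCR fragments using TXT word boundaries.
final class HocrRegroupSketch {

    // Joins consecutive HOCR fragments until they spell each whitespace-separated
    // word of the TXT line. If the two outputs disagree, the HOCR fragments are
    // returned unchanged.
    static List<String> regroup(List<String> hocrFragments, String txtLine) {
        String trimmed = txtLine.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>(hocrFragments);
        }
        List<String> words = new ArrayList<>();
        int next = 0;
        for (String word : trimmed.split("\\s+")) {
            StringBuilder built = new StringBuilder();
            while (built.length() < word.length() && next < hocrFragments.size()) {
                built.append(hocrFragments.get(next++));
            }
            if (!built.toString().equals(word)) {
                return new ArrayList<>(hocrFragments); // outputs disagree
            }
            words.add(word);
        }
        if (next != hocrFragments.size()) {
            return new ArrayList<>(hocrFragments); // leftover fragments
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("a", "b", "c", "d");
        System.out.println(regroup(fragments, "ab cd")); // prints [ab, cd]
    }
}
```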
-
getImagePreprocessingOptions
Gets imagePreprocessingOptions.
Returns:
ImagePreprocessingOptions
-
setImagePreprocessingOptions
public final Tesseract4OcrEngineProperties setImagePreprocessingOptions(ImagePreprocessingOptions imagePreprocessingOptions)
Sets imagePreprocessingOptions.
Parameters:
imagePreprocessingOptions - ImagePreprocessingOptions
Returns:
the Tesseract4OcrEngineProperties instance
-
getMinimalConfidenceLevel
public final int getMinimalConfidenceLevel()
Gets minimal confidence level for HOCR line to be considered as properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may vary in the range 0-100.
Returns:
minimal confidence level
-
setMinimalConfidenceLevel
Sets minimal confidence level for HOCR line to be considered as properly recognized. If the real confidence level is lower, the line is ignored. The default value is 0, which means that everything is considered properly recognized. The value may vary in the range 0-100.
Parameters:
minimalConfidenceLevel - minimal confidence level value
Returns:
this Tesseract4OcrEngineProperties instance
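The effect of the threshold can be sketched as a simple filter over recognized lines. The class `ConfidenceFilterSketch`, the map of line text to confidence, and the sample values are hypothetical. The comparison follows the description above: a line is ignored only when its confidence is lower than the threshold, so a threshold of 0 keeps everything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the minimal confidence threshold; not the library code.
final class ConfidenceFilterSketch {

    // Keeps the lines whose confidence (0-100) is not lower than the threshold.
    static List<String> keepRecognized(Map<String, Integer> lineConfidences,
                                       int minimalConfidenceLevel) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : lineConfidences.entrySet()) {
            if (entry.getValue() >= minimalConfidenceLevel) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: line text mapped to its confidence.
        Map<String, Integer> lines = new LinkedHashMap<>();
        lines.put("clear line", 95);
        lines.put("smudged line", 42);

        System.out.println(keepRecognized(lines, 0));  // prints [clear line, smudged line]
        System.out.println(keepRecognized(lines, 50)); // prints [clear line]
    }
}
```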
-