pdfOCR 1.0.2 API
iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties Class Reference

Properties that will be used by the iText.Pdfocr.IOcrEngine.

Inheritance diagram for iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties:
iText.Pdfocr.OcrEngineProperties

Public Member Functions

  Tesseract4OcrEngineProperties ()
  Creates a new Tesseract4OcrEngineProperties instance.

  Tesseract4OcrEngineProperties (iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties other)
  Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).

String  GetDefaultLanguage ()
  Gets the default language for OCR.

String  GetDefaultUserWordsSuffix ()
  Gets the default user words suffix.

FileInfo  GetPathToTessData ()
  Gets the path to the directory with tess data.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetPathToTessData (FileInfo tessData)
  Sets the path to the directory with tess data.

int?  GetPageSegMode ()
  Gets the Page Segmentation Mode.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetPageSegMode (int? mode)
  Sets the Page Segmentation Mode.

bool  IsPreprocessingImages ()
  Checks whether image preprocessing is needed.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetPreprocessingImages (bool preprocess)
  Sets whether image preprocessing is needed.

TextPositioning  GetTextPositioning ()
  Gets the way text is retrieved from tesseract output, expressed as a TextPositioning value.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetTextPositioning (TextPositioning positioning)
  Sets the way text is retrieved from tesseract output, expressed as a TextPositioning value.

bool  IsUseTxtToImproveHocrParsing ()
  Gets useTxtToImproveHocrParsing.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetUseTxtToImproveHocrParsing (bool useTxtToImproveHocrParsing)
  Sets useTxtToImproveHocrParsing.

ImagePreprocessingOptions  GetImagePreprocessingOptions ()
  Gets imagePreprocessingOptions.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetImagePreprocessingOptions (ImagePreprocessingOptions imagePreprocessingOptions)
  Sets imagePreprocessingOptions.

int  GetMinimalConfidenceLevel ()
  Gets the minimal confidence level for an HOCR line to be considered properly recognized.

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  SetMinimalConfidenceLevel (int minimalConfidenceLevel)
  Sets the minimal confidence level for an HOCR line to be considered properly recognized.
 
- Public Member Functions inherited from iText.Pdfocr.OcrEngineProperties
  OcrEngineProperties ()
  Creates a new OcrEngineProperties instance.

  OcrEngineProperties (iText.Pdfocr.OcrEngineProperties other)
  Creates a new OcrEngineProperties instance based on another OcrEngineProperties instance (copy constructor).

IList< String >  GetLanguages ()
  Gets list of languages required for provided images.

iText.Pdfocr.OcrEngineProperties  SetLanguages (IList< String > requiredLanguages)
  Sets list of languages to be recognized in provided images.
 

Detailed Description

Properties that will be used by the iText.Pdfocr.IOcrEngine.
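
As an illustration of how the members listed above fit together, the following is a minimal sketch of configuring an instance with chained setters. The class name OcrPropertiesExample, the method Build, and the tessDataDir argument are hypothetical; the path is assumed to point at an existing directory that holds the trained data files.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Pdfocr.Tesseract4;

public static class OcrPropertiesExample
{
    // tessDataDir is a caller-supplied path (hypothetical); it is assumed to be
    // an existing directory containing the trained data files.
    public static Tesseract4OcrEngineProperties Build(string tessDataDir)
    {
        // Each setter below returns Tesseract4OcrEngineProperties, so calls can be chained.
        Tesseract4OcrEngineProperties properties = new Tesseract4OcrEngineProperties()
            .SetPathToTessData(new FileInfo(tessDataDir))
            .SetPageSegMode(3)
            .SetPreprocessingImages(true)
            .SetMinimalConfidenceLevel(0);

        // SetLanguages is inherited and returns the base type OcrEngineProperties,
        // so it is called separately to keep the derived type in the chain above.
        properties.SetLanguages(new List<string> { "eng" });
        return properties;
    }
}
```

Calling the inherited SetLanguages outside the chain is a design choice of this sketch: its declared return type is iText.Pdfocr.OcrEngineProperties, so placing it mid-chain would hide the Tesseract-specific setters from the calls that follow.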

Constructor & Destructor Documentation

◆ Tesseract4OcrEngineProperties() [1/2]

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.Tesseract4OcrEngineProperties ( )
inline

Creates a new Tesseract4OcrEngineProperties instance.

◆ Tesseract4OcrEngineProperties() [2/2]

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.Tesseract4OcrEngineProperties ( iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties  other )
inline

Creates a new Tesseract4OcrEngineProperties instance based on another Tesseract4OcrEngineProperties instance (copy constructor).

Parameters
other the other Tesseract4OcrEngineProperties instance
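
A minimal sketch of one possible use of the copy constructor: deriving a stricter variant from a shared baseline. The class name CopyExample, the method StrictVariant, and the threshold of 80 are illustrative assumptions.

```csharp
using iText.Pdfocr.Tesseract4;

public static class CopyExample
{
    // Returns a copy of the baseline with a higher confidence threshold.
    // The setter is called on the copy only, so the baseline keeps its own value.
    public static Tesseract4OcrEngineProperties StrictVariant(Tesseract4OcrEngineProperties baseline)
    {
        return new Tesseract4OcrEngineProperties(baseline)
            .SetMinimalConfidenceLevel(80);
    }
}
```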

Member Function Documentation

◆ GetDefaultLanguage()

String iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetDefaultLanguage ( )
inline

Gets the default language for OCR.

Returns
the default language, "eng"

◆ GetDefaultUserWordsSuffix()

String iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetDefaultUserWordsSuffix ( )
inline

Gets default user words suffix.

Returns
default suffix for user words files

◆ GetImagePreprocessingOptions()

ImagePreprocessingOptions iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetImagePreprocessingOptions ( )
inline

Gets imagePreprocessingOptions.

Returns
ImagePreprocessingOptions

◆ GetMinimalConfidenceLevel()

int iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetMinimalConfidenceLevel ( )
inline

Gets the minimal confidence level for an HOCR line to be considered properly recognized.

If the actual confidence level of a line is lower, the line is ignored. The default value is 0, which means that every line is considered properly recognized. The value may range from 0 to 100.

◆ GetPageSegMode()

int? iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetPageSegMode ( )
inline

Gets Page Segmentation Mode.

Returns
the page segmentation mode as int?

◆ GetPathToTessData()

FileInfo iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetPathToTessData ( )
inline

Gets path to directory with tess data.

Returns
path to directory with tess data

◆ GetTextPositioning()

TextPositioning iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.GetTextPositioning ( )
inline

Gets the way text is retrieved from tesseract output, expressed as a TextPositioning value.

Returns
the way text is retrieved

◆ IsPreprocessingImages()

bool iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.IsPreprocessingImages ( )
inline

Checks whether image preprocessing is needed.

Returns
true if images need to be preprocessed, otherwise false

◆ IsUseTxtToImproveHocrParsing()

bool iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.IsUseTxtToImproveHocrParsing ( )
inline

Gets useTxtToImproveHocrParsing.

Used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702

Returns
useTxtToImproveHocrParsing

◆ SetImagePreprocessingOptions()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetImagePreprocessingOptions ( ImagePreprocessingOptions  imagePreprocessingOptions )
inline

Sets imagePreprocessingOptions.

Parameters
imagePreprocessingOptions ImagePreprocessingOptions

Returns
the Tesseract4OcrEngineProperties instance

◆ SetMinimalConfidenceLevel()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetMinimalConfidenceLevel ( int  minimalConfidenceLevel )
inline

Sets the minimal confidence level for an HOCR line to be considered properly recognized.

If the actual confidence level of a line is lower, the line is ignored. The default value is 0, which means that every line is considered properly recognized. The value may range from 0 to 100.
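
The threshold rule described above can be sketched without the library. The example below is an illustration of the stated rule only, not the library's implementation; RecognizedLine and KeepRecognized are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConfidenceFilterSketch
{
    // Hypothetical stand-in for one recognized HOCR line; this is not an iText type.
    public sealed class RecognizedLine
    {
        public string Text { get; }
        public int Confidence { get; } // expected range: 0-100

        public RecognizedLine(string text, int confidence)
        {
            Text = text;
            Confidence = confidence;
        }
    }

    // Keeps lines whose confidence is at least the threshold;
    // lines with a lower confidence are ignored.
    public static List<RecognizedLine> KeepRecognized(
        IEnumerable<RecognizedLine> lines, int minimalConfidenceLevel)
    {
        return lines.Where(line => line.Confidence >= minimalConfidenceLevel).ToList();
    }
}
```

With a threshold of 0 no line can fall below it, which matches the documented default of treating everything as properly recognized.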

◆ SetPageSegMode()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetPageSegMode ( int?  mode )
inline

Sets Page Segmentation Mode.

See https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#options for a more detailed explanation of the PSM modes. Note that the documentation states that the default PSM value is 3. This is true for the tesseract executable, but for the tesseract library the default is -1, which has a negative impact on some documents. For that reason the code sets it explicitly to 3.

Parameters
mode the page segmentation mode as int?
Returns
the Tesseract4OcrEngineProperties instance
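
A minimal sketch of overriding the page segmentation mode. The constant names are hypothetical; the numeric values are the ones listed in the Tesseract options documentation (3 for fully automatic page segmentation without OSD, 6 for a single uniform block of text).

```csharp
using iText.Pdfocr.Tesseract4;

public static class PageSegModeExample
{
    // Hypothetical named constants for two Tesseract PSM values.
    public const int AutomaticPageSegmentation = 3;
    public const int SingleUniformBlock = 6;

    // Returns properties configured for input that is one uniform block of text.
    public static Tesseract4OcrEngineProperties ForSingleBlock()
    {
        return new Tesseract4OcrEngineProperties()
            .SetPageSegMode(SingleUniformBlock);
    }
}
```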

◆ SetPathToTessData()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetPathToTessData ( FileInfo  tessData )
inline

Sets path to directory with tess data.

Parameters
tessData path to the directory with tess data, as System.IO.FileInfo
Returns
the Tesseract4OcrEngineProperties instance

◆ SetPreprocessingImages()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetPreprocessingImages ( bool  preprocess )
inline

Sets whether image preprocessing is needed.

Parameters
preprocess true if images need to be preprocessed, otherwise false
Returns
the Tesseract4OcrEngineProperties instance

◆ SetTextPositioning()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetTextPositioning ( TextPositioning  positioning )
inline

Sets the way text is retrieved from tesseract output, expressed as a TextPositioning value.

Parameters
positioning the way text is retrieved
Returns
the Tesseract4OcrEngineProperties instance
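
A minimal sketch of selecting a text positioning strategy. It assumes that the TextPositioning enum lives in the iText.Pdfocr namespace and exposes a BY_WORDS value; neither is stated on this page, so check the enum reference for the values your version offers.

```csharp
using iText.Pdfocr;
using iText.Pdfocr.Tesseract4;

public static class TextPositioningExample
{
    public static Tesseract4OcrEngineProperties WordLevel()
    {
        // TextPositioning.BY_WORDS is an assumption of this sketch (see lead-in).
        return new Tesseract4OcrEngineProperties()
            .SetTextPositioning(TextPositioning.BY_WORDS);
    }
}
```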

◆ SetUseTxtToImproveHocrParsing()

iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties iText.Pdfocr.Tesseract4.Tesseract4OcrEngineProperties.SetUseTxtToImproveHocrParsing ( bool  useTxtToImproveHocrParsing )
inline

Sets useTxtToImproveHocrParsing.

Used to make the HOCR recognition result more precise. This is needed for languages such as Thai or some Chinese dialects, where every character is interpreted as a single word. For more information see https://github.com/tesseract-ocr/tesseract/issues/2702

Parameters
useTxtToImproveHocrParsing true to use the txt output to improve HOCR parsing, otherwise false
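
A minimal sketch of enabling this flag for Thai input, one of the cases mentioned above. The class name ThaiOcrExample is hypothetical; "tha" is the Tesseract language code for Thai, and the matching trained data file is assumed to be available in the tess data directory.

```csharp
using System.Collections.Generic;
using iText.Pdfocr.Tesseract4;

public static class ThaiOcrExample
{
    public static Tesseract4OcrEngineProperties Build()
    {
        Tesseract4OcrEngineProperties properties = new Tesseract4OcrEngineProperties()
            .SetUseTxtToImproveHocrParsing(true);

        // "tha" is the Tesseract language code for Thai; tha.traineddata is
        // assumed to be present in the configured tess data directory.
        properties.SetLanguages(new List<string> { "tha" });
        return properties;
    }
}
```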