java.lang.Object

com.itextpdf.pdf2data.Pdf2DataExtractor

public class Pdf2DataExtractor extends Object

Pdf2DataExtractor extracts data from PDF files and images according to a pdf2data template.

To create a Pdf2DataExtractor instance for extracting data from PDF files, use create(File).

To create a Pdf2DataExtractor instance for extracting data from images, use create(File, OcrWithPostProcessingEngine).

To extract data from a PDF file, use the extract(File) method.

To extract data from an image, use the extract(File, RecognitionProperties) method and specify the file type via the RecognitionProperties instance.
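
Which of the two paths applies depends on whether the input is a PDF or an image. As a minimal sketch, assuming the caller does not already know the input type, one way to decide is to look for the `%PDF-` header at the start of the file. `InputSniffer` and `looksLikePdf` are hypothetical names introduced here, not part of pdf2Data, and treating every non-PDF input as an image is a simplifying assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of pdf2Data.
class InputSniffer {
    // PDF files begin with the ASCII header "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file starts with the PDF header bytes. */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may deliver fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        // true  -> the create(File) and extract(File) path
        // false -> assumed image: create(File, OcrWithPostProcessingEngine)
        //          and extract(File, RecognitionProperties)
        return filled == head.length && Arrays.equals(head, PDF_MAGIC);
    }
}
```

A header check of this kind is more reliable than a file extension, but it only tells PDFs from everything else. It does not confirm that a non-PDF input is an image format the OCR engine supports.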

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| Map<String,Integer> | check(File targetPDF) | Recognizes the PDF file and returns the recognition result counts. |
| Map<String,Integer> | check(File targetFile, RecognitionProperties properties) | Recognizes the document and returns the recognition result counts. |
| Map<String,Integer> | check(InputStream targetInputStream) | Recognizes the PDF file and returns the recognition result counts. |
| Map<String,Integer> | check(InputStream targetInputStream, RecognitionProperties properties) | Recognizes the document and returns the recognition result counts. |
| static Pdf2DataExtractor | create(File p2dFile) | Creates a Pdf2DataExtractor instance from a pdf2data template file. |
| static Pdf2DataExtractor | create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) | Creates a Pdf2DataExtractor instance from a pdf2data template file with the provided OCR engine. |
| static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream) | Creates a Pdf2DataExtractor instance from a stream that contains pdf2data template content in JSON format. |
| static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine) | Creates a Pdf2DataExtractor instance from a stream that contains pdf2data template content in JSON format, with the provided OCR engine. |
| RecognitionResultHolder | extract(File targetPDF) | Recognizes the PDF file. |
| RecognitionResultHolder | extract(File targetFile, RecognitionProperties properties) | Recognizes the file. |
| RecognitionResultHolder | extract(InputStream targetInputStream) | Recognizes the PDF file. |
| RecognitionResultHolder | extract(InputStream targetInputStream, RecognitionProperties properties) | Recognizes the file. |
| com.itextpdf.pdfocr.IOcrEngine | getOcrEngine() | Gets the current OCR engine instance. |
| com.itextpdf.pdf2data.template.Template | getTemplate() | Gets the current template instance. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
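
The InputStream overloads of extract and check take a stream that the caller has opened. Whether the extractor closes that stream is not stated on this page, so the following minimal sketch scopes it on the caller's side with try-with-resources. `StreamExtraction`, `StreamRecognizer` and `recognizeFile` are hypothetical names introduced here. The interface stands in for a method shaped like extract(InputStream) so that the block compiles without the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of pdf2Data.
class StreamExtraction {

    /** Stand-in for a method shaped like extract(InputStream) or check(InputStream). */
    @FunctionalInterface
    interface StreamRecognizer<R> {
        R recognize(InputStream in) throws IOException;
    }

    /** Opens the file, hands the stream to the recognizer, and always closes the stream. */
    static <R> R recognizeFile(Path file, StreamRecognizer<R> recognizer) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return recognizer.recognize(in);
        }
    }
}
```

With the library on the classpath, a lambda such as `in -> extractor.extract(in)` should fit the `StreamRecognizer` shape, since extract(InputStream) is documented as throwing IOException.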

Method Details
- create
  
  public static Pdf2DataExtractor create(File p2dFile) throws IOException
  
  Creates a Pdf2DataExtractor instance from a pdf2data template file. Note that the template should be processed.
  
  Parameters:
  
  p2dFile - the pdf2data template archive
  
  Returns:
  
  a Pdf2DataExtractor instance
  
  Throws:
  
  IOException - if any I/O exception occurs
  
  com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed archive
- create
  
  public static Pdf2DataExtractor create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) throws IOException
  
  Creates a Pdf2DataExtractor instance from a pdf2data template file with the provided OCR engine. Note that the template should be processed.
  
  Parameters:
  
  p2dFile - the pdf2data template archive
  
  ocrEngine - the OCR engine to use for recognitions that involve OCR. May be null if no such recognitions are used.
  
  Returns:
  
  a Pdf2DataExtractor instance
  
  Throws:
  
  IOException - if any I/O exception occurs
  
  com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed archive
- createFromTemplateContentJson
  
  public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream)
  
  Creates a Pdf2DataExtractor instance from a stream that contains pdf2data template content in JSON format. Note that the template should be processed.
  
  Parameters:
  
  templateContentJsonStream - the processed template content stream
  
  Returns:
  
  a Pdf2DataExtractor instance
  
  Throws:
  
  com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed stream
- createFromTemplateContentJson
  
  public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine)
  
  Creates a Pdf2DataExtractor instance from a stream that contains pdf2data template content in JSON format, with the provided OCR engine. Note that the template should be processed.
  
  Parameters:
  
  templateContentJsonStream - the processed template content stream
  
  ocrEngine - the OCR engine to use for recognitions that involve OCR. May be null if no such recognitions are used.
  
  Returns:
  
  a Pdf2DataExtractor instance
  
  Throws:
  
  com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed stream
- getTemplate
  
  public com.itextpdf.pdf2data.template.Template getTemplate()
  
  Gets the current template instance.
  
  Returns:
  
  the current template instance
- getOcrEngine
  
  public com.itextpdf.pdfocr.IOcrEngine getOcrEngine()
  
  Gets the current OCR engine instance.
  
  Returns:
  
  the current OCR engine instance
- extract
  
  public RecognitionResultHolder extract(File targetPDF) throws IOException
  
  Recognizes the PDF file.
  
  Parameters:
  
  targetPDF - the PDF file to recognize
  
  Returns:
  
  a RecognitionResultHolder instance
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- extract
  
  public RecognitionResultHolder extract(File targetFile, RecognitionProperties properties) throws IOException
  
  Recognizes the file.
  
  Parameters:
  
  targetFile - the file to recognize
  
  properties - a RecognitionProperties instance
  
  Returns:
  
  a RecognitionResultHolder instance
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- extract
  
  public RecognitionResultHolder extract(InputStream targetInputStream) throws IOException
  
  Recognizes the PDF file.
  
  Parameters:
  
  targetInputStream - an input stream of the PDF file to recognize
  
  Returns:
  
  a RecognitionResultHolder instance
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- extract
  
  public RecognitionResultHolder extract(InputStream targetInputStream, RecognitionProperties properties) throws IOException
  
  Recognizes the file.
  
  Parameters:
  
  targetInputStream - an input stream of the file to recognize
  
  properties - a RecognitionProperties instance
  
  Returns:
  
  a RecognitionResultHolder instance
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- check
  
  public Map<String,Integer> check(File targetPDF) throws IOException
  
  Recognizes the PDF file and returns the recognition result counts.
  
  Parameters:
  
  targetPDF - the PDF file to recognize
  
  Returns:
  
  a Map containing the recognition results as key-value pairs of strings and integers
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- check
  
  public Map<String,Integer> check(File targetFile, RecognitionProperties properties) throws IOException
  
  Recognizes the document and returns the recognition result counts.
  
  Parameters:
  
  targetFile - the file to recognize
  
  properties - a RecognitionProperties instance
  
  Returns:
  
  a Map containing the recognition results as key-value pairs of strings and integers
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- check
  
  public Map<String,Integer> check(InputStream targetInputStream) throws IOException
  
  Recognizes the PDF file and returns the recognition result counts.
  
  Parameters:
  
  targetInputStream - an input stream of the PDF file to recognize
  
  Returns:
  
  a Map containing the recognition results as key-value pairs of strings and integers
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
- check
  
  public Map<String,Integer> check(InputStream targetInputStream, RecognitionProperties properties) throws IOException
  
  Recognizes the document and returns the recognition result counts.
  
  Parameters:
  
  targetInputStream - an input stream of the file to recognize
  
  properties - a RecognitionProperties instance
  
  Returns:
  
  a Map containing the recognition results as key-value pairs of strings and integers
  
  Throws:
  
  IOException - if any I/O issue occurs
  
  com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
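
The check methods return a Map<String,Integer>. As a minimal sketch of consuming such a map, the helper below lists the keys whose count is zero or missing. It assumes that each key names something the template tries to recognize and that each value is the number of results found for it; this page implies that reading but does not spell it out. `CheckReport` and `keysWithoutResults` are hypothetical names, not part of pdf2Data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pdf2Data.
class CheckReport {

    /** Returns, in alphabetical order, the keys whose count is null or zero. */
    static List<String> keysWithoutResults(Map<String, Integer> counts) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer count = entry.getValue();
            if (count == null || count == 0) {
                missing.add(entry.getKey());
            }
        }
        // Sort so the report does not depend on the map's iteration order.
        Collections.sort(missing);
        return missing;
    }
}
```

A caller might pass the result of `extractor.check(file)` to this helper to see which parts of a template found nothing in a given document, under the assumption about keys and values stated above.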
