Class Pdf2DataExtractor

java.lang.Object
com.itextpdf.pdf2data.Pdf2DataExtractor

public class Pdf2DataExtractor extends Object
Pdf2DataExtractor extracts data from PDF files and images using a pdf2data template.

To create an instance of Pdf2DataExtractor for extracting data from a PDF file, use create(File).

To create an instance of Pdf2DataExtractor for extracting data from an image, use create(File, OcrWithPostProcessingEngine).

To extract data from a PDF file, use the extract(File) method.

To extract data from an image, use the extract(File, RecognitionProperties) method, with the file type specified via the RecognitionProperties instance.
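
The description above distinguishes PDF input from image input. As a minimal sketch of how a caller might decide which path applies, the helper below checks whether the leading bytes of a file match the "%PDF-" header. The class InputKindSketch and the method startsWithPdfHeader are hypothetical and not part of pdf2data; the sketch assumes the caller has already read the first bytes of the file, and it inspects only the first five bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of pdf2data: decides whether input looks like a PDF.
final class InputKindSketch {

    // A PDF file begins with the ASCII characters "%PDF-" followed by a version.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the "%PDF-" header. */
    static boolean startsWithPdfHeader(byte[] leadingBytes) {
        if (leadingBytes == null || leadingBytes.length < PDF_MAGIC.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(leadingBytes, PDF_MAGIC.length);
        return Arrays.equals(head, PDF_MAGIC);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] pngHead = {(byte) 0x89, 'P', 'N', 'G'};

        // PDF header found: the caller would use extract(File).
        System.out.println(startsWithPdfHeader(pdfHead)); // true
        // No PDF header: the caller would use extract(File, RecognitionProperties).
        System.out.println(startsWithPdfHeader(pngHead)); // false
    }
}
```

A header check of this kind only selects the overload; the file type for image input still has to be supplied through RecognitionProperties, as described above.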

  • Method Details

    • create

      public static Pdf2DataExtractor create (File p2dFile) throws IOException
      Creates an instance of Pdf2DataExtractor from a pdf2data template file. Note that the template should be processed.
      Parameters:
      p2dFile - pdf2data template archive
      Returns:
      a Pdf2DataExtractor instance
      Throws:
      IOException - if any I/O exception occurs
      com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed archive
    • create

      public static Pdf2DataExtractor create (File p2dFile, OcrWithPostProcessingEngine ocrEngine) throws IOException
      Creates an instance of Pdf2DataExtractor from a pdf2data template file with the provided OCR engine. Note that the template should be processed.
      Parameters:
      p2dFile - pdf2data template archive
      ocrEngine - OCR engine to be used for recognitions that involve OCR. May be null if no recognition involving OCR will be performed.
      Returns:
      a Pdf2DataExtractor instance
      Throws:
      IOException - if any I/O exception occurs
      com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed archive
    • createFromTemplateContentJson

      public static Pdf2DataExtractor createFromTemplateContentJson (InputStream templateContentJsonStream)
      Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format. Note that the template should be processed.
      Parameters:
      templateContentJsonStream - processed template content stream
      Returns:
      a Pdf2DataExtractor instance
      Throws:
      com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed stream
    • createFromTemplateContentJson

      public static Pdf2DataExtractor createFromTemplateContentJson (InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine)
      Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format, with the provided OCR engine. Note that the template should be processed.
      Parameters:
      templateContentJsonStream - processed template content stream
      ocrEngine - OCR engine to be used for recognitions that involve OCR. May be null if no recognition involving OCR will be performed.
      Returns:
      a Pdf2DataExtractor instance
      Throws:
      com.itextpdf.pdf2data.exceptions.TemplateConversionException - if the template cannot be extracted from the passed stream
    • getTemplate

      public com.itextpdf.pdf2data.template.Template getTemplate()
      Gets the current template instance.
      Returns:
      the current template instance
    • getOcrEngine

      public com.itextpdf.pdfocr.IOcrEngine getOcrEngine()
      Gets the current OCR engine instance.
      Returns:
      the current OCR engine instance
    • extract

      public RecognitionResultHolder extract (File targetPDF) throws IOException
      Recognizes the PDF file.
      Parameters:
      targetPDF - PDF file for recognition
      Returns:
      RecognitionResultHolder instance
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • extract

      public RecognitionResultHolder extract (File targetFile, RecognitionProperties properties) throws IOException
      Recognizes the file.
      Parameters:
      targetFile - file for recognition
      properties - a RecognitionProperties instance
      Returns:
      RecognitionResultHolder instance
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • extract

      public RecognitionResultHolder extract (InputStream targetInputStream) throws IOException
      Recognizes the PDF file.
      Parameters:
      targetInputStream - input stream of the PDF file for recognition
      Returns:
      RecognitionResultHolder instance
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • extract

      public RecognitionResultHolder extract (InputStream targetInputStream, RecognitionProperties properties) throws IOException
      Recognizes the file.
      Parameters:
      targetInputStream - input stream of the file for recognition
      properties - a RecognitionProperties instance
      Returns:
      RecognitionResultHolder instance
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • check

      public Map<String,Integer> check (File targetPDF) throws IOException
      Recognizes the PDF file and returns the number of recognition results.
      Parameters:
      targetPDF - PDF file for recognition
      Returns:
      A Map containing the recognition results as key-value pairs of strings and integers.
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • check

      public Map<String,Integer> check (File targetFile, RecognitionProperties properties) throws IOException
      Recognizes the document and returns the number of recognition results.
      Parameters:
      targetFile - file for recognition
      properties - a RecognitionProperties instance
      Returns:
      A Map containing the recognition results as key-value pairs of strings and integers.
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • check

      public Map<String,Integer> check (InputStream targetInputStream) throws IOException
      Recognizes the PDF file and returns the number of recognition results.
      Parameters:
      targetInputStream - input stream of the PDF file for recognition
      Returns:
      A Map containing the recognition results as key-value pairs of strings and integers.
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
    • check

      public Map<String,Integer> check (InputStream targetInputStream, RecognitionProperties properties) throws IOException
      Recognizes the document and returns the number of recognition results.
      Parameters:
      targetInputStream - input stream of the file for recognition
      properties - a RecognitionProperties instance
      Returns:
      A Map containing the recognition results as key-value pairs of strings and integers.
      Throws:
      IOException - if any I/O issue occurs
      com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and text extraction is not permitted
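
The check overloads return a Map<String,Integer>. As a rough sketch of how such a map might be summarized, the helper below totals the counts and lists the keys with no results. The class CheckResultSketch, its methods, and the sample keys are hypothetical; the sketch assumes each key identifies something the template looks for and each value is the number of results found for it, which this page does not state explicitly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pdf2data: summarizes a Map<String, Integer>
// shaped like the one returned by the check(...) overloads.
final class CheckResultSketch {

    /** Sums all counts in the map, treating null values as zero. */
    static int totalResults(Map<String, Integer> counts) {
        int total = 0;
        for (Integer count : counts.values()) {
            if (count != null) {
                total += count;
            }
        }
        return total;
    }

    /** Returns the keys whose count is null or zero, in map iteration order. */
    static List<String> keysWithoutResults(Map<String, Integer> counts) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer count = entry.getValue();
            if (count == null || count == 0) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample data standing in for a real check(...) result; keys are made up.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("invoiceNumber", 1);
        counts.put("lineItems", 3);
        counts.put("dueDate", 0);

        System.out.println(totalResults(counts));       // 4
        System.out.println(keysWithoutResults(counts)); // [dueDate]
    }
}
```

In real use the map would come from a call such as check(File) on a Pdf2DataExtractor instance rather than from hand-built sample data.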