Extracting Tables from a PDF with AWS Textract, Using Only Python

Prologue

Optical Character Recognition (OCR) is a widely used technology for extracting text from documents, whether they are scanned images, PDFs, or pictures. AWS Textract, a fully managed machine learning service, makes it easier to extract text, forms, and tables from a variety of document types, including PDFs. In this article, we'll walk you through the process of using AWS Textract to perform OCR on a PDF file and pull out its tables.

What is AWS Textract?

AWS Textract is an intelligent document processing service that automatically extracts text and data from scanned documents. It uses machine learning to identify and extract text, tables, forms, and even hand-written text from a document.

As mentioned above, Textract can extract tables. That sounds simple, and it mostly is, but there is a catch. By default, Textract extracts all of the text in the document, so it first has to work out which part of the layout contains a table and then extract the cells from that region. The diagram below may give you a better understanding.
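To make the idea concrete, here is a minimal sketch of how a Textract response represents a table: a flat list of blocks linked by CHILD relationships, going from TABLE to CELL to WORD. The sample blocks are hand-written for illustration (the IDs and text are made up, and describe and child_ids are my own helper names). Real responses carry many more fields, such as geometry and confidence.

```python
# Hand-written sample in the shape of a Textract response (IDs and text are made up).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Unit"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Price"},
]


def child_ids(block):
    """Return the IDs listed in a block's CHILD relationships, in order."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def describe(blocks):
    """Build indented lines showing the TABLE -> CELL -> WORD hierarchy."""
    by_id = {block["Id"]: block for block in blocks}
    lines = []

    def walk(block, depth):
        label = block["BlockType"]
        if "Text" in block:
            label += f" {block['Text']!r}"
        lines.append("  " * depth + label)
        for cid in child_ids(block):
            walk(by_id[cid], depth + 1)

    for block in blocks:
        if block["BlockType"] == "TABLE":
            walk(block, 0)
    return lines


print("\n".join(describe(sample_blocks)))
```

The functions later in this article walk the same hierarchy, only on real responses instead of this toy list.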

Prerequisites

Before diving into the OCR process, you’ll need to have the following in place:

AWS Account: If you don’t already have one, sign up for an account at AWS.
AWS CLI or SDK: You will need either the AWS Command Line Interface (CLI) or a Software Development Kit (SDK) to interact with Textract.
S3 Bucket: The asynchronous Textract API used in this guide reads documents from Amazon S3, so you’ll need to upload your PDF file to an S3 bucket.
Boto3: Boto3 is the AWS SDK for Python. We use it here to call Textract and S3.
Permissions: Ensure that you have the necessary IAM permissions to use Textract, S3, and related AWS services (for example, the AmazonTextractFullAccess and AmazonS3ReadOnlyAccess managed policies).
PDF document: You can create your own PDF file with Microsoft Word, Google Docs, etc. It should contain both paragraphs and a table. Here is my test PDF as an example.
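If you prefer something narrower than the full-access managed policies, the sketch below builds a smaller IAM policy document covering only the calls used in this guide. Treat it as a starting point under my assumptions, not a vetted policy: the bucket name is a placeholder and build_minimal_policy is a name I made up.

```python
import json


def build_minimal_policy(bucket_name):
    """Return an IAM policy document (as a dict) for the calls used in this guide."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Start the asynchronous analysis job and read its results.
                "Effect": "Allow",
                "Action": [
                    "textract:StartDocumentAnalysis",
                    "textract:GetDocumentAnalysis",
                ],
                "Resource": "*",
            },
            {
                # Let the caller read the PDF objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# "my-textract-bucket" is a placeholder bucket name.
print(json.dumps(build_minimal_policy("my-textract-bucket"), indent=2))
```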

Step-by-Step Guide to OCR a PDF using AWS Textract

Step 1: Upload Your PDF to an S3 Bucket

The asynchronous Textract operations used in this guide read the document from an S3 bucket. Follow these steps to upload your PDF:

Sign in to the AWS Management Console.
Navigate to the S3 service.
Create a new bucket or use an existing one.
Upload your PDF file to the bucket.
Optional: if you keep the file inside an S3 folder, include the folder path in the document key.
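If you would rather upload from Python than from the console, the steps above can be sketched like this. The helper names build_document_key and upload_pdf are my own, and the bucket, folder, and file names in the comment are placeholders. The only boto3 call involved is the S3 client's upload_file(Filename, Bucket, Key).

```python
import posixpath


def build_document_key(filename, prefix=""):
    """Join an optional S3 "folder" prefix and a file name into an object key."""
    prefix = prefix.strip("/")
    return posixpath.join(prefix, filename) if prefix else filename


def upload_pdf(s3_client, local_path, bucket, key):
    """Upload a local file to S3 and return the key it was stored under.

    s3_client is expected to be boto3.client('s3').
    """
    s3_client.upload_file(local_path, bucket, key)
    return key


# Example call (placeholder names), assuming s3 = boto3.client('s3'):
#   key = build_document_key("test.pdf", prefix="invoices/2024")
#   upload_pdf(s3, "test.pdf", "my-textract-bucket", key)
print(build_document_key("test.pdf", prefix="invoices/2024"))
```

The key returned here is the same value you later pass as document_key.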

Step 2: Invoke Textract for PDF Processing

Textract offers several APIs for extracting text. To get tables, we’ll use the StartDocumentAnalysis API, which runs asynchronously and returns structured blocks such as tables and cells in addition to the raw text.

Create a Python file. For this example, I'm going to name it extractor.py.

import boto3
import time

# Initialize Textract client
textract = boto3.client('textract')
s3 = boto3.client('s3')

In case you haven't installed boto3 yet, simply run pip install boto3 in your terminal or command prompt.

Create a function called start_layout_analysis.

def start_layout_analysis(document_bucket, document_key):
    # Start the Textract job for layout analysis
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': document_bucket,
                'Name': document_key
            }
        },
        FeatureTypes=['LAYOUT', 'TABLES']  # Analyze both the layout and the tables
    )
    return response['JobId']

Look at FeatureTypes in the code: this is where we choose which features Textract should analyze. If we only specify LAYOUT, the response will not contain the TABLE and CELL blocks we need, so TABLES must be included. Note that this function returns the JobId.

The function above is the core of the Textract extraction. But since it is an asynchronous call, we need to check whether the job has succeeded, is still in progress, or has failed. So we create another function to check the status. Note that this function waits 5 seconds between status checks.

def is_textract_job_complete(job_id):
    # Check the status of the Textract job
    response = textract.get_document_analysis(JobId=job_id)
    status = response['JobStatus']
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = textract.get_document_analysis(JobId=job_id)
        status = response['JobStatus']
    return status
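The loop above keeps polling for as long as the job reports IN_PROGRESS. As a hedged variation, the sketch below adds a timeout and raises an error when the job ends in anything other than SUCCEEDED (so FAILED and PARTIAL_SUCCESS both raise). The wait_for_job name, its parameters, and the default values are my own choices. The status lookup is passed in as a function so the loop can be tried without calling AWS.

```python
import time


def wait_for_job(get_status, poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until the job leaves IN_PROGRESS or the timeout passes."""
    waited = 0
    status = get_status()
    while status == "IN_PROGRESS":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Textract job still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {status}")
    return status


# With the real client, the lookup could be (assuming textract and job_id exist):
#   wait_for_job(lambda: textract.get_document_analysis(JobId=job_id)['JobStatus'])

# Offline demo with canned statuses and no real sleeping.
statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```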

After creating is_textract_job_complete, add another function to fetch the job's results.

def get_textract_job_results(job_id):
    # Fetch results from the completed Textract job
    pages = []
    response = textract.get_document_analysis(JobId=job_id)
    pages.append(response)

    # Results are paginated: follow NextToken to fetch the remaining chunks
    next_token = response.get('NextToken')
    while next_token:
        response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        pages.append(response)
        next_token = response.get('NextToken')

    return pages

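The function above returns a list of responses, one per chunk of paginated results. These chunks are not the same thing as the pages of your document. Each response does include DocumentMetadata, whose Pages field is the number of pages in the document. That number matters because Textract charges per page processed, so the more pages you have, the more it costs.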
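To see the difference between response chunks and document pages, here is a small sketch that summarizes a list of responses. The sample responses are hand-written and trimmed down to the fields used here, and summarize_responses is a helper name I made up.

```python
def summarize_responses(responses):
    """Count response chunks, document pages, and blocks per BlockType."""
    block_counts = {}
    for response in responses:
        for block in response.get("Blocks", []):
            kind = block["BlockType"]
            block_counts[kind] = block_counts.get(kind, 0) + 1
    # Every response repeats the same DocumentMetadata, so the first one is enough.
    document_pages = responses[0]["DocumentMetadata"]["Pages"] if responses else 0
    return {
        "response_chunks": len(responses),
        "document_pages": document_pages,
        "block_counts": block_counts,
    }


# Hand-written sample: a 1-page document whose results came back in 2 chunks.
sample_responses = [
    {"DocumentMetadata": {"Pages": 1},
     "Blocks": [{"BlockType": "PAGE"}, {"BlockType": "TABLE"}]},
    {"DocumentMetadata": {"Pages": 1},
     "Blocks": [{"BlockType": "CELL"}, {"BlockType": "WORD"}, {"BlockType": "WORD"}]},
]

print(summarize_responses(sample_responses))
```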

Now we need to add two functions to extract the tables. These two functions work together: extract_table_data finds each TABLE block and arranges its cells into rows and columns, and get_text_from_cell collects the words inside a cell.

def get_text_from_cell(cell, block_map):
    # Join the text of the WORD blocks that are children of a table cell
    words = []
    for relationship in cell.get('Relationships', []):
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                word_block = block_map.get(child_id)
                if word_block and word_block['BlockType'] == 'WORD':
                    words.append(word_block.get('Text', ''))
    return ' '.join(words)


def extract_table_data(pages):
    # Extract table data from Textract results.
    # Index every block by Id first: results are paginated, so a table's cells
    # and words may arrive in a different response than the TABLE block itself.
    block_map = {block['Id']: block for page in pages for block in page['Blocks']}

    tables_data = []
    for block in block_map.values():
        if block['BlockType'] != 'TABLE':
            continue
        rows = {}

        # Get each cell in the table
        for relationship in block.get('Relationships', []):
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    cell = block_map.get(child_id)
                    if cell and cell['BlockType'] == 'CELL':
                        # Store cell text under its row and column index
                        row = rows.setdefault(cell['RowIndex'], {})
                        row[cell['ColumnIndex']] = get_text_from_cell(cell, block_map)

        # Create a sorted table by row and column indices
        table = [
            [rows[row_index][col_index] for col_index in sorted(rows[row_index])]
            for row_index in sorted(rows)
        ]
        tables_data.append(table)
    return tables_data
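Once you have the extracted tables as lists of rows, a natural next step is to save them somewhere. The sketch below turns one table into CSV text with Python's built-in csv module. The table_to_csv name and the sample table are my own. The input is assumed to have the same shape as one entry of tables_data: a list of rows, each a list of cell strings.

```python
import csv
import io


def table_to_csv(table):
    """Convert one table (a list of rows of cell strings) into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Made-up sample table in the same shape as one entry of tables_data.
sample_table = [
    ["Name", "Price"],
    ["Pen, blue", "1.50"],
]

print(table_to_csv(sample_table))
```

The csv module takes care of quoting, so a cell that contains a comma stays in a single column.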

Now put all of these functions into one file, call them in order at the bottom of the file, and run it with a single command.
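The snippets so far only define functions, so something still has to call them in order. Here is one way to wire the steps together, as a sketch. The run_extraction name is my own, and the four steps are passed in as arguments so the flow can be tried with stand-ins before pointing it at AWS. The bucket and key in the comment are placeholders.

```python
def run_extraction(bucket, key, start_job, wait_for_job, fetch_results, extract_tables):
    """Run the start -> wait -> fetch -> extract steps and return the tables."""
    job_id = start_job(bucket, key)
    status = wait_for_job(job_id)
    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")
    return extract_tables(fetch_results(job_id))


# At the bottom of extractor.py, the call could look like this
# (placeholder bucket and key, functions as defined earlier in the article):
#   tables = run_extraction(
#       "my-textract-bucket", "invoices/test.pdf",
#       start_layout_analysis, is_textract_job_complete,
#       get_textract_job_results, extract_table_data,
#   )
#   for table in tables:
#       for row in table:
#           print(row)
```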

Run command

python3 extractor.py

Happy coding, and keep exploring! 🍻