ReceiptNinja: Using Google Gemini to Extract Data from Retail Receipts

Building ReceiptNinja: An Intelligent Receipt Processing Demo App

In today's digital-first world, managing receipts, whether physical or digital, can be a daunting task for individuals and businesses alike. Manual data entry for expense tracking or finance management is time-consuming, error-prone, and tedious. Enter ReceiptNinja, an intelligent demo application designed to automate this process by extracting key fields from various types of receipts such as images, PDFs, and even physical copies.

In this article, we'll guide you step by step through building ReceiptNinja using Google Gemini, for its language and reasoning capabilities, and Doctr, an open-source optical character recognition (OCR) model. The application extracts and categorizes essential information, including store name, date of purchase, total amount, item list, tax details, payment method, and discounts.

By the end of this guide, you'll have a fully functional demo app that can be easily integrated into personal finance tools or business expense management systems. Whether you're a developer looking to explore AI-driven applications or a business professional seeking efficient receipt management solutions, this tutorial will give you the practical tools and insights to get started.

OCR Is Easy, But Field Extraction Was a Challenge Before LLMs

Optical Character Recognition (OCR) technology has long been used to convert scanned images, PDFs, and other documents into machine-readable text. With modern open-source solutions like Doctr, OCR has become easier than ever, allowing developers to quickly extract raw text from various sources with minimal setup.

However, extracting relevant fields from receipts, such as the store name, date of purchase, total amount, or even itemized lists, presents a much greater challenge. Before the advent of Large Language Models (LLMs) and Generative AI (GenAI), solving this problem required custom solutions that weren't scalable. Let's explore why.

Traditional Approaches: Why They Fell Short

1. Custom Models for Specific Receipt Types

One approach developers took was to train custom machine learning models for specific types of receipts. This could involve building a model that recognizes the structure and layout of a particular format. For example, a grocery receipt might have a predictable structure with the store name at the top, followed by item lists and a total at the bottom. However, this approach required training separate models for each type of receipt, as variations in format between retailers, regions, and even receipt generations made it impossible to generalize.

Training such models for all possible receipts is expensive and time-consuming, and it requires a constant influx of data to keep the models up to date.

2. Template-Based Solutions

Another approach was to use template-based matching. Developers would build static templates for various receipts, mapping out the positions of the store name, item list, and totals. While this works for well-defined formats, it fails when the layout changes even slightly, whether because of a different printer, a new version of the receipt format, or an unfamiliar store.

The need to manually create and maintain templates for every possible variation of receipt layout made this solution non-scalable and fragile.
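
To make the fragility concrete, here is a minimal sketch of template-based matching. The template, the function name `extract_with_template`, and both sample receipts are made up for illustration; they are not taken from the ReceiptNinja code.

```python
import re

# Hypothetical template for one retailer's layout: the store name is on the
# first line and the total is on a line of the form "TOTAL <amount>".
TEMPLATE = {
    "store_name_line": 0,
    "total_pattern": re.compile(r"^TOTAL\s+(\d+\.\d{2})$"),
}


def extract_with_template(lines):
    """Extract fields by fixed position and fixed wording."""
    store_name = lines[TEMPLATE["store_name_line"]].strip()
    total = None
    for line in lines:
        match = TEMPLATE["total_pattern"].match(line.strip())
        if match:
            total = float(match.group(1))
    return {"store_name": store_name, "total_amount": total}


# Layout the template was written for.
receipt_a = ["FRESH MART", "Milk 2.50", "Bread 1.75", "TOTAL 4.25"]
# Same store, slightly different layout and wording.
receipt_b = ["Thank you for shopping!", "FRESH MART", "Milk 2.50", "Amount due: 4.25"]

print(extract_with_template(receipt_a))
print(extract_with_template(receipt_b))
```

The first receipt is parsed correctly, while the second yields the greeting line as the store name and no total at all, even though a human reader sees the same information in both.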

Enter GenAI: A Scalable Solution

Thanks to advances in Generative AI (GenAI) and Large Language Models (LLMs) like Google Gemini, we now have a powerful alternative for handling the variability and complexity of receipts. LLMs are not constrained by rigid formats or pre-defined templates. Instead, they understand context and semantics, enabling them to extract key fields across a wide variety of receipt formats with high accuracy.

Let's dive into the core components of building this application.

Required Libraries:

Pillow: For image processing.
PyMuPDF (fitz): For handling PDFs.
Doctr: For OCR.
Google Generative AI: For field extraction.
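
Before going further, it can help to confirm that these libraries are importable. The sketch below is an illustration rather than part of the original project: `check_dependencies` is a hypothetical helper, and the import names (`PIL`, `fitz`, `doctr`, `google.generativeai`) are the ones these packages are commonly imported under, so treat them as assumptions for your environment.

```python
from importlib.util import find_spec

# Assumed import names: PIL (Pillow), fitz (PyMuPDF), doctr (Doctr),
# google.generativeai (Google Generative AI SDK).
REQUIRED_MODULES = ["PIL", "fitz", "doctr", "google.generativeai"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    status = {}
    for name in modules:
        try:
            status[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. "google") raises instead of
            # returning None, so treat that as "not installed".
            status[name] = False
    return status


if __name__ == "__main__":
    for module, available in check_dependencies().items():
        print(f"{module}: {'ok' if available else 'missing'}")
```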

Step 2: Using Doctr for OCR

The first step in processing a receipt is extracting the raw text using OCR. We'll use the Doctr library for this task. The ImageProcessor class includes methods to process both image and PDF files, convert them to text, and enhance image quality.

Image and PDF Processing

Images are processed using standard libraries such as Pillow, and methods are included to enhance sharpness and adjust orientation.
PDFs are handled using PyMuPDF to convert pages into images, which are then processed like any other image.
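
The sharpening and orientation steps are not shown in this article, so the following is only a sketch of one way they could be done with Pillow. The function name `enhance_image` and the default `sharpness_factor` are assumptions, not the ImageProcessor implementation.

```python
from PIL import Image, ImageEnhance, ImageOps


def enhance_image(img, sharpness_factor=2.0):
    """Return an upright, sharpened copy of a receipt image.

    Hypothetical helper: the factor of 2.0 is an illustrative default.
    """
    # Apply the EXIF orientation tag, if present, so the pixels are upright.
    upright = ImageOps.exif_transpose(img)
    # A factor of 1.0 leaves the image unchanged; values above 1.0 sharpen it.
    return ImageEnhance.Sharpness(upright).enhance(sharpness_factor)


if __name__ == "__main__":
    sample = Image.new("RGB", (40, 20), "white")
    print(enhance_image(sample).size)
```

Phone photos often carry an EXIF orientation tag instead of rotated pixels, which is why the orientation fix comes before sharpening.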

Here's an excerpt from the ImageProcessor class that handles image and PDF processing:

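# Imports used by this excerpt
from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def process_image(self, image_path):
    # Load the original image
    img_original = Image.open(image_path)

    # Use Doctr OCR to extract text; assume_straight_pages=False lets the
    # detector handle rotated or skewed receipts
    model = ocr_predictor('db_resnet50', pretrained=True, assume_straight_pages=False)
    doc = DocumentFile.from_images([image_path])
    result = model(doc)

    # Flatten the page > block > line > word hierarchy into a single string
    ocr_text = " ".join([word.value for page in result.pages for block in page.blocks for line in block.lines for word in line.words])
    return img_original, ocr_text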

Converting PDFs to Images:
For PDFs, each page is converted into an image, processed by OCR, and then stitched together if necessary.

# Imports used by this excerpt
import fitz  # PyMuPDF
from PIL import Image

def convert_pdf_to_images(self, file_path, dpi=300):
    pdf_document = fitz.open(file_path)
    images = []
    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        # PDF pages are measured in points (72 per inch), so scale by dpi / 72
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72))
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(img)
    return images
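
The stitching step mentioned above is not shown in the excerpt. As a rough sketch, assuming the pages should simply be stacked top to bottom on a white canvas, it could look like this with Pillow. The function name `stitch_pages_vertically` is hypothetical.

```python
from PIL import Image


def stitch_pages_vertically(pages, background="white"):
    """Stack page images top to bottom into a single RGB image."""
    if not pages:
        raise ValueError("pages must contain at least one image")
    # The canvas is as wide as the widest page and as tall as all pages combined.
    width = max(page.width for page in pages)
    height = sum(page.height for page in pages)
    canvas = Image.new("RGB", (width, height), background)
    offset = 0
    for page in pages:
        canvas.paste(page, (0, offset))
        offset += page.height
    return canvas


if __name__ == "__main__":
    demo_pages = [Image.new("RGB", (100, 50), "red"), Image.new("RGB", (80, 30), "blue")]
    print(stitch_pages_vertically(demo_pages).size)
```

Narrower pages are left-aligned, and the remaining width is filled with the background color.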

Once the OCR text is extracted, the next challenge is making sense of the data. This is where Google Gemini comes in.

Step 3: Applying Google Gemini for Field Extraction

The OCR text is raw and unstructured, but using Google Gemini we can extract key fields such as:

Store Name
Total Amount
Date of Purchase
Store Address
Currency
Payment Method

Using the Gemini Model

We feed the OCR text along with an initial prompt into the Google Gemini model, which then processes it and extracts the relevant fields in a structured format.
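
As a rough illustration of that call, the sketch below uses the google-generativeai Python SDK. The helper names `build_gemini_model` and `extract_fields`, and the model name "gemini-1.5-flash", are assumptions for this example; the article does not state which Gemini model the app uses.

```python
def build_gemini_model(api_key, model_name="gemini-1.5-flash"):
    """Create a Gemini model client. The model name is an assumption."""
    # Imported lazily so the rest of this sketch works without the SDK installed.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    return genai.GenerativeModel(model_name)


def extract_fields(model, prompt, ocr_text, image):
    """Send the prompt, the OCR text, and the receipt image to the model.

    `image` is expected to be a PIL image; the prompt is assumed to end with
    an "OCR text:" marker, so the OCR output is appended directly to it.
    """
    response = model.generate_content([prompt + ocr_text, image])
    return response.text
```

Passing the model in as a parameter keeps `extract_fields` easy to test with a stand-in object, and lets the image and the OCR text travel in the same request.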

It took us a while to get the prompt right. Here is the final prompt. We not only specify the task to the model but also provide a worked example:

prompt = '''1. **Input:** You will receive an image containing details from a shopping or food store bill. The content may vary in format and may be in different languages. Use your own internal vision capabilities to accurately extract the relevant text directly from the image, without relying on external OCR libraries like Tesseract or any other Python-based tools. Additionally, an OCR (Optical Character Recognition) output will be provided as a reference. The OCR text may contain errors or inaccuracies, so your primary task is to use your own vision capabilities to extract the correct details directly from the image.

2. **Objective:** Your task is to extract specific details from the bill and return them as a formatted JSON object. Use the exact key names provided below and ensure that all data is translated to English. If any detail is missing, unclear, or unreadable, follow the error handling instructions outlined below.

3. **Extraction Rules:**
- **Store Name ("store_name")**: Extract the full name of the store from which the bill originates. Ensure the name is accurate and complete. (Always present in image)
- **Store Address ("store_address")**: Extract the full address of the store, including street, city, postal code, and country if available. (Always present in image)
- **Total Amount ("total_amount")**: Extract the total amount charged on the bill. Interpret the currency based on the image and store it separately.
- **Currency ("currency")**: Extract the currency of the total amount, which may be in various formats such as symbols (e.g., $, €, RM) or abbreviations (e.g., USD, EUR, MYR).
- **Bill Date ("bill_date")**: Extract the date of the transaction. Format this as **YYYY-MM-DD**. If the time is also present, include it in the format **YYYY-MM-DD HH:MM**.
- **Payment Method ("payment_method")**: Extract the method of payment used (e.g., cash, credit card, debit card). If multiple methods are listed, extract every method (e.g., cash, credit card, debit card, coupon) that is used for the transaction.

4. **Error Handling:**
- If any detail cannot be extracted or is unclear, prefix the value of the relevant field with "ERROR:" and include an explanation of the issue.

Example:
```json
{
  "store_name": "ERROR: Store name not visible…(more details)",
  "store_address": "123 Example Street, Example City, EX 12345, USA",
  "total_amount": 92.50,
  "currency": "USD",
  "bill_date": "2023-09-01 14:30",
  "payment_method": ["coupon", "Credit Card"]
}
```

5. **OCR Text for Reference:**
- Use the OCR text as a supplementary reference only. If you cannot confidently extract the information from the image alone, you may use the OCR text as a hint to guide you, but always prioritize your own extraction over the OCR data.
- OCR text:
'''
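
Because the prompt asks for a JSON object and marks unreadable fields with an "ERROR:" prefix, the reply needs a small amount of post-processing. The sketch below shows one way to do that. `parse_gemini_response` is a hypothetical helper, and it assumes the model may wrap its JSON in a Markdown json code fence, as the example in the prompt does.

```python
import json
import re

# Built programmatically to avoid writing a literal Markdown fence in the code.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_gemini_response(raw_text):
    """Parse the model's reply and separate fields flagged with "ERROR:".

    Returns (clean_fields, error_fields). Raises json.JSONDecodeError if the
    reply is not valid JSON.
    """
    match = FENCED_JSON.search(raw_text)
    payload = match.group(1) if match else raw_text
    fields = json.loads(payload)
    errors = {
        key: value
        for key, value in fields.items()
        if isinstance(value, str) and value.startswith("ERROR:")
    }
    clean = {key: value for key, value in fields.items() if key not in errors}
    return clean, errors


if __name__ == "__main__":
    reply = '{"store_name": "ERROR: Store name not visible", "total_amount": 92.5}'
    print(parse_gemini_response(reply))
```

Keeping the flagged fields in a separate dictionary lets the app show them to the user for manual correction instead of silently storing an error string as a store name or address.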

The full code can be found here: https://github.com/sankit1/receipt-ninja
