Spatial Text Rendering: Pushing the Limits of Spatial Understanding in LLMs

Vision-Language Models (VLMs) that combine image understanding with text processing have become increasingly common in today’s AI landscape. With their built-in vision encoders, these models can “see” and interpret documents in ways that might make alternative approaches unnecessary. The method we’ll be discussing, Spatial Text Rendering, was developed during a transition period when VLMs were still emerging and not yet robust enough for production-grade financial document processing. We needed another way to let LLMs “see”.

Financial documents present unique challenges: they contain structured tabular data, vary wildly in format across institutions (PDFs, Excel files, scanned documents, etc.), often span hundreds or even thousands of pages, and, in our case in the MEA region, frequently mix Arabic and English within the same sentence. Arabic, moreover, has historically been underserved by machine learning research.

Addressing these challenges is critical because inaccurate or incomplete document processing can lead to flawed financial analyses, regulatory compliance issues, and poor business decisions that impact millions of dollars in investments. To work within all these constraints, we needed a solution that was:

  1. Immediately deployable — We couldn’t wait for VLMs to mature

  2. Highly accurate — Financial data demands precision down to the decimal

  3. Language-agnostic — Particularly robust with Arabic, English, and mixed right-to-left and left-to-right scripts

  4. Computationally efficient — Capable of processing 100+ page documents without excessive costs

This meant we had to work on a new approach: what if we could teach text-only Large Language Models (LLMs) to “see” documents by translating visual structure into a format they could understand?

The resulting method, called Spatial Text Rendering (STR), is our answer to this challenge. It bridges the gap between visually complex documents and text-only LLMs by preserving the crucial spatial information that gives financial documents their meaning.

In this article, I’ll share how we developed this approach at Abwab.ai, how it works technically, and enough detail that others facing similar challenges can draw inspiration from it.

The Document Understanding Challenge

Why Document Understanding Is Hard

Document understanding represents one of the most important challenges in artificial intelligence. Unlike plain text, documents combine content with a spatial structure that conveys meaning. A bank statement isn’t just a collection of text and numbers — it’s a carefully designed layout where the position of information matters as much as the information itself.

For text-only Large Language Models (LLMs), this presents a fundamental limitation: they’re essentially “blind” to the visual and spatial aspects of documents. When fed raw OCR text from a document, these models lose critical contextual information about:

  1. Column and row relationships (particularly important in tables)

  2. Hierarchical structures (titles, headers, etc)

  3. Spatial relationships that imply semantic connections (numbers vertically aligned to a column are implicitly labeled by this column)

  4. Visual emphasis (boxes, shading, font sizes) that conveys importance

This “blindness” is particularly problematic for financial documents (like bank statements, financial statements, etc) which rely heavily on tabular layouts to organize transaction data, balances, and account information.

An Example: The Complexity of Bank Statements

Bank statements are a perfect example of the challenge of document understanding, for several reasons:

  1. Layout Diversity: Every bank designs its statements differently, with its own header placements, table formats, and information groupings. Some are clean, parsable PDFs, while others are skewed scans because the institution doesn’t yet produce its statements digitally.

  2. Multi-page Structure: Statements frequently span multiple pages, with running balances and summaries that connect across pages or even rows that span two pages.

  3. Correctness: Missing a single comma in a single number can invalidate the entire bank statement.

Example public bank statements found with a simple Google Images Search. The diversity in format is clear. Credit: scribd.com, coursehero.com, aparat.com

Disclaimer: This article contains only publicly available documents obtained through Google Images Search. No customer data or private information was used in the creation of this content.

Limitations of Traditional Approaches

Before developing our solution, we evaluated several traditional approaches to document understanding:

  • Rule-Based Systems: Creating fixed parsing algorithms for every document type quickly becomes unsustainable given the diversity of formats. These systems break when layouts change even slightly. There are always more templates and layouts than you can ever imagine…

  • Template Matching: While effective for highly standardized documents, templates struggle with variations between statement periods or when banks update their formats.

  • Simple OCR + LLM: Simply extracting text via OCR and feeding it to an LLM loses all structural information, leading to confusion about which numbers represent what values.

The Vision Gap

Text-only LLMs, despite their impressive reasoning capabilities, ultimately face a fundamental limitation: they cannot “see” documents. While they can understand text content very well, they miss the visual structure that humans naturally perceive when looking at a document. Documents are fundamentally made for humans and not machines.

This gap between human document understanding (which is inherently visual) and LLM document processing (which was traditionally limited to text) presented the core challenge we needed to solve: how could we give text-only models the ability to “see” document structure without relying on vision encoders that weren’t yet production-ready for our specific use case?

This challenge led us to develop Spatial Text Rendering, a way to bridge between the visual world of documents and the text-only world of traditional LLMs.

Spatial Text Rendering (STR)

The Evolution to a Structural Approach

Our journey toward an effective document understanding solution began with a fundamental insight: we needed to capture not just the text in a document, but its underlying structure. This realization came from observing how humans process documents: we don’t just read the words, we intuitively understand the relationships between elements based on their spatial arrangement.

The inspiration for our approach came from an interesting observation about LLMs themselves. These models are trained on billions of documents that include HTML, Markdown, JSON, CSV, Mermaid diagrams, and numerous other formats that encode spatial representations of data in structured ways. LLMs have become quite adept at understanding these structured formats because they compress unstructured information into meaningful patterns.

Example HTML and Markdown (text formats) with their respective side-by-side rendered preview (graphical representation).

For example, a simple HTML table:

<table>
  <tr>
    <th>Date</th>
    <th>Amount</th>
    <th>Balance</th>
  </tr>
  <tr>
    <td>2024/04/23</td>
    <td>14,200.00</td>
    <td>14,251.00</td>
  </tr>
</table>

This HTML doesn’t just represent text, it encodes the structural relationship between data elements. LLMs understand these relationships because they’ve seen millions of similar patterns during training.

This HTML translates to a rendered table where each number sits under a column name that tells you what it means:
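
In plain text, and using the same skeleton characters we’ll rely on later (-, |, and +), the same table looks roughly like this (an illustrative sketch, not the output of any particular renderer):

+------------+-----------+-----------+
| Date       | Amount    | Balance   |
+------------+-----------+-----------+
| 2024/04/23 | 14,200.00 | 14,251.00 |
+------------+-----------+-----------+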

Furthermore, research suggests that LLMs can build spatial models to a certain extent. A study by Yamada et al., “Evaluating Spatial Understanding of Large Language Models”, shows that LLMs can navigate spaces and build simple spatial maps. LLMs seem to have implicitly acquired spatial understanding capabilities from their training data.

Lastly, people who are blind or visually impaired use various assistive technologies and adaptive techniques to understand document structure and spatial relationships, including screen readers, tactile graphics, and specialized formats like Braille. These approaches demonstrate that document understanding isn’t exclusively visual but can be effectively achieved through alternative perceptual channels.

Based on all these insights, we realized we could leverage LLMs’ existing structural understanding by creating a text-based format that preserves a document’s spatial layout.

Spatial Text Rendering (STR) builds on this foundation and adds critical improvements by extracting and preserving the document’s structural skeleton. Let’s examine how this process works, starting with the image processing pipeline that makes it possible.

Uncovering Document Structure

Before diving into the technical details, it’s important to understand the high-level goal of our image processing pipeline. We’re essentially trying to:

  1. Straighten the document so that tables and lines are properly aligned

  2. Extract the text and its position using OCR

  3. Identify the underlying skeleton (lines, tables, sections) of the document

  4. Create a grid representation based on the structure found in steps 2 and 3

  5. Project the OCR text to this grid according to its spatial position

  6. Generate a compact text rendering that preserves this structural information while reducing LLM token usage

This process transforms a visual document into a structured text representation that both humans and LLMs can understand. Each step addresses a specific challenge in preserving the document’s spatial organization while optimizing for efficiency.

Original document (left) — Spatial Text Rendering (right)

Now, let’s explore each component of this pipeline in detail.

1. Document Preprocessing and Deskewing

The first step in our pipeline addresses a common issue with scanned documents: skew. When documents are scanned or photographed, they’re rarely perfectly aligned with the horizontal and vertical axes. This misalignment, while minor to human eyes, can significantly confuse algorithms trying to extract structured information.

Our deskewing process is simple and works by:

  • Cleaning the document from artifacts

  • Detecting edges in the document image

  • Identifying dominant line directions using techniques like the good old Hough transforms

  • Calculating the rotation angles needed for correction

  • Applying an affine transformation to straighten the document

Skewed input: slightly rotated and distorted (left) — Corrected document (right)

This critical preprocessing step ensures that the document’s horizontal lines are truly horizontal and its vertical lines truly vertical. In other words, the document must be aligned with our reference space: the grid we’ll discuss later. To picture it, imagine that the grid is aligned with your computer screen’s borders.
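
To make this concrete, here’s a minimal deskewing sketch using OpenCV and a Hough transform; the thresholds and the near-horizontal filter are illustrative assumptions, not our production values:

import cv2
import numpy as np

def deskew(image: np.ndarray) -> np.ndarray:
    """Estimate the dominant rotation angle with a Hough transform and undo it."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=image.shape[1] // 4, maxLineGap=20)
    if lines is None:
        return image  # nothing detected, keep the page as-is

    # Keep near-horizontal lines and take the median of their angles.
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 30:  # ignore vertical-ish lines
            angles.append(angle)
    if not angles:
        return image

    skew = float(np.median(angles))
    h, w = image.shape[:2]
    # Affine rotation around the page center to straighten the document.
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
    return cv2.warpAffine(image, rotation, (w, h),
                          flags=cv2.INTER_LINEAR, borderValue=(255, 255, 255))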

2. Optical Character Recognition (OCR)

With the document properly aligned, we proceed to OCR but rather than simply extracting text, we record the precise spatial coordinates of each text element:

{
  "text": "Bank of Example",
  "boundingPolygon": [
    { "x": 2.35, "y": 1.47 },
    { "x": 7.84, "y": 1.45 },
    { "x": 7.84, "y": 2.12 },
    { "x": 2.35, "y": 2.18 }
  ]
}

These bounding box coordinates provide the raw spatial data that will eventually allow us to position text elements correctly in our rendering. For documents in languages like Arabic, we use OCR engines that understand right-to-left text flow and connected scripts.
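
For the later steps, it helps to collapse each OCR entry into a simple record with a center point. A minimal sketch, assuming the JSON layout shown above (the units are whatever your OCR engine reports):

from dataclasses import dataclass

@dataclass
class TextItem:
    text: str
    x_center: float
    y_center: float

def to_text_item(ocr_entry: dict) -> TextItem:
    """Collapse a bounding polygon into a center point for grid placement."""
    xs = [p["x"] for p in ocr_entry["boundingPolygon"]]
    ys = [p["y"] for p in ocr_entry["boundingPolygon"]]
    return TextItem(
        text=ocr_entry["text"],
        x_center=sum(xs) / len(xs),
        y_center=sum(ys) / len(ys),
    )

# Example with the entry shown above:
item = to_text_item({
    "text": "Bank of Example",
    "boundingPolygon": [
        {"x": 2.35, "y": 1.47}, {"x": 7.84, "y": 1.45},
        {"x": 7.84, "y": 2.12}, {"x": 2.35, "y": 2.18},
    ],
})
print(item)  # TextItem(text='Bank of Example', x_center=5.095, y_center=1.805)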

3. Structure Extraction: Finding the Document Skeleton

This is where STR differentiates itself from other approaches. Instead of trying to infer document structure from text positions alone, we directly extract the structural skeleton from the document image through a series of image processing steps:

  1. Grayscale Conversion: We convert the color image to grayscale to simplify processing.

  2. Thresholding: We apply binary thresholding to separate foreground (text and lines) from background.

  3. Noise Removal: Using morphological operations like opening, we remove small artifacts and noise that might interfere with structure detection.

  4. Masking: We create a binary mask that focuses on the structural elements while ignoring text content. We use the OCR’d text’s bounding boxes to compute the mask operation.

  5. Cleaning: Additional morphological operations help clean up the binary image to isolate clear structural lines.

  6. Dilation: We strengthen the detected lines to ensure continuity of the structural elements.

You can see these steps below:

The document page goes through multiple steps that finally enable the extraction of its skeleton.

The result is a binary image that highlights the document’s underlying skeleton — the lines, boxes, and dividers that organize information spatially. The remaining artifacts are not important and will be cleaned in the next step.
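
A rough OpenCV sketch of these steps; the thresholds, kernel sizes, and the pixel-based bounding boxes are illustrative assumptions:

import cv2
import numpy as np

def extract_skeleton(image: np.ndarray, text_boxes: list) -> np.ndarray:
    """Return a binary image containing only the document's structural lines."""
    # 1-2. Grayscale + inverse binary threshold (foreground becomes white).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # 3. Morphological opening removes small specks of noise.
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))

    # 4. Mask out the text itself using the OCR bounding boxes (x, y, w, h in pixels).
    for x, y, w, h in text_boxes:
        binary[y:y + h, x:x + w] = 0

    # 5-6. Clean up what remains and dilate to reconnect broken line segments.
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))
    skeleton = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=1)
    return skeleton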

4. Structural Line Detection: Mapping the Document’s Grid

With our cleaned structural image, we can now detect the horizontal and vertical lines that form the document’s organizational grid:

  1. Horizontal Line Detection: Using horizontal structural elements, we identify the y-coordinates where horizontal lines occur.

  2. Vertical Line Detection: Similarly, we detect the x-coordinates of vertical structural lines.

This step connects the lines that form important parts of the document and removes those that are noise or artifacts, leading to a clean skeleton.

Horizontal (blue) and vertical (red) structural lines are extracted and cleaned to be placed on our render grid later on.
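
A sketch of how those coordinates can be pulled out of the skeleton image using long, thin morphological kernels (the kernel lengths and minimum-coverage ratio are illustrative):

import cv2
import numpy as np

def detect_lines(skeleton: np.ndarray, min_coverage: float = 0.3):
    """Return the y-coordinates of horizontal lines and x-coordinates of vertical lines."""
    h, w = skeleton.shape[:2]

    # Keep only long horizontal runs, then only long vertical runs.
    horizontal = cv2.morphologyEx(skeleton, cv2.MORPH_OPEN,
                                  cv2.getStructuringElement(cv2.MORPH_RECT, (max(w // 20, 1), 1)))
    vertical = cv2.morphologyEx(skeleton, cv2.MORPH_OPEN,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(h // 20, 1))))

    # A row (or column) counts as a structural line if enough of it is filled.
    row_fill = (horizontal > 0).sum(axis=1) / w
    col_fill = (vertical > 0).sum(axis=0) / h
    h_lines = [y for y, fill in enumerate(row_fill) if fill > min_coverage]
    v_lines = [x for x, fill in enumerate(col_fill) if fill > min_coverage]
    # In practice, adjacent coordinates from thick lines would be merged into one.
    return h_lines, v_lines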

Rendering the Document as Text with Structure

With both our OCR text coordinates and our grid skeleton coordinates identified, we now need something similar to what is done in graphics rendering: rasterization.

A continuous shape (left) is discretized (right) so it can be shown on the pixels of our screens. Credit: Princeton CG Course

Just as graphics rendering in video games converts continuous vector geometry to discrete pixels on a screen, we need to map our OCR text elements (which have precise floating-point coordinates) and the document skeleton (whose horizontal and vertical lines are continuous) onto our discrete grid cells. This rasterization process determines which cell should contain each text element based on its spatial position in the original document.

The grid where we will render the structure of the document and the extracted text

This process involves:

  1. Determining which grid cell should contain which text element based on its bounding box coordinates

  2. Rasterizing the horizontal and vertical lines from the document skeleton onto the grid

The result is a text organized in a manner that preserves the document’s original structure, even when converted to a purely textual format.
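
Here’s a minimal sketch of this step, reusing the TextItem records from the OCR section; the grid dimensions are illustrative, and collision handling and compaction are left out here and discussed next:

def rasterize(items, h_lines, v_lines, page_w, page_h, rows=60, cols=120):
    """Project skeleton lines and text items onto a rows x cols character grid."""
    grid = [[" "] * cols for _ in range(rows)]

    # Skeleton first: each detected line becomes a full row of '-' or column of '|',
    # with '+' where they cross.
    for y in h_lines:
        r = min(int(y / page_h * rows), rows - 1)
        grid[r] = ["-"] * cols
    for x in v_lines:
        c = min(int(x / page_w * cols), cols - 1)
        for r in range(rows):
            grid[r][c] = "+" if grid[r][c] == "-" else "|"

    # Then place each OCR item in the cell its center falls into, one character per cell.
    for item in items:
        r = min(int(item.y_center / page_h * rows), rows - 1)
        c = min(int(item.x_center / page_w * cols), cols - 1)
        for i, ch in enumerate(item.text):
            if c + i < cols:
                grid[r][c + i] = ch

    return "\n".join("".join(row) for row in grid)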

That’s Not All There Is To It

Although we’re almost done, there are actually a few more important things to take into account during the rendering.

First, the font size, weight, and style of the text in our render are not the same as in the original document, which means there will be some size discrepancies between the original document and the final render, as you can see in the image below.

Titles are not as big (blue) — Spacing can be quite different (purple) — Skeleton parts can be sized differently (yellow)

This is rarely an issue for titles since they usually sit alone in a large space; for the rest of the text on the page, however, the original font is typically not monospaced, which breaks our assumption of fitting one character per grid cell.

Comparison between variable-width fonts and monospaced fonts. Credit: Wikipedia

The original document uses characters of different widths (variable-width fonts) while we transcribe them into a monospaced font; two characters in the original could occupy the width of a single monospaced character. We have to use a monospaced font because our grid has cells of a fixed 1x1 size, and implementing a variable-sized grid would add too much complexity.

Instead, we adapt the rasterization process to rearrange elements according to priority. For example, we should never render a line on top of text when they are too close, as that would erase the text; we push the line “pixel” to the next available cell in the grid and reorganize everything else to accommodate this new virtual element.

This way, we can spatially arrange the text coherently with only minor distortions in the final render.
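
As a toy sketch of that priority rule (a full implementation would also shift the surrounding cells, which is omitted here), simplified to a single row:

def place_with_priority(row: list, col: int, ch: str) -> None:
    """Write a skeleton character without destroying text already in the row."""
    # Text has priority: slide right until we find a free (whitespace) cell.
    while col < len(row) and row[col] != " ":
        col += 1
    if col < len(row):
        row[col] = ch

# Example: the vertical bar meant for index 3 lands after "Total" instead of inside it.
row = list("Total      1,234.00   ")
place_with_priority(row, 3, "|")
print("".join(row))  # "Total|     1,234.00   "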

Below, you’ll find the Spatial Text Rendering of the bank statement page we’ve been looking at so far:

You'll find the final rendered text here.

Here’s the original document again so you can compare:

Original public document found on Google Images Search. Credit: scribd.com

The Compact Rendering Process

Now that our render is ready, we could input it as is into an LLM but let’s not forget about cost and latency.

LLMs have a limited context size, and their compute grows quadratically: the standard full attention mechanism in Transformers compares every token against every other token in the input sequence, making it O(n²) in compute and memory.

To create a token-efficient representation, we generate a compact text rendering that preserves structural relationships while using as little space as possible.

There are several solutions we can combine to solve this challenge.

1. Empty Space Elimination

We identify and remove empty regions in the render to reduce token usage: an empty region can be an entire line or an entire column of whitespace characters. This step shouldn’t remove any part of the text or the skeleton, though it can shrink the skeleton if needed (a sketch follows the list below).

Candidate empty horizontal/vertical lines that can be removed to make the grid more compact.

In the above image, we have two kinds of candidate lines we can remove to make the render more compact:

  • Red lines can safely be removed without impacting the overall render quality.

  • Yellow lines could be removed, but doing so is risky for some of the content: LLMs tend to get confused when the boundary between different text zones is not clear.
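
Here’s a naive sketch of this step that removes every fully blank row and column; a production version would keep some of the “yellow” boundary lines described above:

def drop_empty_space(render: str) -> str:
    """Remove fully blank rows and columns from a spatially rendered page."""
    rows = [r for r in render.split("\n") if r.strip()]          # drop empty rows
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r.ljust(width) for r in rows]                        # pad to a rectangle
    keep = [c for c in range(width) if any(r[c] != " " for r in rows)]
    return "\n".join("".join(r[c] for c in keep) for r in rows)  # drop empty columns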

2. Consistent Formatting

We standardize the representation of structural elements (horizontal and vertical lines and their intersections) to minimize the token count while maintaining visual clarity. In theory, rasterization can render anything onto the grid, even diagonal lines, but we explicitly control which characters the lines use so we know exactly how many tokens the skeleton costs.

Lines that make up the skeleton can only be horizontal or vertical, can intersect and are completely straight with no angle, so we define them as follows:

  • Horizontal lines use the - character only.

  • Vertical lines use the | character only.

  • Where a vertical and a horizontal line intersect, we represent it with a +, which looks like a | and a - crossing.

As for the text, we rely on the OCR’s output which can be in any language, so it’s usually UTF-8 characters.
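
As a tiny illustration of how these three characters compose, here is a sketch that draws a single empty skeleton cell on a character grid:

def draw_box(rows: int, cols: int) -> str:
    """Draw a rows x cols cell outline using only '-', '|' and '+'."""
    grid = [[" "] * cols for _ in range(rows)]
    for c in range(cols):
        grid[0][c] = grid[rows - 1][c] = "-"
    for r in range(rows):
        for c in (0, cols - 1):
            grid[r][c] = "+" if grid[r][c] == "-" else "|"
    return "\n".join("".join(row) for row in grid)

print(draw_box(4, 12))
# +----------+
# |          |
# |          |
# +----------+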

3. Stretching The Grid

Another solution is, again, to modify the rasterization process, this time by letting the grid stretch. The advantage is that we can now control how much room we have to render the document.

The more room we give the grid, the more cells it has, and the easier it becomes to render the document in a way that resembles the original layout. Conversely, the less room we give it, the fewer cells there are and the closer the document’s elements are pushed together, which saves tokens in the final render.

However, the rasterization process needs to be smart enough to compress the final render while preserving all the important spatial and textual information, since the compaction process can be destructive.

The expansion of the universe also seems to apply to bank statements 🌟

This compaction process can reduce the token count by a huge margin compared to naive spatial text rendering, making it feasible to process even lengthy document pages within the context windows of most LLMs.

By combining image processing techniques with text rendering methods, Spatial Text Rendering bridges the gap between visual document structure and text-only LLM processing. The result is a method that gives “sight” to models that were previously blind to document layout, enabling accurate extraction of financial data from even the most complex statements.

Let’s Make LLMs See

Now that we have our Spatial Text Rendering ready, the next step is to leverage LLMs to extract meaningful information from it. The beauty of STR is that it transforms visual document understanding into a text processing task, which is exactly what LLMs excel at.

Prompting Strategy

When prompting an LLM with an STR-processed document, we need to provide clear instructions that help the model understand what it’s looking at. The prompt structure should include the following points:

  1. Context setting: Explain that the document is a spatial text rendering of a financial document

  2. Task definition: Clearly define what information needs to be extracted

  3. Structure guidance: Help the LLM understand the significance of the structural elements

  4. Output format: Specify the desired format for the extracted information

Here’s the prompt we’ll be using for our later experiments:

The following is a spatially-rendered bank statement that preserves the original document's layout structure. The horizontal lines (-), vertical lines (|), and intersections (+) represent the document's skeleton.

[STR CONTENT GOES HERE]

Please extract in JSON format the following information:
1. Account number
2. Statement date
3. Opening balance
4. Closing balance
5. All transactions with their dates, descriptions, and amounts

Format the transactions as a JSON list with each having its Date, Description, Amount, Direction and Running Balance.

Of course, there are many ways to design the prompt and we could even provide few-shot example extractions if needed. In this article, we’ll be keeping it simple for the purpose of demonstration. Feel free to experiment with your own prompt structures.
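
For illustration, here’s how the rendered page and the prompt above could be sent to a model using the OpenAI Python client; the model name is just an example, and any chat-style API would work the same way:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """The following is a spatially-rendered bank statement that preserves the original document's layout structure. The horizontal lines (-), vertical lines (|), and intersections (+) represent the document's skeleton.

{str_content}

Please extract in JSON format the following information:
1. Account number
2. Statement date
3. Opening balance
4. Closing balance
5. All transactions with their dates, descriptions, and amounts

Format the transactions as a JSON list with each having its Date, Description, Amount, Direction and Running Balance."""

def extract_statement(str_content: str, model: str = "gpt-4o") -> str:
    """Send the spatial text rendering plus the extraction prompt to the model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(str_content=str_content)}],
    )
    return response.choices[0].message.content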

Models Comparison

We’ve tested multiple models on the provided Spatial Text Rendering of our example document. You’ll find the results here in a Google Sheet, which includes the prompt used, remarks, and each model’s output compared to the reference. We also provide our own model’s output.

The table shows that all tested LLMs understand this format very well, whether they are reasoning models, smaller models, or even older models like GPT-3.5 Turbo, which shows how powerful and general this representation is. The models don’t even need to be fine-tuned to work well with this format, although fine-tuning is still recommended for production usage to reach much higher accuracy and consistency. All tested LLMs were kept at their default settings and the system prompt was left unchanged.

Our internal large and diverse test set shows that fine-tuning an LLM on this format can greatly reduce its error rate and is the preferred way for production usage.

Beyond Extraction: Reasoning and Analysis

What’s particularly interesting is that once an LLM understands the document structure, it can go beyond simple extraction to perform higher-level reasoning.

The model could analyze transaction trends to identify recurring events, unusual transactions or patterns, and classify transactions into useful categories to help provide spending breakdowns.

For example, asking Claude 3.7 about spending patterns in the statement, this is the response we get:

# Spending Analysis

- **Major outflow**: Large STC Pay transaction of 8,000 SAR on June 15
- **International transfers**: 2,440 SAR sent to South Africa via SWIFT
- **Income**: Two salary deposits from Larsen & Toubro totaling 8,498.25 SAR
- **Retail patterns**: Multiple purchases at LULU Hypermarket
- **Digital services**: Regular small STC Pay transactions and one AliExpress purchase (975.40 SAR)
- **Food spending**: Small transactions at restaurants (AL SAEEDI FISH, ALTAZAJ)
- **Transportation**: Careem ride for 50 SAR
- **Cash usage**: One ATM withdrawal of 100 SAR

Net balance decreased from 6,894.90 to 2,733.61 SAR during this period, primarily due to the large STC Pay transaction.

At Abwab.ai, we leverage this enhanced document understanding in several applications that optimize lending processes for SMEs. Transaction enrichment allows us to automatically categorize spending into meaningful buckets like utilities, salary, investments, and more, giving a clearer picture of where the money goes. It also enables us to generate personalized insights based on an SME’s unique transaction history, highlighting patterns they might not notice themselves. Finally, we can identify unusual transaction patterns that might indicate potential fraud or financial distress, leading to better lending risk assessment.

These capabilities help transform raw document data into actionable financial intelligence, helping our clients make better financial decisions.

Conclusion

Spatial Text Rendering (STR) offers a practical solution to document understanding challenges that bridges the gap between complex financial documents and text-only LLMs. By preserving spatial relationships through an efficient text format, STR effectively gives a sense of “sight” to models without requiring specialized vision encoders.

At Abwab.ai, we developed this approach to optimize lending processes for financial institutions assessing SME loans. The method excels in multilingual environments like Arabic and English, processes high-volume documents efficiently, and works with existing LLM infrastructure. This gives it a critical advantage when precision and speed in financial data extraction are top priority.

Our testing shows that even without specific fine-tuning, most LLMs, small or big, can effectively process spatially rendered documents, though fine-tuning improves accuracy for production use cases. This suggests LLMs possess more robust structural reasoning abilities than previously thought.

While Vision-Language Models (VLMs) will continue to evolve, Spatial Text Rendering (STR) provides an immediate, practical solution enabling smarter, faster, and more reliable financial document processing.

This article is written by Chady Dimachkie, Head of Machine Learning at Abwab.ai

This article is originally published on Medium here.
