From Navigation to Reading: How I Built an Explainable Braille Recognition Pipeline
Research ideas rarely arrive as perfectly planned projects.
At least, mine never do.
Most of the time, they begin with a small observation, an unanswered question, or a problem that keeps bothering me long after everyone else has moved on.
This project started exactly that way.
During my M.Tech, I was working on my first research project while contributing to a DST-SERB-funded internship focused on assisting visually impaired individuals. The objective was straightforward: help users navigate outdoor environments safely by detecting hazards and generating reliable routes.
I worked on almost every aspect of the project, from dataset annotation and preprocessing to model selection, experimentation, and evaluation. For the first time, I wasn't just training models; I was seeing how a complete research system was built from the ground up.
As exciting as the project was, a question kept appearing in my mind.
The system could help a visually impaired person reach a destination.
But what happened after they arrived?
A navigation system can guide someone to a classroom, library, office, or public building. Yet once they get there, they still need access to information. They may encounter Braille signs, labels, notices, instructions, or documents that need to be read and understood.
Navigation solves movement.
It does not solve information accessibility.
That realization stayed with me.
And eventually it evolved into a simple question:
What would a practical reading assistant for visually impaired individuals actually look like?
That question became the starting point of one of the most rewarding research projects I have worked on.
The First Idea
My initial design was surprisingly simple.
Braille documents contain multiple characters arranged in structured patterns. Since I already had experience working with object detection models, my first instinct was straightforward:
Detect Braille characters.
Convert detections into text.
Correct recognition mistakes using a language model.
Convert the corrected text into speech.
On paper, the entire pipeline came together in less than thirty minutes.
Braille Image
↓
YOLO Detection
↓
Text Formation
↓
Transformer Correction
↓
Text-to-Speech
At the time, it felt like a complete solution.
YOLO would handle detection.
A Transformer would handle text correction.
A Text-to-Speech engine would provide audio output.
Simple.
Effective.
Done.
Or so I thought.
The Problem Nobody Talks About
Building a working system and building a research contribution are two completely different things.
A working system solves a problem.
Research must answer a question.
Before writing a significant amount of code, I started reading literature.
A lot of literature.
Paper after paper.
Method after method.
Dataset after dataset.
I wanted to understand how researchers had approached Optical Braille Recognition over the years.
The more I read, the more I noticed a recurring pattern.
Almost every system treated Braille as a classification problem.
Character
↓
Classifier
↓
Prediction
Take a character.
Classify it.
Improve accuracy.
Repeat.
The architectures changed.
The datasets changed.
The performance numbers improved.
But the underlying philosophy remained largely the same.
Then I looked at Braille itself.
And something felt strange.
Looking at Braille Differently
Unlike natural images or handwritten text, Braille follows an extremely rigid structure.
Every Braille symbol consists of six possible dot positions arranged in a fixed 2×3 grid: two columns wide and three rows tall.
● ○
○ ●
● ○
Every letter.
Every number.
Every symbol.
All emerge from combinations of these six positions.
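To make the six-position idea concrete, here is a minimal sketch that represents a cell as a set of raised dots and draws it as a grid. The dot numbering and the four letters are standard Braille; the names DOT_POSITIONS, LETTER_DOTS, dots_to_grid, and render are illustrative choices of mine, not taken from the project code.

```python
# Standard Braille numbering: dots 1-3 run down the left column, 4-6 down the right.
DOT_POSITIONS = {1: (0, 0), 2: (1, 0), 3: (2, 0), 4: (0, 1), 5: (1, 1), 6: (2, 1)}

# A few letters from the standard Braille alphabet, as sets of raised dots.
LETTER_DOTS = {"a": {1}, "b": {1, 2}, "c": {1, 4}, "o": {1, 3, 5}}


def dots_to_grid(dots):
    """Return a 3-row by 2-column grid with 1 for raised dots and 0 otherwise."""
    grid = [[0, 0] for _ in range(3)]
    for dot in dots:
        row, col = DOT_POSITIONS[dot]
        grid[row][col] = 1
    return grid


def render(grid):
    """Draw a grid with the same filled and hollow circles used in the pattern above."""
    return "\n".join(" ".join("●" if v else "○" for v in row) for row in grid)


print(render(dots_to_grid(LETTER_DOTS["o"])))  # the letter "o" gives the pattern shown above
print(2 ** len(DOT_POSITIONS))                 # 64 possible cells, including the blank one
```

Because every symbol is one of only 64 patterns, a predicted class can always be translated back into an expected dot layout and checked against the image.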
That observation led me to a question.
If Braille is fundamentally a structured representation, why are we treating it as a classification problem first?
Why not verify the structure itself?
Why trust a prediction without validating whether the underlying dot arrangement actually makes sense?
That single thought redirected the entire project.
Detection Wasn't Enough
I trained a YOLOv11 detector using a dataset containing more than 13,000 Braille images.
The detector performed well.
Localization accuracy was strong.
Recognition performance was promising.
Most researchers would probably stop there.
I couldn't.
Another question appeared.
What if the detector is confidently wrong?
Not uncertain.
Not borderline.
Wrong.
And absolutely confident about it.
How would I know?
More importantly, how could the system know?
That question eventually led to the first major contribution of the project.
Structural Verification Using a 2×3 Grid
Instead of blindly trusting the detector's output, I designed a verification stage based on the actual geometry of Braille.
Once YOLO detected a character, the corresponding region was cropped and aligned.
The cropped character was then passed through a grid-based verification module inspired by Region Proposal Networks (RPNs).
Instead of proposing arbitrary object regions, this module leveraged the fixed 2×3 structure of Braille.
Each dot position became an interpretable verification point.
The module essentially asked:
Does this detected character actually match the expected structural representation of its predicted class?
The system now had two independent sources of evidence:
YOLO classification
Structural verification
The confidence scores from these two sources were fused before producing the final prediction.
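The verification and fusion steps can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the project's actual module: the crop is assumed to be already binarized and aligned, the function names, the dot_threshold value, and the weighted-average fusion rule with alpha are all hypothetical, and the real grid module may score each position quite differently.

```python
# Expected layouts (3 rows x 2 columns) for two standard Braille letters.
EXPECTED = {
    "b": [[1, 0], [1, 0], [0, 0]],  # dots 1 and 2
    "c": [[1, 1], [0, 0], [0, 0]],  # dots 1 and 4
}


def cell_occupancy(crop, rows=3, cols=2):
    """Fraction of 'on' pixels in each grid cell of a binary crop.

    Assumes the crop is at least `rows` pixels tall and `cols` pixels wide.
    """
    h, w = len(crop), len(crop[0])
    grid = []
    for r in range(rows):
        r0, r1 = r * h // rows, (r + 1) * h // rows
        row_vals = []
        for c in range(cols):
            c0, c1 = c * w // cols, (c + 1) * w // cols
            pixels = [crop[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row_vals.append(sum(pixels) / len(pixels))
        grid.append(row_vals)
    return grid


def structural_score(crop, expected, dot_threshold=0.2):
    """Share of the six positions whose observed state matches the expected one."""
    occupancy = cell_occupancy(crop)
    matches = 0
    for r in range(3):
        for c in range(2):
            observed = 1 if occupancy[r][c] >= dot_threshold else 0
            if observed == expected[r][c]:
                matches += 1
    return matches / 6


def fuse(det_conf, struct_score, alpha=0.5):
    """Hypothetical fusion rule: weighted average of the two confidence scores."""
    return alpha * det_conf + (1 - alpha) * struct_score


# A 6x4 binary crop with raised dots at positions 1 and 4 (the letter "c").
crop = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# The detector is "confidently wrong": it says "b" with confidence 0.9.
print(fuse(0.9, structural_score(crop, EXPECTED["b"])))  # pulled down by the mismatch
print(fuse(0.9, structural_score(crop, EXPECTED["c"])))  # structure agrees with "c"
```

The point of the example is the one raised earlier: a detector that is confidently wrong still has to agree with the dot geometry, so the fused score drops when the structure contradicts the predicted class.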
For the first time, the system wasn't merely recognizing Braille.
It was validating it.
Making the System Explainable
While studying existing literature, another issue became increasingly obvious.
Many papers reported impressive accuracy numbers.
Very few explained failures.
Even fewer attempted to understand them.
That bothered me.
Assistive technologies should not behave like black boxes.
If a system makes a mistake, developers and researchers should understand why.
That realization led to the second major contribution.
Explainability.
I trained an Attention-CNN specifically for generating visual explanations of Braille predictions.
Instead of simply predicting a character, the network highlighted the regions that influenced its decision.
Braille Character
↓
Attention CNN
↓
Attention Map
↓
Visual Explanation
The resulting heatmaps revealed which dots contributed most strongly to a prediction.
When predictions were correct, the highlighted regions aligned with the expected Braille pattern.
When predictions were incorrect, the attention maps often revealed precisely where confusion originated.
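One simple way to read such a heatmap against the Braille grid is to sum the attention that falls on each of the six dot positions. The sketch below assumes the attention map is a non-negative 2D array aligned with the cropped cell; attention_per_dot, dominant_dots, and the min_share cut-off are hypothetical names and values used only for illustration.

```python
def attention_per_dot(attn, rows=3, cols=2):
    """Share of the total attention mass that falls in each cell of the 3x2 grid."""
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    shares = []
    for r in range(rows):
        r0, r1 = r * h // rows, (r + 1) * h // rows
        row_shares = []
        for c in range(cols):
            c0, c1 = c * w // cols, (c + 1) * w // cols
            mass = sum(attn[y][x] for y in range(r0, r1) for x in range(c0, c1))
            row_shares.append(mass / total if total else 0.0)
        shares.append(row_shares)
    return shares


def dominant_dots(shares, min_share=0.2):
    """Dot numbers (standard 1-6 numbering) whose attention share reaches min_share."""
    return sorted(
        col * 3 + row + 1
        for row, row_shares in enumerate(shares)
        for col, share in enumerate(row_shares)
        if share >= min_share
    )


# Toy attention map, one value per grid cell, for a cell predicted as "o" (dots 1, 3, 5).
attn = [
    [4.0, 0.0],
    [0.0, 3.0],
    [1.0, 0.0],
]
shares = attention_per_dot(attn)
print(shares)
print(dominant_dots(shares))  # dot 3 receives little attention in this toy map
```

Comparing the dominant dots with the dots expected for the predicted class gives a compact, per-position summary of where the model was looking.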
The goal wasn't merely improving accuracy.
The goal was understanding failure.
And in research, understanding failure is often just as important as achieving success.
Bringing Everything Together
As the project evolved, the pipeline became significantly more sophisticated than the original idea.
What started as a simple detector gradually transformed into a multi-stage framework focused on reliability, verification, explainability, and accessibility.
The complete workflow is illustrated below.
End-to-End System Workflow
Figure: Sequence diagram illustrating the interaction between detection, structural verification, explainability, language correction, storage, and speech synthesis components.
The workflow begins when a user uploads or scans a Braille document.
The YOLOv11 detector identifies individual Braille cells and extracts candidate regions.
The structural verification module validates the underlying 2×3 dot arrangement.
The Attention-CNN generates explainability maps and confidence information.
These outputs are combined during post-processing to reconstruct meaningful text sequences.
The reconstructed text is then refined using a Transformer-based correction model before being converted into speech through the Text-to-Speech engine.
Unlike conventional Optical Braille Recognition systems, the architecture continuously accumulates evidence across multiple stages before producing a final output.
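The order of the stages described above can be sketched as a small orchestration function. Everything here is an illustrative skeleton rather than the actual implementation: each stage is passed in as a plain function, and the names (read_braille, CellResult), the single-line reading order, the score averaging, and the "?" marker for weak cells are my own assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CellResult:
    char: str            # class predicted by the detector
    det_conf: float      # detector confidence
    struct_score: float  # agreement with the expected dot layout
    attention: Any       # explanation map kept for later inspection
    x: float             # left edge of the box, used for reading order


def read_braille(image, detect, verify, explain, correct, speak, min_score=0.5):
    """Run the stages in order. Every stage is a caller-supplied function."""
    cells = []
    for char, det_conf, box, crop in detect(image):
        cells.append(CellResult(char, det_conf, verify(crop, char), explain(crop), box[0]))

    # Assumption: a single line of text, read left to right.
    cells.sort(key=lambda cell: cell.x)

    # Assumption: average the two scores and mark weak cells with "?".
    raw = "".join(
        cell.char if (cell.det_conf + cell.struct_score) / 2 >= min_score else "?"
        for cell in cells
    )
    text = correct(raw)
    speak(text)
    return text, cells


# Stub stages standing in for the real models.
def fake_detect(image):
    # Yields (predicted char, confidence, (x, y, w, h) box, crop).
    return [("i", 0.9, (20, 0, 10, 15), None), ("h", 0.8, (0, 0, 10, 15), None)]


spoken = []
text, cells = read_braille(
    "page.png",
    detect=fake_detect,
    verify=lambda crop, char: 1.0,
    explain=lambda crop: None,
    correct=str.strip,
    speak=spoken.append,
)
print(text)
```

Passing the stages in as functions keeps the skeleton testable with stubs and makes it easy to swap one component out, which is also what an ablation study needs.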
Did the Extra Complexity Actually Help?
A natural question emerged.
Was all of this complexity actually useful?
Or was I simply adding more components for the sake of novelty?
The only way to answer that question was through experimentation.
So I conducted extensive ablation studies.
Different architectural choices were evaluated independently to understand the contribution of each component.
One experiment replaced YOLO-based localization with a sliding-window approach.
Braille Image
↓
Sliding Window
↓
Classification
↓
NMS
↓
Prediction
The approach appeared reasonable in theory.
The results told a different story.
The sliding-window system frequently produced:
Duplicate detections
Fragmented character predictions
Lower consistency
Increased false positives
In contrast, the combination of YOLO localization and structural verification consistently generated cleaner predictions and lower error rates.
These experiments demonstrated that the verification stage was solving a genuine problem rather than introducing unnecessary complexity.
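For readers unfamiliar with the NMS step in the sliding-window baseline, here is a minimal greedy version in plain Python. It is a generic textbook sketch, not the code used in the ablation, and the box coordinates and the 0.5 threshold are made-up values. It also hints at one way duplicates may survive: a window shifted far enough from the best one falls below the overlap threshold and is kept.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. Returns indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Three windows over the same Braille cell, at horizontal offsets 0, 2, and 6.
boxes = [(0, 0, 10, 15), (2, 0, 12, 15), (6, 0, 16, 15)]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores)
print(kept)  # the window at offset 2 is suppressed, the one at offset 6 survives
```

Lowering the threshold removes more of these shifted windows but starts to merge neighbouring cells, which sit very close together in Braille text.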
Language Understanding Matters Too
Even after successful detection and verification, another challenge remained.
Character-level accuracy was high.
Sentence-level readability was not.
Occasionally the system produced incomplete words or awkward sentence fragments.
To address this issue, I incorporated a Transformer-based language correction stage.
Detected Text
↓
Transformer
↓
Corrected Text
↓
Text-to-Speech
The objective was not to invent new content.
It was simply to repair minor recognition errors and improve readability.
The difference was immediately noticeable.
The output transformed from a sequence of recognized symbols into coherent natural language that sounded correct when spoken aloud.
Only then was it passed to the Text-to-Speech system.
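One way to enforce the "repair, don't invent" objective is to accept a correction only when it stays close to the recognized text. The sketch below is an assumption of mine rather than the project's method: the correction model is represented by any function from string to string, and guarded_correction and the max_change budget are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def guarded_correction(raw, correct, max_change=0.2):
    """Apply `correct`, but fall back to the raw text if too much was changed."""
    candidate = correct(raw)
    budget = max_change * max(len(raw), 1)
    return candidate if edit_distance(raw, candidate) <= budget else raw


raw = "the libary is open"

# A small repair stays within the budget and is accepted.
print(guarded_correction(raw, lambda s: s.replace("libary", "library")))

# A rewrite that adds new content exceeds the budget and is rejected.
print(guarded_correction(raw, lambda s: "the library is open until nine tonight"))
```

A guard like this matters for an assistive tool, since a fluent but invented sentence read aloud is harder for the listener to detect than a slightly misspelled word.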
Experimental Resources and Reproducibility
One principle I strongly believe in is research transparency.
A paper should never be the only place where results exist.
For readers interested in exploring the experimental details, implementation artifacts, and supporting analyses, I have made additional resources available.
Source Code
The repository contains:
Model implementations
Training pipelines
Attention visualization modules
Structural verification components
Experimental workflows
Supporting utilities
Ablation Studies and Supplementary Results
The supplementary repository includes:
Ablation studies
Alternative architecture evaluations
Error analyses
Experimental logs
Additional validation results
Intermediate outputs
Looking Back
What started as a simple object detection project gradually evolved into something much larger.
The most important lesson wasn't about YOLO.
It wasn't about Attention-CNNs.
It wasn't even about Transformers.
The most valuable lesson was learning to rethink the problem itself.
Most systems approached Braille as a recognition challenge.
I eventually started viewing it as a structural reasoning problem.
That shift changed everything.
The detector found the characters.
The verifier validated their structure.
The attention model explained the reasoning.
The Transformer improved readability.
The speech engine delivered the information.
Each stage existed for a specific purpose.
And together they formed the foundation of BrailleVision-XAI, an explainable Braille recognition framework designed not only to recognize information, but also to validate, explain, and communicate it.
In the end, the project was never just about reading Braille.
It was about building a system that people could trust.