I Spent Weeks Fighting OCR Before Realizing I Was Solving the Wrong Problem
Heads-up: The figure captions in this article are clickable. Click on any figure caption to view the associated outputs, visualizations, and intermediate results discussed in that section.
Hey guys, it's Manikanta.
About a year ago, I was building a handwriting synthesis system and thought I had everything figured out.
I needed handwritten characters.
OCR reads text.
Problem solved, right?
Not even close.
I ended up spending weeks fighting with preprocessing pipelines, edge detection, and cropping logic before realizing I had completely misunderstood the actual problem.
Ironically, the solution had been suggested to me twice before I finally paid attention.
The Original Goal
The project itself was fairly straightforward.
I wanted to build a system that could learn a person's handwriting style and then generate new text using that style.
To do that, I first needed to collect handwritten samples from the user.
The idea was simple.
Ask the user to write:
A-Z
a-z
0-9
Store those handwritten characters.
Build a character library.
Then later reconstruct arbitrary text using those stored character patterns.
At least on paper, the plan looked simple.
The Obvious Solution
Back then, OCR seemed like the perfect tool.
After all, OCR exists to read text from images.
Why would I build something custom when OCR already solved the problem?
So I started experimenting with OCR-based approaches.
And for a while, everything seemed reasonable.
Until I realized something important.
OCR was solving a different problem than the one I actually had.
The Problem Nobody Notices At First
Most OCR systems are designed to answer one question:
What text is present in this image?
But that wasn't my question.
My question was:
Where is each handwritten character?
That difference sounds small.
It isn't.
OCR wants to understand text.
I wanted to build a database of individual handwritten symbols.
Those goals overlap, but they aren't the same thing.
The First Signs Something Was Wrong
As I started testing, I noticed several issues.
Sometimes characters would be slightly tilted.
Sometimes spacing would vary.
Sometimes parts of characters would get clipped.
Even when OCR produced the correct text, it wasn't giving me what I actually needed.
I didn't want the letter "R".
I wanted the image of that specific handwritten "R".
Those are completely different outputs.
For example, consider the OCR result below.
Figure 1. OCR successfully recognizing handwritten characters.
The OCR system correctly identifies the characters.
Great.
But for my use case, this output was almost useless.
Because I wasn't trying to read the text.
I was trying to extract and preserve the handwritten appearance of each individual character.
OCR was giving me labels.
I needed images.
Trying To Force The Problem
At this point, I convinced myself that I could solve everything with preprocessing.
So I started building increasingly complicated pipelines.
I experimented with:
Thresholding
Edge detection
Contour extraction
Connected component analysis
Manual cropping heuristics
I spent days trying to create rules that could automatically isolate characters.
Sometimes it worked.
Sometimes it failed spectacularly.
Every time I fixed one corner case, another one appeared.
Different handwriting styles broke assumptions.
Different spacing broke assumptions.
Different stroke widths broke assumptions.
The solution kept growing more complicated.
And that's usually a bad sign.
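To make that failure mode concrete, here is a minimal sketch of connected component analysis, one of the techniques listed above, written in plain Python on a toy grid rather than a real scan. The function name `connected_components` and the `PAGE` grid are illustrative assumptions, not code from the project.

```python
from collections import deque


def connected_components(grid):
    """Return (top, left, bottom, right) boxes of 4-connected ink blobs."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or seen[r][c]:
                continue
            # Flood-fill one blob and track how far it extends.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == "#" and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy page: a dotted "i" on the left, then two strokes that touch.
PAGE = [
    ".#.....",
    ".......",
    ".#..#.#",
    ".#..###",
    ".#..#.#",
]

boxes = connected_components(PAGE)
for box in boxes:
    print(box)
```

On this toy page the heuristic returns three boxes for three intended characters, yet two of the boxes are wrong: the dot of the "i" is separated from its stem, and the two touching strokes come back as a single blob. A rule that merges nearby blobs to rescue the "i" tends to make the touching-stroke case worse, which is the kind of whack-a-mole described above.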
Changing The Data Collection Process
Eventually, I tried attacking the problem from a different angle.
Instead of changing the algorithm, I changed the way data was collected.
I asked users to write characters in a structured format.
Something like:
A-M on one line
N-Z on another line
Lowercase characters separately, split across lines the same way as the uppercase ones
Digits separately
The sheet looked something like this.
Figure 2. Structured handwriting collection sheet used during data acquisition.
This definitely helped.
The characters became easier to process.
But I still had the same underlying problem.
I could see the characters.
I still didn't have a reliable way to locate them automatically.
The Question I Ignored Twice
Around this time, two different people asked me exactly the same question.
The first was a classmate working with me on a government-funded research project.
He asked:
"Why aren't you using YOLO?"
I ignored the suggestion.
Not because it was bad.
Mostly because I was comfortable with what I already knew.
A few days later, a professor asked me the same thing.
Again:
"Why aren't you using YOLO?"
My answer was:
"We don't have a YOLO dataset."
At the time, that answer felt completely reasonable.
Looking back, it mostly revealed how little I understood about object detection.
The Bus Ride That Changed Everything
One day while travelling home on a bus, I started watching videos about YOLO.
I wasn't expecting much.
I just wanted to understand why two different people had suggested it.
Then suddenly everything clicked.
The thing I had spent weeks trying to build manually was literally what object detection models were designed to do.
I remember thinking:
Wait...
This thing gives bounding boxes automatically.
That was the moment the entire problem changed in my head.
I Was Solving The Wrong Problem
For weeks, my thinking looked like this:
Image
↓
Crop Characters
↓
Classify
The problem was that the hardest step was the first one.
How do you reliably find every character?
YOLO completely changed that pipeline.
Now it became:
Image
↓
Detect Characters
↓
Get Bounding Boxes
↓
Crop
Localization became far easier because it was now handled by a model built specifically for object detection, instead of by my hand-written cropping rules.
And that was exactly what I needed.
Proof That Localization Was The Real Problem
After training a detector, the output immediately looked different.
Instead of trying to read text, the model simply identified where every character existed.
Figure 3. YOLO-based character localization using bounding boxes.
This was the first time I had a reliable way of locating every handwritten character on the page.
And once the location exists, cropping becomes trivial.
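As a rough illustration of why cropping is the easy part, the sketch below converts a box from the normalized center format used in YOLO label files into pixel corners and slices the region out. The image is a toy list of strings standing in for pixel rows, and the names `yolo_to_pixels` and `crop_box` are hypothetical. Many detection libraries return pixel corners directly, in which case only the slicing step applies.

```python
def yolo_to_pixels(box, img_w, img_h):
    """Convert normalized (x_center, y_center, width, height) to pixel corners."""
    xc, yc, w, h = box
    # Clamp to the image so a box that overshoots the edge still crops cleanly.
    left = max(0, round((xc - w / 2) * img_w))
    top = max(0, round((yc - h / 2) * img_h))
    right = min(img_w, round((xc + w / 2) * img_w))
    bottom = min(img_h, round((yc + h / 2) * img_h))
    return left, top, right, bottom


def crop_box(image, box):
    """Slice one detected region out of an image stored as a list of rows."""
    left, top, right, bottom = yolo_to_pixels(box, len(image[0]), len(image))
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x4 "image" with one character in each half.
PAGE = [
    "##....#.",
    "#.#...#.",
    "##....#.",
    "#.....#.",
]

# Two made-up detections, one per character.
detections = [(0.25, 0.5, 0.5, 1.0), (0.75, 0.5, 0.5, 1.0)]
crops = [crop_box(PAGE, det) for det in detections]

for crop in crops:
    print("\n".join(crop))
    print()
```

With a real image the slicing would be done on a pixel array instead of strings, but the logic is the same: once the box exists, the crop is a two-line operation.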
Building The Character Library
Once bounding boxes were available, extracting individual characters became easy.
For example:
Figure 4. Individual handwritten character automatically extracted from a detected region.
This crop could then be stored directly in the handwriting library.
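A crop is only useful in the library if it is stored under the right character. One possible way to get that label, sketched below, is to lean on the structured sheet: group the detected boxes into rows, sort each row left to right, and pair the boxes with the characters the user was asked to write on that line. This is an assumption for illustration, not necessarily how the project assigned labels, and `group_into_rows` and `label_boxes` are hypothetical names.

```python
def group_into_rows(boxes):
    """Group (left, top, right, bottom) boxes into rows, each sorted left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center_y = (box[1] + box[3]) / 2
        # A box joins the current row if its center falls inside the
        # vertical span of that row's first box.
        if rows and rows[-1][0][1] <= center_y <= rows[-1][0][3]:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def label_boxes(boxes, row_labels):
    """Pair each box with the character expected at its position on the sheet."""
    rows = group_into_rows(boxes)
    if len(rows) != len(row_labels):
        raise ValueError("number of detected rows does not match the sheet layout")
    labeled = {}
    for row, labels in zip(rows, row_labels):
        if len(row) != len(labels):
            raise ValueError("a row has missing or extra detections")
        for box, char in zip(row, labels):
            labeled[char] = box
    return labeled


# Made-up detections in arbitrary order for a tiny two-line sheet.
detected = [
    (40, 12, 60, 32),
    (12, 50, 30, 70),
    (10, 10, 30, 30),
    (42, 52, 60, 72),
    (70, 8, 90, 28),
]

library_index = label_boxes(detected, ["ABC", "DE"])
for char, box in library_index.items():
    print(char, box)
```

The sketch fails loudly when a row has a missing or extra detection, since silently shifting every label by one position would corrupt the whole library.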
The complete pipeline became:
Handwritten Sheet
↓
YOLO Detection
↓
Bounding Boxes
↓
Character Crops
↓
Character Library
↓
Synthetic Text Generation
And suddenly the project started moving again.
What This Project Actually Taught Me
The biggest lesson wasn't YOLO.
The biggest lesson wasn't object detection.
The biggest lesson was understanding the problem correctly.
For weeks I kept asking:
How do I crop handwritten characters?
The better question was:
How do I locate handwritten characters?
Once the question changed, the solution became obvious.
That's a lesson I've carried into almost every project since.
A surprising amount of research isn't about finding a better model.
It's about framing the problem correctly.
Sometimes the bottleneck isn't the architecture.
Sometimes it's the question you're asking.
And in this case, the answer wasn't hidden in a more complicated algorithm.
It was hidden in realizing that I had been solving the wrong problem all along.
Looking Back
Interestingly, this wasn't the last time this idea showed up.
The experience taught me to think about localization and representation separately from classification.
That mindset later influenced my work on fine-grained alphanumeric recognition and eventually my Braille recognition research, where understanding what information should be extracted became just as important as deciding which model to use.
In many ways, those later projects started with lessons learned from this one.
Further Reading
If you're interested in the complete development journey behind the handwriting synthesis system, including dataset preparation, GAN-based generation, experimentation, failures, and architectural decisions, you can read the full project series below.
The Journey Behind My Handwriting Generation Pipeline
Related Projects
Text-Generation
The original handwriting synthesis project where this journey started.
[GitHub](https://github.com/ssb0031/Text-Generation)
Fine-Grained Alphanumeric Recognition
Hybrid YOLOv11 + ConvNeXt-Tiny framework for fine-grained alphanumeric character recognition with SVM-based ambiguity refinement and embedding visualization.
BrailleVision-XAI
An Explainable Multi-Stage Braille Recognition and Decoding Framework.
Sometimes the most valuable outcome of a project isn't the final model.
It's the way it changes how you approach the next problem.