Tiny OCR
A 0.9B-parameter model running on CPU matched a 50¢/M-token cloud API

In our paper reading group last month I presented DeepSeek's Context Optical Compression paper, which claimed impressive performance with a much smaller model (3 billion parameters) than the frontier models I had previously used for OCR (Claude Sonnet and Gemini Flash). Then last week z.ai released an even smaller OCR model, GLM-OCR, at just 0.9 billion parameters. The prospect of something much cheaper and potentially faster prompted me to try it on an old eval: Gemini 3 Flash costs around 50 cents per million input tokens, compared to 3 cents for GLM-OCR if you buy it from z.ai (which you don't have to).

On top of that, I decided to run it locally on my MacBook Pro M3 Max's CPU with PyTorch (I tried the Metal GPU first, but it ran out of memory, so I fell back to CPU). Run locally this way, the model was slow, about a minute per page. However, it performed the same as Gemini 3 Flash on the eval from a former project.
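For anyone attempting a similar local setup, here is a minimal sketch of that kind of device selection with a CPU fallback. The helper names are mine, and the actual GLM-OCR loading and inference code is omitted; this only illustrates the fallback pattern, assuming PyTorch:

```python
# Sketch of picking a device with a CPU fallback for out-of-memory
# errors (e.g. the Metal/MPS backend running out of memory).
# Helper names are illustrative; real code would wrap the actual
# OCR model's inference call.

def pick_device():
    """Prefer Apple's Metal backend (MPS) when PyTorch reports it."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

def with_cpu_fallback(run, *args):
    """Try `run` on the preferred device; retry on CPU if it fails.

    PyTorch surfaces MPS out-of-memory as a RuntimeError, so we catch
    that and rerun the same call on CPU.
    """
    try:
        return run(*args, device=pick_device())
    except RuntimeError:
        return run(*args, device="cpu")
```

The same pattern works for CUDA out-of-memory errors on Linux boxes; only the preferred-device check changes.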
This eval uses OCR as one step in a multi-step entity-extraction process. The pipeline extracts entities from a wide variety of documents: anything from handwriting to a screenshot of a checkout cart at a restaurant.
Here's how the pipeline works:

1. Get the input document.
2. Convert the input document (PDF, XLS, etc.) into text (this is the step where I dropped in the new model).
3. Use an LLM (GPT-4o) to structure the raw text into lists of items.
4. Do a database search (RAG; we used HNSW on Postgres) for each item.
5. Have the LLM match each item to a specific database item from the list we looked up.
6. Do a QA step where we compare the original document to the final list.
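The steps above can be sketched in code. Everything here is a stub with illustrative names: the real pipeline calls the OCR model, GPT-4o, and Postgres where these stubs just return placeholder data, but the data flow between steps is the same:

```python
# Illustrative sketch of the pipeline; all function bodies are stubs
# standing in for model/API/database calls.

def ocr_to_text(doc):              # step 2: GLM-OCR or Gemini goes here
    return doc["text"]             # stub: pretend OCR already ran

def structure_items(raw_text):     # step 3: LLM turns raw text into items
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

def db_lookup(item):               # step 4: HNSW vector search in Postgres
    return [f"{item}-candidate-{i}" for i in range(3)]  # stub candidates

def match_item(item, candidates):  # step 5: LLM picks the best DB match
    return candidates[0]           # stub: take the top candidate

def qa_check(doc, matched):        # step 6: compare final list to source doc
    return len(matched) > 0        # stub: real QA is another LLM pass

def run_pipeline(doc):
    raw = ocr_to_text(doc)
    items = structure_items(raw)
    matched = [match_item(it, db_lookup(it)) for it in items]
    assert qa_check(doc, matched), "QA step rejected the extraction"
    return matched
```

Because the OCR step is just one function at the top of the chain, swapping Gemini Flash for GLM-OCR meant changing that one call and rerunning the eval.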
Here's a summary of the results, courtesy of Weights & Biases:
Here are some of the things the eval measured:

- Did we get the correct number of groupings of items (called "lists" here)? Both models found 3.75 lists per document.
- Did we get the right items? Both had an accuracy of 65%.
- Was each item bucketed into the correct category? GLM-OCR had an accuracy of 76%, Gemini 75%.
As you can see, the performance differences between the two models on the eval were very small. Since I ran GLM-OCR locally, it took much longer than Gemini (20 minutes for the run vs. four and a half minutes with Gemini).
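For concreteness, metrics like these can be aggregated over an eval run roughly as follows. The field names and sample records here are hypothetical, not the actual eval harness or data:

```python
# Illustrative eval aggregation; field names and sample data are
# made up for this sketch, not taken from the real run.

def aggregate(results):
    n = len(results)
    total_items = sum(r["items_total"] for r in results)
    return {
        "lists_per_doc": sum(r["lists_found"] for r in results) / n,
        "item_accuracy": sum(r["items_correct"] for r in results) / total_items,
        "category_accuracy": sum(r["categories_correct"] for r in results) / total_items,
    }

sample = [
    {"lists_found": 4, "items_correct": 6, "categories_correct": 7, "items_total": 10},
    {"lists_found": 3, "items_correct": 7, "categories_correct": 8, "items_total": 10},
]
```

Logging one such dict per model per run is what makes a side-by-side table in a tool like Weights & Biases straightforward.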
Should you switch to GLM-OCR? Based on these results, yes, especially if you're processing documents at scale. While my CPU-based local run was 60x slower than Gemini Flash, that's a hardware limitation, not a model limitation. Running GLM-OCR on a proper GPU setup (either self-hosted or through z.ai's API) should match or exceed Gemini's latency at roughly one-sixteenth the cost (3¢ vs. 50¢ per million tokens). For our use case, processing thousands of receipts and invoices monthly, that cost difference is significant. I'm planning to move our production workload to GLM-OCR once I set up proper GPU infrastructure.
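The cost arithmetic is easy to check yourself. Only the per-million-token prices below come from this post; the tokens-per-page and monthly page volume are assumptions you should replace with your own numbers:

```python
# Back-of-envelope OCR cost comparison. Prices ($ per million input
# tokens) are from the post; page volume and tokens per page are
# assumptions for illustration only.
GEMINI_FLASH_PER_M = 0.50
GLM_OCR_PER_M = 0.03

def monthly_cost(price_per_m_tokens, pages=10_000, tokens_per_page=1_000):
    """Dollar cost for a month of OCR at the given token price."""
    return pages * tokens_per_page / 1_000_000 * price_per_m_tokens
```

At these assumed volumes the absolute dollar amounts are small either way; the ratio between the two prices is what matters as volume grows.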
One final note: when we originally did this project, I was shocked to find that the best tools for OCR were actually LLMs. The fact that you can now match that performance with a tiny, cheap model I find astounding.
