
Tiny OCR

0.9B parameter model running on CPU matched a 50¢/M token cloud API


In our paper reading group last month I presented DeepSeek’s Context Optical Compression paper, which claimed impressive performance from a much smaller model (3 billion parameters) than the frontier models I had previously used for OCR (Claude Sonnet and Gemini Flash). Then, last week, z.ai released an even smaller OCR model (0.9 billion parameters) called GLM-OCR. Since it could cost much less and potentially run faster, I tried it out on an old eval. Gemini 3 Flash costs around 50 cents per million input tokens, compared to 3 cents for GLM-OCR if you buy it from z.ai (which you don’t have to). On top of that, I decided to run it locally on my MacBook Pro’s M3 Max CPU with PyTorch (I tried the Metal GPU first, but it ran out of memory, so I fell back to CPU). The model took a long time to run locally (about a minute per page), but it performed the same as Gemini 3 Flash on the eval from a former project.
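To get a feel for the price gap, here is a back-of-the-envelope cost comparison. The per-million-token prices are the ones above; the token count per page is my own rough assumption, not a measured number.

```python
# Prices from the post; tokens-per-page is an assumption for illustration.
GEMINI_PER_M = 0.50  # $ per million input tokens (Gemini 3 Flash)
GLM_PER_M = 0.03     # $ per million input tokens (GLM-OCR via z.ai)

def cost_for_pages(pages: int, tokens_per_page: int, price_per_m: float) -> float:
    """Dollar cost of OCR-ing `pages` pages at a given per-million-token price."""
    return pages * tokens_per_page * price_per_m / 1_000_000

# e.g. 100k pages at an assumed ~1,000 input tokens per page:
gemini = cost_for_pages(100_000, 1_000, GEMINI_PER_M)
glm = cost_for_pages(100_000, 1_000, GLM_PER_M)
print(f"Gemini: ${gemini:.2f}, GLM-OCR: ${glm:.2f}")  # Gemini: $50.00, GLM-OCR: $3.00
```

At any realistic page size the ratio stays the same (the prices differ by a constant factor), so the absolute savings just scale with volume.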

This eval uses OCR as one step in a multi-step entity-extraction process.
The pipeline extracts entities from a wide variety of documents, anything from handwriting to a screenshot of a checkout cart at a restaurant.

Here’s how the pipeline works:

  1. Get the input document

  2. Convert the input document (PDF, XLS, etc.) into text (this is the step where I dropped in the new model)

  3. Use an LLM (GPT-4o) to structure the raw text into lists of items

  4. Do a database search (RAG; we used HNSW on Postgres) for each item

  5. Have the LLM match each item to a specific database item from the candidates we looked up

  6. Do a QA step comparing the original document to the final list

Here’s a summary of the results, courtesy of Weights and Biases:

Here are some of the things the eval measured:

  1. Did we get the correct number of item groupings (called lists here)? Both models found 3.75 lists per document.

  2. Did we get the right items? Both had an accuracy of 65%.

  3. Was each item bucketed into the correct category? GLM had an accuracy of 76% and Gemini 75%.

As you can see, the difference in eval performance between the two models was very small. Since I ran GLM-OCR locally, it took much longer than Gemini (20 minutes for the run vs. 4.5 minutes with Gemini).

Should you switch to GLM-OCR? Based on these results, yes, especially if you're processing documents at scale. While my CPU-based local run was far slower than Gemini Flash, that's a hardware limitation, not a model limitation. Running GLM-OCR on a proper GPU setup (either self-hosted or through z.ai's API) should match or exceed Gemini's latency at roughly one-sixteenth the cost (3¢ vs. 50¢ per million tokens). For our use case, processing thousands of receipts and invoices monthly, that cost difference is significant. I'm planning to move our production workload to GLM-OCR once I set up proper GPU infrastructure.

One final note: when we initially did this project, I was shocked to find that the best tools for OCR were actually LLMs. The fact that you can now match the best OCR performance with a tiny, cheap LLM I find astounding.
