How extraction accuracy is measured – Tekst

Tekst measures how well an extraction model is performing by comparing the values it predicts against a test set of validated examples, field by field. This article explains how that score is calculated so you can interpret your accuracy and know which fields to improve.

If you have not built a test set yet, see the "Set up your first extraction model" article first.

Scoring happens per field

Unlike a classification model, which produces a single answer per message, an extraction model produces a value for every entity. So accuracy is measured at the field level first: for each test item, Tekst compares the predicted value of each field against the value you confirmed (the "ground truth").

The overall accuracy you see is the average across all fields of all test items. Because every field counts toward the average, a model can score well overall while still struggling with one specific field - which is why the per-field breakdown matters.
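The averaging described above can be illustrated with a small sketch. It is not Tekst's implementation: the field names, the result tuples, and the helper functions `overall_accuracy` and `per_field_accuracy` are all hypothetical, chosen only to show how a good overall score can hide one weak field.

```python
from collections import defaultdict

# Hypothetical per-field results: (test item id, field name, was it correct?).
results = [
    ("item-1", "invoice_number", True),
    ("item-1", "total_amount", True),
    ("item-1", "supplier_name", False),
    ("item-2", "invoice_number", True),
    ("item-2", "total_amount", True),
    ("item-2", "supplier_name", False),
]


def overall_accuracy(results):
    """Average over every field of every test item."""
    return sum(ok for _, _, ok in results) / len(results)


def per_field_accuracy(results):
    """Average per field name, across all test items."""
    buckets = defaultdict(list)
    for _, field, ok in results:
        buckets[field].append(ok)
    return {field: sum(oks) / len(oks) for field, oks in buckets.items()}


print(round(overall_accuracy(results), 2))  # 0.67
print(per_field_accuracy(results))
```

In this made-up data the overall score is about 0.67, yet `supplier_name` is wrong every time, which only the per-field breakdown reveals.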

How values are compared

Not all fields are compared the same way, because an exact-character match is not always fair:

Text values are compared with fuzzy matching. The predicted text counts as correct when it is similar enough to the expected text, rather than requiring a character-perfect match. This avoids penalizing trivial differences such as spacing or punctuation.
Numbers and true/false values are compared exactly. They either match the confirmed value or they do not.
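As a rough illustration of the two comparison modes, the sketch below uses Python's standard `difflib.SequenceMatcher` for the fuzzy text comparison. Tekst does not document which similarity measure or threshold it uses, so the 0.9 cut-off, the `normalize` step, and the function name `values_match` are assumptions made for this example only.

```python
from difflib import SequenceMatcher

# Assumed cut-off for "similar enough"; not Tekst's actual value.
SIMILARITY_THRESHOLD = 0.9


def normalize(text):
    """Collapse repeated whitespace and ignore letter case."""
    return " ".join(text.split()).lower()


def values_match(expected, predicted, threshold=SIMILARITY_THRESHOLD):
    """Exact comparison for booleans and numbers, fuzzy comparison for text."""
    # Check booleans first: in Python, bool is a subclass of int.
    if isinstance(expected, bool) or isinstance(predicted, bool):
        return expected is predicted
    if isinstance(expected, (int, float)):
        return isinstance(predicted, (int, float)) and expected == predicted
    if predicted is None:
        return False
    ratio = SequenceMatcher(
        None, normalize(str(expected)), normalize(str(predicted))
    ).ratio()
    return ratio >= threshold


print(values_match("Acme Corp.", "Acme  Corp"))  # True: only spacing and punctuation differ
print(values_match("Acme Corp.", "Globex Inc"))  # False: different text
print(values_match(120.50, 120.49))              # False: numbers must match exactly
```

A stricter or looser threshold changes which near-misses count as correct, so the value used here should be read as a placeholder.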

List fields

When an entity is a list (for example several line items), Tekst compares the predicted list against the expected one and also reports a "too many" count: the number of extra items the model returned that were not expected. A high "too many" count tells you the model is over-extracting.
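One way to picture the extra-item count is the sketch below. It assumes list items are compared exactly and that order does not matter; Tekst's actual matching rules for list items are not described here, and `compare_lists` is a hypothetical helper.

```python
from collections import Counter


def compare_lists(expected, predicted):
    """Count matched items and extra ("too many") items, ignoring order."""
    expected_counts = Counter(expected)
    predicted_counts = Counter(predicted)
    # Items present in both lists, respecting how often each one occurs.
    matched = sum((expected_counts & predicted_counts).values())
    # Items the model returned beyond what was expected.
    too_many = sum((predicted_counts - expected_counts).values())
    return {"matched": matched, "too_many": too_many}


expected = ["Widget A", "Widget B"]
predicted = ["Widget A", "Widget B", "Shipping", "Widget B"]
print(compare_lists(expected, predicted))  # {'matched': 2, 'too_many': 2}
```

Here both expected line items were found, but the model also returned an unexpected "Shipping" line and a duplicate "Widget B", giving two extra items.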

Fields you do not want to score

Some fields are useful to extract but not meaningful to grade - for example a free-text note. You can mark these so they are excluded from the accuracy calculation, keeping your score focused on the fields that matter.
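The effect of excluding a field can be sketched as follows. The field names and the `scored_accuracy` helper are hypothetical; the sketch only shows that excluded fields are left out of the average and does not reflect how Tekst stores this setting.

```python
def scored_accuracy(field_results, excluded_fields=frozenset()):
    """Average correctness over the fields that are not excluded from scoring."""
    scored = {
        field: ok
        for field, ok in field_results.items()
        if field not in excluded_fields
    }
    if not scored:
        return None  # nothing left to grade
    return sum(scored.values()) / len(scored)


# Hypothetical results for one test item.
field_results = {
    "invoice_number": True,
    "total_amount": True,
    "internal_note": False,  # free text, not meaningful to grade
}

print(round(scored_accuracy(field_results), 2))       # 0.67
print(scored_accuracy(field_results, {"internal_note"}))  # 1.0
```

With the free-text note excluded, the score reflects only the two fields that are meaningful to grade.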

The comparison view

The clearest way to diagnose a model is the comparison view, which lists each field with its Expected value beside its Predicted value. Scanning this view shows you at a glance which fields are reliable and which need a clearer description or more examples in the test set.

Improving the score

The most effective levers are sharper entity descriptions, the right output format for each field, and a richer test set that covers the document variations you see in practice. As your team corrects extracted values over time, those corrections also feed back into the model.

For related background, see the "What data does Tekst store and how does the learning process work?" article.