How classification accuracy is measured

Tekst measures how well a classification model is performing by comparing its predictions against a test set of validated examples. This article explains how that score is calculated so you can interpret the accuracy you see and know where to focus your improvements.

If you have not built a test set yet, see the "Set up your first classification model" article first.

The test set is the source of truth

Accuracy is always measured against your test set: a collection of messages for which you have confirmed the correct label (the "ground truth"). For each item, Tekst compares the label the model predicted against the label you confirmed. The more representative your test set, the more meaningful the accuracy figure.

This means accuracy cannot be calculated until you have added test items with confirmed labels. Until then, the model's accuracy is shown as unknown.
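As an illustration of the rule above, here is a minimal Python sketch. It is not Tekst's actual code: the function name and the (confirmed, predicted) pair format are assumptions made for this example. It returns None, standing in for "unknown", when there are no confirmed test items.

```python
from typing import Optional, Sequence, Tuple


def accuracy(pairs: Sequence[Tuple[str, str]]) -> Optional[float]:
    """Return the share of correct predictions, or None for an empty test set.

    Each pair is a hypothetical (confirmed_label, predicted_label) tuple.
    """
    if not pairs:
        # No confirmed labels yet, so there is nothing to compare against.
        return None
    correct = sum(1 for confirmed, predicted in pairs if confirmed == predicted)
    return correct / len(pairs)


print(accuracy([]))  # None
print(accuracy([("billing", "billing"), ("billing", "support")]))  # 0.5
```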

Single-label scoring

For a single-label model, scoring is straightforward: a prediction is correct only if it exactly matches the confirmed label, and wrong otherwise.

Tekst reports:

The overall accuracy percentage across all test items.
The number of correct and wrong predictions.
A per-label breakdown so you can see which categories perform well and which struggle.
A confusion matrix that shows, for each confirmed label, which labels the model predicted instead. This is the fastest way to spot two labels that the model keeps mixing up.
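The first three figures in this list can be reproduced with a few lines of Python. This is a sketch under assumptions, not Tekst's implementation: the function name, the dictionary keys, and the (confirmed, predicted) pair format are invented for illustration.

```python
from collections import defaultdict


def single_label_report(pairs):
    """Summarise single-label results.

    pairs: list of hypothetical (confirmed_label, predicted_label) tuples.
    """
    # Count outcomes per confirmed label; a prediction is correct only on an exact match.
    per_label = defaultdict(lambda: {"correct": 0, "wrong": 0})
    for confirmed, predicted in pairs:
        outcome = "correct" if predicted == confirmed else "wrong"
        per_label[confirmed][outcome] += 1

    correct = sum(counts["correct"] for counts in per_label.values())
    wrong = sum(counts["wrong"] for counts in per_label.values())
    total = correct + wrong
    return {
        "accuracy_pct": 100.0 * correct / total if total else None,
        "correct": correct,
        "wrong": wrong,
        "per_label": dict(per_label),
    }


example_pairs = [
    ("billing", "billing"),
    ("billing", "support"),
    ("support", "support"),
    ("sales", "sales"),
]
report = single_label_report(example_pairs)
print(report["accuracy_pct"])          # 75.0
print(report["correct"], report["wrong"])  # 3 1
print(report["per_label"]["billing"])  # {'correct': 1, 'wrong': 1}
```

In this example the "billing" label is the weak spot: half of its test items were predicted as something else.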

Reading the confusion matrix

Each row represents the confirmed (correct) label and each column the predicted label. Values on the diagonal are correct predictions. A large value off the diagonal means the model often predicts the column's label for messages whose confirmed label is the row's label. This usually points to two categories whose descriptions overlap too much, so sharpen the wording of those labels or add more examples.
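To make the row and column layout concrete, the sketch below builds a small confusion matrix and picks out the most frequent mix-up. The function names and sample labels are hypothetical, and the code only illustrates the idea; it does not describe how Tekst computes the matrix internally.

```python
from collections import Counter


def confusion_matrix(pairs, labels):
    """Rows are confirmed labels, columns are predicted labels.

    pairs: list of hypothetical (confirmed_label, predicted_label) tuples.
    """
    counts = Counter(pairs)  # missing combinations count as 0
    return [[counts[(row, col)] for col in labels] for row in labels]


def most_confused(pairs):
    """Return ((confirmed, predicted), count) for the largest off-diagonal cell."""
    off_diagonal = Counter((c, p) for c, p in pairs if c != p)
    return off_diagonal.most_common(1)[0] if off_diagonal else None


labels = ["billing", "refund", "support"]
pairs = [
    ("billing", "billing"),
    ("billing", "refund"),
    ("billing", "refund"),
    ("refund", "refund"),
    ("support", "support"),
]

for label, row in zip(labels, confusion_matrix(pairs, labels)):
    print(label, row)
# billing [1, 2, 0]
# refund [0, 1, 0]
# support [0, 0, 1]

print(most_confused(pairs))  # (('billing', 'refund'), 2)
```

Here two "billing" messages were predicted as "refund", which suggests those two label descriptions need to be told apart more clearly.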

Multi-label scoring

For a multi-label model, a single message can have several correct labels, so scoring compares the full set of predicted labels against the full set of confirmed labels. An item counts as correct only when the predicted set exactly matches the confirmed set.
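The exact-match rule can be sketched with Python sets. The function name and the sample data are assumptions for illustration only, not part of Tekst.

```python
def exact_match_accuracy(items):
    """Share of items whose predicted label set equals the confirmed set.

    items: list of hypothetical (confirmed_labels, predicted_labels) pairs,
    where each side is any iterable of label names.
    """
    if not items:
        return None  # no test items, so accuracy is unknown
    # Converting to sets ignores label order and duplicates.
    correct = sum(
        1 for confirmed, predicted in items if set(confirmed) == set(predicted)
    )
    return correct / len(items)


items = [
    (["billing", "urgent"], ["urgent", "billing"]),  # same set, order differs: correct
    (["billing", "urgent"], ["billing"]),            # one label missing: wrong
    (["sales"], ["sales", "urgent"]),                # one label extra: wrong
    (["support"], ["support"]),                      # correct
]
print(exact_match_accuracy(items))  # 0.5
```

Note how strict this is: the second and third items each get one label right, yet both count as wrong for the overall score.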

To help you see where the model goes wrong, Tekst groups the individual labels as follows:

Correct: labels the model predicted that were expected.
Missed: expected labels the model failed to predict.
Too many: labels the model predicted that were not expected.

This tells you whether the model is being too cautious (missing labels) or too eager (adding labels that do not belong).
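The three groups map directly onto set operations. The sketch below is illustrative only: the function names are hypothetical, and the "too cautious" versus "too eager" comparison is a simple heuristic chosen for this example, not a rule taken from Tekst.

```python
def label_breakdown(confirmed, predicted):
    """Split one item's labels into correct, missed and too-many groups."""
    confirmed, predicted = set(confirmed), set(predicted)
    return {
        "correct": confirmed & predicted,   # predicted and expected
        "missed": confirmed - predicted,    # expected but not predicted
        "too_many": predicted - confirmed,  # predicted but not expected
    }


def tendency(items):
    """Compare total missed labels with total extra labels across all items.

    items: list of hypothetical (confirmed_labels, predicted_labels) pairs.
    """
    missed = sum(len(label_breakdown(c, p)["missed"]) for c, p in items)
    too_many = sum(len(label_breakdown(c, p)["too_many"]) for c, p in items)
    if missed > too_many:
        return "too cautious"
    if too_many > missed:
        return "too eager"
    return "balanced"


print(label_breakdown(["billing", "urgent"], ["billing", "sales"]))
# {'correct': {'billing'}, 'missed': {'urgent'}, 'too_many': {'sales'}}

items = [
    (["billing", "urgent"], ["billing"]),
    (["sales"], ["sales"]),
    (["support", "urgent"], ["support"]),
]
print(tendency(items))  # too cautious
```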

Improving the score

Use the breakdowns above to decide where to act. The most effective levers are usually clearer label descriptions and a richer, more balanced test set. As your team corrects predictions over time, those corrections also feed back into the model.