
Confidence Threshold

Confidence Threshold is the lowest accepted confidence.



Threshold Purpose

Some chatbot engines let you set a Confidence Threshold. If the engine is not sure enough when classifying an intent (its confidence is below the Confidence Threshold), it answers with the incomprehension intent to show that it did not understand.

You can also handle low-confidence responses programmatically: drop the intent, ask back, display alternatives, and so on.
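As an illustration of handling low-confidence responses in code, here is a minimal Python sketch. The `IntentCandidate` type, the `handle_response` function, the action names, and the `ask_back_margin` parameter are all hypothetical and not part of any chatbot engine's API; it assumes the engine returns a list of intent candidates with confidences.

```python
from dataclasses import dataclass


@dataclass
class IntentCandidate:
    """Hypothetical shape of one intent candidate returned by an engine."""
    name: str
    confidence: float


def handle_response(candidates, threshold=0.5, ask_back_margin=0.15):
    """Return an (action, payload) pair for a list of intent candidates.

    Actions (illustrative names only):
      accept   - confidence is at or above the threshold
      confirm  - slightly below the threshold: ask back ("Did you mean ...?")
      choose   - well below the threshold: display the top alternatives
      fallback - nothing usable: drop the intent
    """
    if not candidates:
        return ("fallback", None)

    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]

    if best.confidence >= threshold:
        return ("accept", best.name)
    if best.confidence >= threshold - ask_back_margin:
        return ("confirm", best.name)

    # Offer up to three alternatives that have a non-zero confidence.
    alternatives = [c.name for c in ranked[:3] if c.confidence > 0]
    if len(alternatives) > 1:
        return ("choose", alternatives)
    return ("fallback", None)
```

For example, a single candidate with confidence 0.4 and the default settings would lead to a "confirm" action rather than an incomprehension answer.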

This chart helps you find the optimal Confidence Threshold:
  • If Precision, Recall, and F1 Score have no maximum, the chatbot or the test set is unbalanced. This should be investigated further.

  • The Confidence Threshold should lie between the maximum of Precision and the maximum of Recall.

  • For general-purpose chatbots, a good Confidence Threshold is at the maximum of the F1 Score.

  • For a chatbot that must be precise, such as a banking bot, use this chart to pick a Confidence Threshold between the maximum of the F1 Score and the maximum of Precision. (You gain Precision but lose Recall.)

  • If a "don't understand" answer is worse than an incorrect answer, as in an FAQ bot, pick a Confidence Threshold between the maximum of the F1 Score and the maximum of Recall. (You gain Recall but lose Precision.)
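The threshold selection described above can be sketched in Python. This is not how the chart is computed internally; it assumes a simplified counting scheme in which each test result is a `(confidence, is_correct)` pair, an accepted correct answer is a true positive, an accepted incorrect answer is a false positive, and a correct answer rejected by the threshold is a false negative. The function names are hypothetical.

```python
def metrics_at(results, threshold):
    """Precision, Recall, and F1 Score if `threshold` were applied.

    results: list of (confidence, is_correct) pairs, recorded with the
    engine's own Confidence Threshold turned off (assumed input format).
    """
    tp = sum(1 for conf, ok in results if conf >= threshold and ok)
    fp = sum(1 for conf, ok in results if conf >= threshold and not ok)
    fn = sum(1 for conf, ok in results if conf < threshold and ok)

    # Precision is undefined when nothing is accepted; 0.0 is used here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def sweep(results, steps=20):
    """Evaluate the metrics at evenly spaced thresholds from 0 to 1."""
    rows = []
    for i in range(steps + 1):
        threshold = i / steps
        rows.append((threshold, *metrics_at(results, threshold)))
    return rows


def best_f1_threshold(rows):
    """Lowest threshold at which the F1 Score reaches its maximum."""
    return max(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    sample = [(0.95, True), (0.9, True), (0.8, True), (0.6, False),
              (0.55, True), (0.3, False), (0.2, False)]
    for threshold, precision, recall, f1 in sweep(sample):
        print(f"{threshold:.2f}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
    print("max F1 at threshold", best_f1_threshold(sweep(sample)))
```

For a precision-oriented bot you would move up from the value returned by `best_f1_threshold` toward the Precision maximum; for an FAQ-style bot you would move down toward the Recall maximum.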

Note: You can read more about Precision, Recall, and F1 Score in this article by Botium's own Florian Treml.

An intent is recognized as correct if:

  • there is an Intent Asserter and it does not fail, or

  • there is no asserter and the response is not the incomprehension intent.
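The rule above can be written as a small Python predicate. This is only a sketch of the stated rule, not the tool's actual implementation; the function and parameter names are hypothetical.

```python
def intent_is_correct(asserter_passed, is_incomprehension):
    """Apply the correctness rule to one test step.

    asserter_passed: True or False if an Intent Asserter ran,
                     None if the step has no asserter (assumed encoding).
    is_incomprehension: True if the response was the incomprehension intent.
    """
    if asserter_passed is not None:
        # An asserter exists: the intent is correct only if it did not fail.
        return asserter_passed
    # No asserter: any response other than incomprehension counts as correct.
    return not is_incomprehension
```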

If the chatbot engine supports a Confidence Threshold, the best practice is to turn it off or set it as low as possible BEFORE STARTING THE TEST SESSION, so that the test covers the widest possible range of confidences.

You can examine the confidence with this chart. As the threshold rises, Precision is expected to increase and Recall to decrease; anything else requires deeper investigation. In the example in our screenshot, Recall decreases as expected. Precision increases at first but then levels off, because there were some false positives with confidence 1.

Tip: Setting the Confidence Threshold is fine-tuning. Build the chatbot, test it, set the Confidence Threshold, and then publish it.

This chart can also be helpful if you want to compare several chatbot engines.

Confidence Score Reliability

Reliability Purpose

A Confidence Threshold chart can only be used with a balanced chatbot and a balanced test set. If both are balanced, the percentage of correct responses must increase with the confidence score. Some fluctuation, even occasional drops, is acceptable, but the overall trend must be increasing.

Ideally the percentage increases linearly from 0% to 100%. In this sample chart you can see:
  • There are no correct answers below confidence score 0.2. (The Recall line of the Confidence Threshold chart is at 1.) The Confidence Threshold should be at least 0.2.

  • There are no incorrect answers above confidence score 0.7. (The Precision line of the Confidence Threshold chart is at 1.) The Confidence Threshold should be at most 0.7.

  • The trend between 0.2 and 0.7 looks correct. (We expect some correlation between the confidence score and the percentage of correct answers; otherwise it makes no sense to tune the Confidence Threshold.)

  • The maximum of the F1 Score is between 0.3 and 0.35. The Confidence Score Reliability chart shows why: in that range there are only correct responses.
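To make the reading of such a chart concrete, the following Python sketch groups test results into confidence bins and reports the percentage of correct answers per bin, plus the lower and upper bounds discussed above. It assumes results are available as `(confidence, is_correct)` pairs; the function names and the bin count are hypothetical, and this is not how the chart itself is produced.

```python
def reliability_bins(results, bin_count=10):
    """Percentage of correct answers per confidence bin.

    Returns a list of (bin_start, bin_end, percent_correct) tuples;
    percent_correct is None for bins without any test result.
    """
    totals = [0] * bin_count
    correct = [0] * bin_count
    for conf, ok in results:
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * bin_count), bin_count - 1)
        totals[idx] += 1
        if ok:
            correct[idx] += 1

    rows = []
    for i in range(bin_count):
        pct = 100.0 * correct[i] / totals[i] if totals[i] else None
        rows.append((i / bin_count, (i + 1) / bin_count, pct))
    return rows


def threshold_bounds(results):
    """Lowest confidence of a correct answer and highest confidence of an
    incorrect answer; a sensible threshold lies between the two."""
    correct_conf = [conf for conf, ok in results if ok]
    incorrect_conf = [conf for conf, ok in results if not ok]
    lower = min(correct_conf) if correct_conf else None
    upper = max(incorrect_conf) if incorrect_conf else None
    return lower, upper
```

If the percentages returned by `reliability_bins` show no upward trend across the bins, the confidence score carries little information and tuning the threshold is unlikely to help.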
