Confidence Threshold Purpose
The Confidence Threshold is the lowest confidence score that is still accepted.
Some chatbot engines let you set a Confidence Threshold. If the engine is not sure enough when classifying an intent (its confidence is below the Confidence Threshold), it answers with an incomprehension intent to show that it does not understand.
You can also handle low-confidence responses programmatically: drop the intent, ask back, display alternatives, and so on.
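A minimal sketch of such programmatic handling follows. The function name, the default threshold of 0.5, the "ask back" margin, and the order in which the strategies are tried are all assumptions made for illustration; they are not taken from any particular chatbot engine.

```python
def handle_prediction(intent, confidence, alternatives,
                      threshold=0.5, ask_back_margin=0.2):
    """Decide what to do with a classified intent (hypothetical policy).

    alternatives is a list of (intent_name, confidence) pairs.
    Returns an (action, payload) tuple.
    """
    # Confident enough: use the intent as it is.
    if confidence >= threshold:
        return ("answer", intent)
    # Slightly below the threshold: ask the user to confirm the intent.
    if confidence >= threshold - ask_back_margin:
        return ("ask_back", intent)
    # Far below the threshold: offer the best alternatives, if there are any.
    if alternatives:
        ranked = sorted(alternatives, key=lambda item: item[1], reverse=True)
        return ("show_alternatives", [name for name, _ in ranked[:3]])
    # Nothing usable: drop the intent.
    return ("drop", None)
```

For example, `handle_prediction("transfer_money", 0.4, [])` falls into the "ask back" branch, because 0.4 is below the threshold but within the margin.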
This chart helps you to find the optimal Confidence Threshold:
Precision, Recall, and F1 Score curves without a maximum mean that the chatbot or the test set is unbalanced. This should be investigated further.
The Confidence Threshold should lie between the maximum of Precision and the maximum of Recall.
For general chatbots, a good Confidence Threshold is at the maximum of the F1 Score.
For a precise chatbot, such as a banking bot, we can use this chart to find a Confidence Threshold between the maximum of the F1 Score and the maximum of Precision. (Win Precision but lose Recall.)
If a 'don't understand' answer is worse than an incorrect answer, as in a FAQ bot, we can do the same between the maximum of the F1 Score and the maximum of Recall. (Win Recall but lose Precision.)
You can read about Precision, Recall and F1 score here: Quality Metrics for NLU/Chatbot Training Data.
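The threshold search described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not necessarily how the chart is computed: each test result is reduced to a confidence and a correct/incorrect flag, an answer counts as accepted when its confidence is at or above the threshold, Precision is the share of accepted answers that are correct, and Recall is the share of all test utterances that are answered correctly. The names `Result`, `metrics_at`, and `best_f1_threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    confidence: float
    correct: bool


def metrics_at(results, threshold):
    """Return (precision, recall, f1) for one threshold (simplified definitions)."""
    accepted = [r for r in results if r.confidence >= threshold]
    true_positives = sum(r.correct for r in accepted)
    # Convention chosen here: precision is 0.0 when nothing is accepted.
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(results) if results else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_threshold(results, steps=20):
    """Scan thresholds 0, 1/steps, ..., 1 and return the first one with maximal F1."""
    thresholds = [i / steps for i in range(steps + 1)]
    return max(thresholds, key=lambda t: metrics_at(results, t)[2])


# Made-up sample data, for illustration only.
sample = [
    Result(0.12, False), Result(0.22, False), Result(0.33, True),
    Result(0.52, True), Result(0.61, False), Result(0.83, True),
    Result(0.91, True),
]
print(best_f1_threshold(sample))
```

A precision-oriented bot would then search to the right of this threshold, and a recall-oriented bot to the left of it.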
An intent is recognized as correct if:
There is an Intent Asserter and it does not fail.
There is no asserter and the response is not an incomprehension.
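The rule above can be written as a small predicate. This is only a sketch: the label `"incomprehension"` and the parameter names are assumptions for illustration, and `asserter_passed` is `None` when no Intent Asserter exists for the test step.

```python
from typing import Optional

# Hypothetical label for the "don't understand" intent.
INCOMPREHENSION = "incomprehension"


def intent_is_correct(predicted_intent: str,
                      asserter_passed: Optional[bool]) -> bool:
    """Apply the correctness rule to a single response."""
    # An Intent Asserter exists: its outcome decides.
    if asserter_passed is not None:
        return asserter_passed
    # No asserter: any response except incomprehension counts as correct.
    return predicted_intent != INCOMPREHENSION
```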
If the Chatbot Engine supports Confidence Threshold, then the best practice is to turn it off, or set it as low as possible BEFORE STARTING THE TEST SESSION to cover the largest spectrum of possible confidences!
You can examine the confidence with this chart. Precision is expected to increase and recall to decrease as the threshold rises. Anything else requires deeper investigation.
In our screenshot, recall is decreasing. Precision increases at first, but then it stops. This is because there were some false positive cases with confidence 1.
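The effect of a false positive with confidence 1 can be reproduced with made-up data. In this sketch, Precision is assumed to be the share of accepted answers (confidence at or above the threshold) that are correct; the function name and the data are hypothetical.

```python
def precision_at(results, threshold):
    """Share of correct answers among those accepted at this threshold."""
    accepted = [correct for confidence, correct in results
                if confidence >= threshold]
    return sum(accepted) / len(accepted) if accepted else None


# Two low-confidence mistakes, three confident hits,
# and one false positive with confidence 1.
results = [
    (0.2, False), (0.4, False),
    (1.0, True), (1.0, True), (1.0, True),
    (1.0, False),
]

thresholds = (0.0, 0.3, 0.5, 0.7, 1.0)
curve = [precision_at(results, t) for t in thresholds]
print(curve)
```

Raising the threshold removes the two low-confidence mistakes, so Precision rises at first. The false positive at confidence 1 can never be filtered out, so Precision stops below 1 however high the threshold is set.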
This chart can also be helpful if you want to compare several chatbot engines.
Setting the Confidence Threshold is fine-tuning. Build the chatbot, test it, set the Confidence Threshold, and publish it.
Confidence Score Reliability
Confidence Score Reliability Purpose
The Confidence Threshold chart can only be used with a balanced chatbot and a balanced test set.
If both of them are balanced, the percentage of correct responses must increase with the confidence score. Some flakiness, even occasional drops, is acceptable, but the trend must be increasing.
Ideally, the percentage should increase linearly from 0% to 100%.
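One way to approximate such a reliability curve from raw test results is to group the answers into confidence buckets and compute the percentage of correct answers in each bucket. The sketch below assumes equally wide buckets and made-up data; the function name and the bucket count are hypothetical, and the actual chart may be computed differently.

```python
def reliability_by_bucket(results, bucket_count=5):
    """Percent of correct answers per confidence bucket (None for empty buckets)."""
    buckets = [[] for _ in range(bucket_count)]
    for confidence, correct in results:
        # A confidence of exactly 1.0 goes into the last bucket.
        index = min(int(confidence * bucket_count), bucket_count - 1)
        buckets[index].append(correct)
    return [100.0 * sum(b) / len(b) if b else None for b in buckets]


# Made-up sample data, for illustration only.
results = [
    (0.05, False), (0.15, False), (0.25, False), (0.35, True),
    (0.45, True), (0.55, False), (0.65, True), (0.75, True),
    (0.9, True), (1.0, True),
]
print(reliability_by_bucket(results))
```

In this sample the percentage never drops from one bucket to the next, which is the kind of trend the chart should show.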
You can see in this sample chart:
There are no correct answers below confidence score 0.2. (Reliability line of the Confidence Threshold chart is at confidence 1). The Confidence Threshold should be at least 0.2.
There are no incorrect answers above confidence score 0.7. (Precision line of the Confidence Threshold chart is at confidence 1). The Confidence Threshold should be at most 0.7.
The trend between 0.2 and 0.7 looks correct. (We expect some correlation between the confidence score and the percentage of correct answers; otherwise it makes no sense to tune the Confidence Threshold.)
The maximum of the F1 score is between 0.3 and 0.35. The Confidence Score Reliability chart shows why: there are only correct responses in that range.
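The lower and upper bounds read from the chart can also be derived directly from the test results: the lowest confidence of any correct answer and the highest confidence of any incorrect answer. The sketch below uses made-up data chosen to reproduce the 0.2 and 0.7 bounds; the function name is hypothetical.

```python
def threshold_bounds(results):
    """Return (lowest confidence of a correct answer,
               highest confidence of an incorrect answer)."""
    correct = [confidence for confidence, ok in results if ok]
    incorrect = [confidence for confidence, ok in results if not ok]
    lower = min(correct) if correct else None
    upper = max(incorrect) if incorrect else None
    return lower, upper


# Made-up sample data, for illustration only.
results = [
    (0.1, False), (0.2, True), (0.35, True),
    (0.5, False), (0.7, False), (0.9, True),
]
print(threshold_bounds(results))
```

A threshold below the first value only lets additional incorrect answers through, and a threshold above the second value only rejects additional correct answers, so the useful range lies between the two.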