The Confidence Threshold is the lowest confidence score at which a recognized intent is still accepted.
Threshold Purpose
In some Chatbot Engines you can set a Confidence Threshold. If the Engine is not sure enough when classifying an intent (its confidence is below the Confidence Threshold), it answers with the incomprehension intent to show that it does not understand.
You can also deal with low-confidence responses programmatically: dropping the intent, asking back, displaying alternatives, and so on.
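The options above can be sketched as a small decision function. This is only an illustrative sketch: the names `IntentCandidate` and `handle_low_confidence`, and the `floor` and `margin` parameters, are hypothetical and not part of any particular Chatbot Engine.

```python
from dataclasses import dataclass


@dataclass
class IntentCandidate:
    name: str
    confidence: float


def handle_low_confidence(candidates, threshold=0.5, floor=0.2, margin=0.1):
    """Return an (action, intent_names) pair for a list of intent candidates.

    threshold, floor and margin are assumed tuning values, not engine defaults.
    """
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    # Nothing usable: drop the intent entirely.
    if not ranked or ranked[0].confidence < floor:
        return ("drop_intent", [])
    top = ranked[0]
    # Confident enough: answer normally.
    if top.confidence >= threshold:
        return ("answer", [top.name])
    # Several candidates are close to the best one: let the user choose.
    close = [c.name for c in ranked if top.confidence - c.confidence <= margin]
    if len(close) > 1:
        return ("show_alternatives", close)
    # One weak candidate: ask the user to confirm it.
    return ("ask_back", [top.name])
```

For example, two candidates at 0.4 and 0.35 with the default values would lead to the alternatives being displayed, while a single candidate at 0.3 would trigger a confirmation question.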
- If Precision, Recall, or F1 Score has no maximum, the chatbot or the test set is unbalanced. This should be investigated further.
- The Confidence Threshold should lie between the maximum of Precision and the maximum of Recall.
- For general chatbots, a good Confidence Threshold is at the maximum of the F1 Score.
- For a precise chatbot, like a banking bot, we can choose a Confidence Threshold between the maximum of the F1 Score and the maximum of Precision using this chart. (This gains Precision but loses Recall.)
- If a "don't understand" answer is worse than an incorrect answer, as in a FAQ bot, we can do the same between the maxima of the F1 Score and Recall. (This gains Recall but loses Precision.)
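To make the relationship between the three curves concrete, here is a minimal sketch that recomputes Precision, Recall, and F1 Score for a given threshold from raw test results. It assumes the results are available as `(confidence, is_correct)` pairs, that an answer below the threshold counts as rejected, and that a rejected but correct answer counts as a false negative. The function names are hypothetical, and the exact formulas used by the chart may differ.

```python
def metrics_at(results, threshold):
    """Precision, recall and F1 if answers below `threshold` are rejected.

    results: list of (confidence, is_correct) pairs (assumed input format).
    """
    tp = sum(1 for conf, ok in results if conf >= threshold and ok)
    fp = sum(1 for conf, ok in results if conf >= threshold and not ok)
    fn = sum(1 for conf, ok in results if conf < threshold and ok)
    # Convention chosen here: with nothing accepted, precision counts as 1.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_threshold(results, steps=20):
    """Lowest threshold on an evenly spaced grid that maximizes F1."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: metrics_at(results, t)[2])
```

Starting from the threshold returned by `best_f1_threshold`, a precise bot would move the threshold up toward the Precision maximum, and a FAQ-style bot would move it down toward the Recall maximum.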
An intent is recognized as correct if:
- there is an Intent Asserter, and it does not fail, or
- there is no asserter, and the response is not the incomprehension intent.
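The rule above can be written as a small predicate. This is a sketch under assumptions: `asserter_passed` is a hypothetical tri-state flag (`None` when no Intent Asserter is configured), and the name of the incomprehension intent depends on the Chatbot Engine.

```python
from typing import Optional


def is_intent_correct(asserter_passed: Optional[bool],
                      response_intent: str,
                      incomprehension_intent: str = "incomprehension") -> bool:
    """Apply the correctness rule to one test step.

    asserter_passed: True/False if an Intent Asserter ran, None if there is none.
    incomprehension_intent: assumed name; real engines use their own.
    """
    if asserter_passed is not None:
        # An asserter exists: its result decides.
        return asserter_passed
    # No asserter: any answer other than incomprehension counts as correct.
    return response_intent != incomprehension_intent
```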
If the Chatbot Engine supports a Confidence Threshold, the best practice is to turn it off, or set it as low as possible, BEFORE STARTING THE TEST SESSION, so that the test covers the widest possible range of confidences.
This chart may also be helpful if you want to compare several chatbot engines.
Confidence Score Reliability
Reliability Purpose
A Confidence Threshold chart can only be used with a balanced chatbot and a balanced test set. If both are balanced, the percentage of correct responses should increase with the confidence score. Some fluctuation, even an occasional drop, is acceptable, but the overall trend must be increasing.
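One way to check this trend outside the chart is to group the test results into confidence bins and look at the share of correct answers per bin. The sketch below assumes the same `(confidence, is_correct)` input format as an illustration. Both function names are hypothetical, and the least-squares slope is just one simple choice of trend measure.

```python
def reliability_bins(results, bins=10):
    """Share of correct answers per confidence bin; empty bins are skipped.

    Returns (bin_lower_edge, fraction_correct) pairs.
    """
    totals = [0] * bins
    correct = [0] * bins
    for conf, ok in results:
        i = min(int(conf * bins), bins - 1)  # confidence 1.0 goes into the last bin
        totals[i] += 1
        correct[i] += 1 if ok else 0
    return [(i / bins, correct[i] / totals[i]) for i in range(bins) if totals[i]]


def trend_slope(points):
    """Least-squares slope of (x, y) points; positive means an increasing trend."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx if sxx else 0.0
```

A slope at or below zero suggests that the confidence score carries little information about correctness, in which case the chatbot or the test set is worth investigating before tuning the threshold.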
- There are no correct answers below confidence score 0.2 (the Recall line of the Confidence Threshold chart is still at 1 there), so the Confidence Threshold should be at least 0.2.
- There are no incorrect answers above confidence score 0.7 (the Precision line of the Confidence Threshold chart is at 1 there), so the Confidence Threshold should be at most 0.7.
- The trend between 0.2 and 0.7 looks correct. We expect some correlation between the confidence score and the percentage of correct answers; otherwise it makes no sense to tune the Confidence Threshold.
- The maximum of the F1 Score is between 0.3 and 0.35. The Confidence Score Reliability chart shows why: in that range there are only correct responses.
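The lower and upper limits read from the chart in this example can also be derived directly from raw results. The following sketch assumes `(confidence, is_correct)` pairs as input, and the function name `threshold_bounds` is hypothetical.

```python
def threshold_bounds(results):
    """Return (lower, upper) limits for a sensible Confidence Threshold.

    lower: lowest confidence of any correct answer (nothing correct below it).
    upper: highest confidence of any incorrect answer (nothing incorrect above it).
    Either value is None if there are no correct or no incorrect answers.
    """
    correct = [conf for conf, ok in results if ok]
    incorrect = [conf for conf, ok in results if not ok]
    lower = min(correct) if correct else None
    upper = max(incorrect) if incorrect else None
    return lower, upper
```

If `lower` turns out to be greater than `upper`, correct and incorrect answers do not overlap at all, and any threshold between the two values separates them.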