The Confidence Threshold chart is only meaningful for a balanced chatbot and a balanced test set.
If both of them are balanced, the percentage of correct responses must increase as the confidence score increases. Some flakiness, even fallbacks, is acceptable, but the overall trend must be increasing.
Ideally, the percentage increases linearly from 0% to 100%.
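
As a rough illustration of the expected trend, the percentage of correct responses can be computed per confidence bucket. This is only a sketch: the function name `correct_rate_by_bucket` and the input shape (a list of `(confidence, is_correct)` pairs) are assumptions made for this example, not part of the charting tool.

```python
def correct_rate_by_bucket(results, n_buckets=10):
    """Return the share of correct responses for each confidence bucket.

    `results` is a hypothetical list of (confidence, is_correct) pairs,
    with confidence in the range 0.0 to 1.0.
    Empty buckets are reported as None.
    """
    totals = [0] * n_buckets
    correct = [0] * n_buckets
    for confidence, is_correct in results:
        # A confidence of exactly 1.0 falls into the last bucket.
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        totals[idx] += 1
        if is_correct:
            correct[idx] += 1
    return [
        (correct[i] / totals[i]) if totals[i] else None
        for i in range(n_buckets)
    ]


if __name__ == "__main__":
    sample = [(0.05, False), (0.15, False), (0.45, True),
              (0.45, False), (0.95, True)]
    for i, rate in enumerate(correct_rate_by_bucket(sample)):
        label = "no data" if rate is None else f"{rate:.0%}"
        print(f"{i / 10:.1f} to {(i + 1) / 10:.1f}: {label}")
```

For a balanced chatbot and test set, the non-empty buckets should show a generally rising rate from the lowest to the highest confidence bucket.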
This sample chart shows the following:

There are no correct answers below a confidence score of 0.2. (The Reliability line of the Confidence Threshold chart is at confidence 1.) The Confidence Threshold should therefore be at least 0.2.

There are no incorrect answers above a confidence score of 0.7. (The Precision line of the Confidence Threshold chart is at confidence 1.) The Confidence Threshold should therefore be at most 0.7.

The trend between 0.2 and 0.7 looks correct. We expect some correlation between the confidence score and the percentage of correct answers; otherwise it makes no sense to tune the Confidence Threshold.

The maximum of the F1 score is between 0.3 and 0.35. The Confidence Score Reliability chart shows why: there are only correct responses in that range.
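
The observations above can be reproduced from raw test results with a short script. The sketch below is an assumption-laden illustration, not the tool's own calculation: it treats a response as accepted when its confidence is at or above the threshold, defines precision as the share of accepted responses that are correct, and defines recall as the share of all correct responses that are accepted. The function names and the `(confidence, is_correct)` input shape are hypothetical.

```python
def threshold_bounds(results):
    """Return (lowest confidence of a correct answer,
               highest confidence of an incorrect answer)."""
    correct = [c for c, ok in results if ok]
    incorrect = [c for c, ok in results if not ok]
    lower = min(correct) if correct else None
    upper = max(incorrect) if incorrect else None
    return lower, upper


def f1_at(results, threshold):
    """F1 score when only responses with confidence >= threshold are accepted.

    Assumed definitions (not taken from the charting tool):
      precision = accepted correct / accepted
      recall    = accepted correct / all correct
    """
    accepted = [ok for c, ok in results if c >= threshold]
    true_pos = sum(1 for ok in accepted if ok)
    total_correct = sum(1 for _, ok in results if ok)
    if true_pos == 0 or total_correct == 0:
        return 0.0
    precision = true_pos / len(accepted)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)


def best_threshold(results, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at(results, t))


if __name__ == "__main__":
    sample = [(0.1, False), (0.25, True), (0.3, False), (0.5, True),
              (0.65, False), (0.8, True), (0.9, True)]
    print("bounds:", threshold_bounds(sample))
    candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    print("best threshold:", best_threshold(sample, candidates))
```

The bounds give the range in which tuning is worthwhile, and the F1 scan picks a single candidate inside it. When several candidates share the maximum F1 score, `max` returns the first one.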
