The Botium Dashboard visualizes the NLP performance metrics and suggests steps for improving it.
It shows every piece of test data that did not return the expected intent, returned it with a low confidence score, or returned it with a confidence score close to another intent’s.
- Open any session in a project
- Open the drawer at the top of the page to reveal the session's settings
- Activate the switch titled Skip penalty for incorrect intent predictions
1st Glance: The Attention Box
The Attention Box shows any alarming events Botium was able to identify. Issues with the CORRECTNESS of the test results will be visualized here:
- Predicted intent doesn’t match the expected intent
- Entities have not been recognized
- Test data is not suitable for analyzing with Botium
Clicking on the message shows the detailed records Botium identified as the source of the problems.
In this case, there are 591 user examples for which the chatbot predicted an intent other than the expected one. In most cases this means that the training data for the NLU engine has to be refined further by adding more user examples to the expected intent, and perhaps removing similar user examples from the incorrectly predicted intent (see Annotate and Augment NLP Training Dataset to learn more).
2nd Glance: Intent Confidence Distribution Chart
This histogram tells us that we have some poor-performing user examples in our test session - meaning the NLP engine returned a low confidence score for them. Issues with the CONFIDENCE of the test results will be visualized here.
Click on the left-most pile on the chart to see the poor-performing user examples, the expected and the predicted intent as well as the confidence score.
A low confidence score usually means that the NLP engine was not able to properly predict the intent for a user example, often resulting in the infamous Sorry, I don’t understand response.
Read here to learn more: Intent Confidence Distribution
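The binning behind such a chart is straightforward. Here is a minimal sketch (the function name and record fields are assumptions, not Botium's API) that buckets confidence scores into equal-width bins, so the left-most bins collect the poor-performing user examples:

```python
from collections import Counter

def confidence_histogram(results, bins=10):
    """Bucket confidence scores (0..1) into equal-width bins (sketch)."""
    hist = Counter()
    for r in results:
        # clamp 1.0 into the last bin
        bucket = min(int(r["confidence"] * bins), bins - 1)
        hist[bucket] += 1
    return hist

# hypothetical test-session records
results = [
    {"utterance": "hi there",     "confidence": 0.95},
    {"utterance": "book flight",  "confidence": 0.12},
    {"utterance": "cancel order", "confidence": 0.15},
]
hist = confidence_histogram(results)
# bins 0-1 (the left-most piles) hold the poor performers
```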
3rd Glance: Top 10 Intent Confidence Risks
The radar chart tells us we have several poor-performing intents - the average confidence score is rather low. Click on one of the intents in the chart to see why it performs so poorly.
Issues with the CONFIDENCE of the test results will be visualized here.
You can now see a list of user examples and the (poor) confidence score returned by
the NLP engine.
Read here to learn more: Intent Confidence Risks
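The ranking behind the chart can be sketched as an average confidence score per intent, lowest first (a simplified illustration; names and data shapes are assumptions):

```python
from collections import defaultdict

def intent_confidence_risks(results, top_n=10):
    """Rank intents by average predicted confidence, lowest first (sketch)."""
    by_intent = defaultdict(list)
    for intent, confidence in results:
        by_intent[intent].append(confidence)
    averages = {intent: sum(c) / len(c) for intent, c in by_intent.items()}
    return sorted(averages.items(), key=lambda kv: kv[1])[:top_n]

results = [("greeting", 0.9), ("greeting", 0.8),
           ("refund", 0.3), ("refund", 0.4)]
risks = intent_confidence_risks(results)
# "refund" ranks first: its average confidence (0.35) is the lowest
```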
4th Glance: Confusion Matrix and Confidence Threshold Chart
A Confusion Matrix shows an overview of the predicted intent vs the expected intent. It answers questions like: When sending user example X, I expect the NLU to predict intent Y; what did it actually predict?
The expected intents are shown as rows, the predicted intents as columns. Each user example is sent to the NLU engine, and the cell value for the expected intent's row and the predicted intent's column is increased by 1. Whenever the predicted and expected intents match, a cell value on the diagonal is increased; these are our successful test cases. All cell values not on the diagonal are our failed test cases.
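The counting rule above can be sketched in a few lines (a minimal illustration, not Botium's implementation):

```python
from collections import defaultdict

def confusion_matrix(test_results):
    """matrix[expected][predicted] counts, built one test case at a time."""
    matrix = defaultdict(lambda: defaultdict(int))
    for expected, predicted in test_results:
        matrix[expected][predicted] += 1
    return matrix

results = [
    ("greeting", "greeting"),  # on the diagonal: successful test case
    ("greeting", "goodbye"),   # off the diagonal: failed test case
    ("goodbye",  "goodbye"),
]
matrix = confusion_matrix(results)
```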
The most used statistical measures of NLU performance are precision and recall:
- The question answered by the precision score is: How many predictions of an intent are correct?
- The question answered by the recall rate is: How many intents are correctly predicted?
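In terms of the matrix, precision for an intent divides the diagonal cell by its column total, and recall divides it by its row total. A minimal sketch (the matrix layout is the rows-expected / columns-predicted convention described above):

```python
def precision(matrix, intent):
    # of all predictions of `intent` (its column), how many were correct?
    predicted = sum(row.get(intent, 0) for row in matrix.values())
    return matrix.get(intent, {}).get(intent, 0) / predicted if predicted else 0.0

def recall(matrix, intent):
    # of all user examples expected to be `intent` (its row),
    # how many were predicted correctly?
    expected = sum(matrix.get(intent, {}).values())
    return matrix.get(intent, {}).get(intent, 0) / expected if expected else 0.0

matrix = {
    "greeting": {"greeting": 8, "goodbye": 2},
    "goodbye":  {"greeting": 1, "goodbye": 9},
}
# precision("greeting") = 8 / 9, recall("greeting") = 8 / 10
```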
Read here to learn more: Confusion Matrix
The Confidence Threshold is the lowest accepted confidence score. If the NLP engine is not confident enough when classifying an intent (its confidence score is below the confidence threshold), it answers with the incomprehension intent to signal that it doesn’t understand. This chart helps in finding the best confidence threshold for your use case: it visualizes the balance between the precision and recall scores, and depending on your use case, one or the other may have priority.
Read here to learn more: Confidence Threshold
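The threshold mechanism itself is a simple cutoff. A sketch of the idea (the fallback intent name is an assumption; engines label it differently):

```python
def apply_threshold(predicted_intent, confidence, threshold=0.7):
    """Reject predictions below the confidence threshold (sketch);
    the "incomprehension" label is a placeholder, not a fixed API value."""
    return predicted_intent if confidence >= threshold else "incomprehension"

# raising the threshold rejects more low-confidence predictions:
# precision tends to rise while recall falls
assert apply_threshold("greeting", 0.95) == "greeting"
assert apply_threshold("greeting", 0.40) == "incomprehension"
```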
5th Glance: The Botium Suggestions
Botium will detect any issues with the test results and suggest actions to improve the overall NLU performance. It will tell you which intents require more training data, and whether any test data is not suitable for performing NLU tests.
Read here to learn more: Botium Suggestions
What else?
- Intent Mismatch Probability Risks: This section shows some charts visualizing the risk that some intents will be mismatched - meaning that the NLU engine predicts the correct intent, but with a confidence score very close to another one. In real life, a chatbot in this situation often responds with something like I am not sure what you mean - do you mean X or Y? (In IBM Watson, this is called disambiguation.)
- Intent Confidence Deviation Risks: The confidence deviation is a measure for the bandwidth of the predicted confidence score for all the user examples of an intent. It is calculated as standard deviation of the confidence scores.
- Utterance Distribution: The histograms show the number of utterances per predicted intent.
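The first two risk measures above are easy to compute. A minimal sketch (function names and the 0.05 gap are illustrative assumptions, not Botium's values): a prediction is a mismatch risk when the top two confidence scores are nearly equal, and the deviation risk is the standard deviation of one intent's confidence scores.

```python
from statistics import pstdev

def mismatch_risk(intent_scores, gap=0.05):
    """Flag a prediction whose top two confidence scores are nearly equal."""
    ranked = sorted(intent_scores.values(), reverse=True)
    return (ranked[0] - ranked[1]) < gap

def confidence_deviation(confidences):
    """Standard deviation of the confidence scores for one intent's examples."""
    return pstdev(confidences)

# the correct intent wins, but only barely: a disambiguation candidate
risky = mismatch_risk({"order_pizza": 0.81, "order_pasta": 0.79, "greeting": 0.02})

stable   = confidence_deviation([0.80, 0.82, 0.81, 0.79])
unstable = confidence_deviation([0.95, 0.30, 0.85, 0.15])
# a large deviation flags an intent whose user examples score very unevenly
```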
Downloads
- user examples
- predicted intent and confidence score
- extracted entities and confidence score
- expected intent and entities