The cure? - Is Automated Reasoning the solution to treat hallucination?
AWS has released Automated Reasoning checks for Bedrock as GA. What sounds like a small feature could be the actual cure for AI hallucinations. But it turns out: it is complicated! Let me show you what I mean.
Automated reasoning
As Danilo Poccia says in his blog post:
“This approach is fundamentally different from probabilistic reasoning methods which deal with uncertainty by assigning probabilities to outcomes. In fact, Automated Reasoning checks delivers up to 99% verification accuracy, providing provable assurance in detecting AI hallucinations while also assisting with ambiguity detection when the output of a model is open to more than one interpretation.”
In a nutshell, we create logical rules and test the LLM output against those rules. This is far more accurate than probabilistic reasoning methods.
To dive deeper, I have created a repository and a fictional company that sells insurance against mythical creatures.
How to create an Automated Reasoning Guardrail?
In general, we have these steps:
- Create an Automated Reasoning policy
- Let AWS create the rules from an uploaded rules.pdf
- Attach the Automated Reasoning policy to a Guardrail
- Use the Guardrail
So it looks a little bit like “it’s magic and it just works”. Let’s dig deeper.
I have prepared a GitHub repository to work through this example: https://github.com/megaproaktiv/complicated. The README shows you how to create an Automated Reasoning guardrail.
But first, meet our playground company: the ZAI.

The ZAI Insurance company
You can buy insurance to protect against mythical creatures, but there are rules. So when a salesperson talks to you, these rules have to be obeyed. Otherwise… you know.
The Rules
- The Vampire-Zombie Incompatibility Clause: Vampires and zombies cannot coexist in the same coverage area.
- The Alien-Virus Contamination Protocol: No more than 2 alien encounters per policy term when virus protection is active.
- The Triple Threat Exclusion: No single property can experience zombie, vampire, AND alien incidents within a 30-day period.
- The Vampire Daylight Dependency Rule: Vampire protection coverage is limited to 4 vampires maximum per household.
- The Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.
If we create a knowledge base or a chatbot with an LLM, we can use these rules to verify that the answers are correct and the rules are applied correctly.
Full rule set in repository: rules
So how do we use these rules for Automated Reasoning? You find the service in the Build section of Bedrock in the AWS Console.
The Preparation
From the imported PDF file, Bedrock creates rules, variables, and custom types. The logical rules you see at the bottom right are a logical representation of the text rules.
Rule example for the “Vampire-Zombie Incompatibility Clause”:
Language: “Vampires and zombies cannot coexist in the same coverage area.”
Rule: if vampirePresent is true and zombiePresent is true, then isCohabitationAllowed is false
Variable example
| Variable | Description |
|---|---|
| isCohabitationAllowed | Indicates whether the cohabitation of different creatures is allowed under the insurance policy |
| alienPresent | Indicates whether aliens are present in the coverage area |
Custom type example
| Type | Description | Values |
|---|---|---|
| CreatureType | The type of mythical creature covered by the insurance policy | VAMPIRE, ZOMBIE, ALIEN, CREATURE_TYPE_OTHER |
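Just to make the logical representation tangible: expressed in plain Go, the custom type and the Vampire-Zombie rule look roughly like this. This is only my own illustration of the logic, not something Bedrock generates.

```go
package main

import "fmt"

// CreatureType mirrors the custom type Bedrock derived from the PDF.
type CreatureType string

const (
	Vampire           CreatureType = "VAMPIRE"
	Zombie            CreatureType = "ZOMBIE"
	Alien             CreatureType = "ALIEN"
	CreatureTypeOther CreatureType = "CREATURE_TYPE_OTHER"
)

// isCohabitationAllowed expresses the Vampire-Zombie Incompatibility Clause:
// if vampirePresent is true and zombiePresent is true, cohabitation is not allowed.
func isCohabitationAllowed(vampirePresent, zombiePresent bool) bool {
	return !(vampirePresent && zombiePresent)
}

func main() {
	fmt.Println(isCohabitationAllowed(true, true))  // false: rule violated
	fmt.Println(isCohabitationAllowed(true, false)) // true: only vampires, fine
}
```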
Test
See the doc for a lean documentation of the tests.
To update your rules, you can test them in the console: you make a statement and declare it to be valid or invalid.
An example:
- Question: The client plans to hold 51 zombies. Is this covered by basic insurance?
- Answer: This is not covered by basic insurance
- Claim: valid
or
- Question: The client plans to hold 51 zombies with a basic insurance policy. Is this covered by basic insurance?
- Answer: Yes this is covered with basic insurance.
- Claim: invalid
After you have tested the policy, you can invoke it. To dig a little deeper, we look at the whole response.
Apply Guardrail
```go
parms := bedrockruntime.ApplyGuardrailInput{
	Content: []types.GuardrailContentBlock{
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &userQuery,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierQuery,
				},
			},
		},
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &llmAnswer,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierGuardContent,
				},
			},
		},
	},
	GuardrailIdentifier: &id,
	GuardrailVersion:    &version,
	// Checking the INPUT of the user query
	// or the OUTPUT of the LLM
	Source: types.GuardrailContentSourceOutput,
}
result, err := client.ApplyGuardrail(ctx, &parms)
```
The API/SDK documentation lacks a bit of detail, but with the bedrock samples I figured it out. You need these parameters:
- The question from the user, qualified as GuardrailContentQualifierQuery
- The answer from the LLM, qualified as GuardrailContentQualifierGuardContent
- Whether we check the input or the output: GuardrailContentSourceOutput
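To look at the whole response, you can print the action and dump the assessments as JSON. A minimal sketch that continues the ApplyGuardrail snippet above (add encoding/json, fmt and log to your imports):

```go
if err != nil {
	log.Fatalf("apply guardrail failed: %v", err)
}

// Top-level verdict of the guardrail: NONE or GUARDRAIL_INTERVENED.
fmt.Println("Action:", result.Action)

// The assessments contain the Automated Reasoning findings
// (VALID, INVALID, SATISFIABLE, ...) including the supporting rules.
pretty, jsonErr := json.MarshalIndent(result.Assessments, "", "  ")
if jsonErr != nil {
	log.Fatal(jsonErr)
}
fmt.Println(string(pretty))
```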
The complicated answer
Now the response gives you findings, which can be (information taken from aws-samples):
- VALID: Question/Answer fully aligns with policy rules
- SATISFIABLE: Recognizes responses that could be true or false depending on specific assumptions
- INVALID: Question/Answer does not align with policy rules
- IMPOSSIBLE: Indicates when no valid claims can be generated due to logical contradictions
- NO_TRANSLATIONS: Occurs when content cannot be translated into relevant data for policy evaluation
- TRANSLATION_AMBIGUOUS: Identifies when ambiguity prevents definitive translation into logical structures
- TOO_COMPLEX: Signals that the content contains too much information to process within limits
This means that only the first three (VALID, SATISFIABLE, INVALID) help you decide whether the answer conforms to the rules or not.
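For a chatbot, you will want to reduce these findings to a simple decision. Here is a hedged sketch of how that could look, working on the finding type as a plain string; the function and the outcome names are my own, not part of the SDK:

```go
package main

import "fmt"

// Decision is a simplified verdict derived from an Automated Reasoning finding.
type Decision string

const (
	AnswerOK         Decision = "answer conforms to the rules"
	AnswerWrong      Decision = "answer contradicts the rules"
	CheckAssumptions Decision = "answer depends on assumptions - inspect the finding details"
	NeedsRephrase    Decision = "content could not be translated - rephrase and try again"
)

// decide maps the finding type of an Automated Reasoning check to a decision.
// Only VALID, SATISFIABLE and INVALID carry a verdict about the rules;
// everything else means the text could not be checked at all.
func decide(findingType string) Decision {
	switch findingType {
	case "VALID":
		return AnswerOK
	case "INVALID":
		return AnswerWrong
	case "SATISFIABLE":
		return CheckAssumptions
	default: // IMPOSSIBLE, NO_TRANSLATIONS, TRANSLATION_AMBIGUOUS, TOO_COMPLEX
		return NeedsRephrase
	}
}

func main() {
	fmt.Println(decide("VALID"))
	fmt.Println(decide("TRANSLATION_AMBIGUOUS"))
}
```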
Test run
Insanity
“Doing the same thing over and over again and expecting different results is insanity.” Or it is built with GenAI :).
I have tested the rules with these questions: questions. And I got different results!
Look only at the first four results:
Good run
task: [run-short] dist/checker --short
Question 1: The client plans to hold 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: VALID
Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID
Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID
Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.
4: IMPOSSIBLE
(Sorry for the inconsistent numbering: result 1 belongs to question 0.)
The full JSON response is in response-good.
Bad run with the very same data
Question 1: The client plans to held 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: TRANSLATION_AMBIGUOUS
Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID
Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID
Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.
Good, Bad?
It seems that we cannot escape the randomness of an LLM. But when we get results, they look good. Let’s go deeper into the first question/answer.
Question 1: The client plans to hold 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: VALID
This aligns with Rule 5, the Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.
In the response, we find the supporting rules:
"SupportingRules": [
{
"Identifier": "P17QQUUAMUIA",
"PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
},
{
"Identifier": "VA19YUV41M66",
"PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
}
],
| Rule | Text |
|---|---|
| P17QQUUAMUIA | if zombieCount is greater than 50, then isZombieApocalypseEvent is true |
| VA19YUV41M66 | if isZombieApocalypseEvent is true, then isPropertyProtected is false |
The first rule comes straight from the text “…it becomes classified as a ‘Zombie Apocalypse Event’ rather than a standard insurance claim…” in rule 5.
So this is a great example: the logical check works if the LLM understands the data.
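If you save the full response to a file, a few lines of Go are enough to pull out the supporting rule identifiers and look them up in the console. A small sketch that works on the JSON fragment shown above; the struct only models the fields needed here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// The JSON fragment from the guardrail response shown above.
const fragment = `{
  "SupportingRules": [
    {"Identifier": "P17QQUUAMUIA", "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"},
    {"Identifier": "VA19YUV41M66", "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"}
  ]
}`

type finding struct {
	SupportingRules []struct {
		Identifier       string
		PolicyVersionArn string
	}
}

func main() {
	var f finding
	if err := json.Unmarshal([]byte(fragment), &f); err != nil {
		log.Fatal(err)
	}
	for _, r := range f.SupportingRules {
		fmt.Println("supporting rule:", r.Identifier)
	}
}
```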
Tuning the text
Here is an example of how to work your way from “IMPOSSIBLE” to “SATISFIABLE”:
Question 5: I have some vampires and some zombies colocated. The customer has a basic policy. Is this covered with a basic policy?
Answer 5: Yes, it is covered with a basic policy.
5: IMPOSSIBLE
Question 6: The customer has 5 vampires and 10 zombies colocated. The customer has a basic policy. Is this covered with a basic policy?
Answer 6: Yes, it is covered with a basic policy for the customer.
6: SATISFIABLE
So going from “some” to the concrete numbers 5 and 10 seems to make the statement translatable into logic.
What I have learned
- Guardrails do not work with streaming, as the full answer has to be available to be checked (see the sketch after this list)
- Automated Reasoning does not always give the same results with the same parameters
- The translation from natural text to rules works quite well, but is often not possible for imprecise language. So: precision in, quality out.
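Because of the streaming limitation in the first bullet point, one pragmatic pattern is to buffer the streamed answer and apply the guardrail afterwards. A rough sketch with the Converse stream API; the model ID and prompt in main are only examples and not taken from the repository:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime/types"
)

// streamAndCheck collects the streamed model answer into one string so that
// it can be passed to ApplyGuardrail afterwards (guardrails need the full text).
func streamAndCheck(ctx context.Context, client *bedrockruntime.Client, modelID, userQuery string) (string, error) {
	out, err := client.ConverseStream(ctx, &bedrockruntime.ConverseStreamInput{
		ModelId: &modelID,
		Messages: []types.Message{{
			Role:    types.ConversationRoleUser,
			Content: []types.ContentBlock{&types.ContentBlockMemberText{Value: userQuery}},
		}},
	})
	if err != nil {
		return "", err
	}
	stream := out.GetStream()
	defer stream.Close()

	var answer strings.Builder
	for event := range stream.Events() {
		// Only the text deltas are interesting here.
		if delta, ok := event.(*types.ConverseStreamOutputMemberContentBlockDelta); ok {
			if text, ok := delta.Value.Delta.(*types.ContentBlockDeltaMemberText); ok {
				answer.WriteString(text.Value)
			}
		}
	}
	if err := stream.Err(); err != nil {
		return "", err
	}

	// The complete answer can now be checked with ApplyGuardrail,
	// exactly like in the snippet above (userQuery as QUERY, answer as GUARD_CONTENT).
	return answer.String(), nil
}

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := bedrockruntime.NewFromConfig(cfg)

	// Example model ID - use one that is available in your region/account.
	answer, err := streamAndCheck(ctx, client, "anthropic.claude-3-5-sonnet-20240620-v1:0", "Can I insure 51 zombies?")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(answer)
}
```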
What’s next?
Feel free to create your own Guardrails with the repository to understand Automated Reasoning better.
In areas where the correctness of the answer matters, Guardrails can be used to verify that the answer follows your rules.
If you need developers and consulting to support your decision in your next GenAI project, don’t hesitate to contact us, tecRacer.
Want to learn GO on AWS? GO here
Enjoy building!