The cure? - Is Automated Reasoning the solution to treat hallucinations?



AWS has released Automated Reasoning checks for Bedrock as GA. What sounds small could be the actual drug against AI hallucinations. But it turns out: it is complicated! Let me show you what I mean.

Automated reasoning

As Danilo Poccia says in his blog post:

“This approach is fundamentally different from probabilistic reasoning methods which deal with uncertainty by assigning probabilities to outcomes. In fact, Automated Reasoning checks delivers up to 99% verification accuracy, providing provable assurance in detecting AI hallucinations while also assisting with ambiguity detection when the output of a model is open to more than one interpretation.”

In a nutshell, we create logical rules and test the LLM output against those rules. This is far more accurate than probabilistic reasoning methods.

To dive deeper, I have created a repository and a fictional company which sells insurance against mythical creatures.

How to create an Automated Reasoning Guardrail?

In general we have these steps:

  1. Create an Automated Reasoning policy
  2. Let AWS create rules from an uploaded rules.pdf
  3. Attach the Automated Reasoning policy to a Guardrail
  4. Use the Guardrail

So it looks a little bit like “it’s magic and it just works”. Let’s dig deeper.

I have prepared a GitHub repository to work through this example: https://github.com/megaproaktiv/complicated. The readme shows you how to create an automated reasoning guardrail.

But first, meet our playground company: The ZAI


The ZAI Insurance company

You can buy insurance to protect against mythical creatures, but there are rules. So if a salesman talks to you, these rules have to be obeyed. Otherwise… you know.

The Rules

  1. The Vampire-Zombie Incompatibility Clause: Vampires and zombies cannot coexist in the same coverage area.

  2. The Alien-Virus Contamination Protocol: No more than 2 alien encounters per policy term when virus protection is active.

  3. The Triple Threat Exclusion: No single property can experience zombie, vampire, AND alien incidents within a 30-day period.

  4. The Vampire Daylight Dependency Rule: Vampire protection coverage is limited to 4 vampires maximum per household.

  5. The Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.

If we create a knowledge base or a chatbot with an LLM, we can use these rules to verify that the answers are correct and the rules are applied correctly.

Full rule set in repository: rules

So how do we use these rules for Automated Reasoning? You find the service in the Build section of Bedrock in the AWS Console.

The Preparation


With the imported PDF file, Bedrock creates rules, variables and custom types. The logical rules you see at the bottom right are a logical representation of the text rules.


Rule example for the “Vampire-Zombie Incompatibility Clause”

Language: “Vampires and zombies cannot coexist in the same coverage area.”

Rule: if vampirePresent is true and zombiePresent is true, then isCohabitationAllowed is false
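
To make this logical form tangible, here is the same rule written as plain Go code. This is only an illustration of the logic, not what Bedrock generates internally:

// Sketch: the Vampire-Zombie Incompatibility Clause as plain Go logic.
// Cohabitation is only allowed if not both creatures are present.
func isCohabitationAllowed(vampirePresent, zombiePresent bool) bool {
	return !(vampirePresent && zombiePresent)
}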

Variable example

  • isCohabitationAllowed: Indicates whether the cohabitation of different creatures is allowed under the insurance policy
  • alienPresent: Indicates whether aliens are present in the coverage area

Custom Variable Example

  • CreatureType: The type of mythical creature covered by the insurance policy. Possible values: VAMPIRE, ZOMBIE, ALIEN, CREATURE_TYPE_OTHER
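
Such a custom type roughly corresponds to an enumeration. A Go sketch (illustration only, not the Bedrock-internal representation) could look like this:

// Sketch: the CreatureType custom type as a Go enumeration.
type CreatureType string

const (
	Vampire           CreatureType = "VAMPIRE"
	Zombie            CreatureType = "ZOMBIE"
	Alien             CreatureType = "ALIEN"
	CreatureTypeOther CreatureType = "CREATURE_TYPE_OTHER"
)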

Test

See doc for a lean documentation of the tests.

To update your rules, you can test them in the console. You make a statement and declare it to be valid or invalid.

An example:

  • Question: The client plans to hold 51 zombies. Is this covered by basic insurance?
  • Answer: This is not covered by basic insurance
  • Claim: valid

or

  • Question: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
  • Answer: Yes this is covered with basic insurance.
  • Claim: invalid
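
If you want to replay these console tests in code, the question/answer/claim triples can also be kept as data. A Go sketch (the struct and field names are my own, not part of the repository):

// Sketch: console test cases as data, so they can be replayed in code.
type TestCase struct {
	Question string
	Answer   string
	Valid    bool // the claim: is the answer valid according to the rules?
}

var cases = []TestCase{
	{
		Question: "The client plans to hold 51 zombies. Is this covered by basic insurance?",
		Answer:   "This is not covered by basic insurance",
		Valid:    true,
	},
	{
		Question: "The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?",
		Answer:   "Yes this is covered with basic insurance.",
		Valid:    false,
	},
}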

After you have tested the policy, you can invoke it. To dig a little deeper, we look at the whole response.

Apply Guardrail

import (
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime/types"
)

// client is a *bedrockruntime.Client, userQuery and llmAnswer hold the
// text to check, id and version identify the Guardrail.
parms := bedrockruntime.ApplyGuardrailInput{
	Content: []types.GuardrailContentBlock{
		// The question from the user, marked as the query
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &userQuery,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierQuery,
				},
			},
		},
		// The answer from the LLM, marked as the content to guard
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &llmAnswer,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierGuardContent,
				},
			},
		},
	},
	GuardrailIdentifier: &id,
	GuardrailVersion:    &version,
	// Checking the INPUT of the user query
	// or the OUTPUT of the LLM
	Source: types.GuardrailContentSourceOutput,
}
result, err := client.ApplyGuardrail(ctx, &parms)

The API/SDK documentation lacks a bit of detail, but with the bedrock samples I figured out:

You need these parameters:

  • The question from the user, which is GuardrailContentQualifierQuery
  • The answer from the LLM, which is GuardrailContentQualifierGuardContent
  • Whether we check the input or the output: GuardrailContentSourceOutput
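
If you want to check the user input before it even reaches the LLM, the same request works with the source set to INPUT. A minimal sketch, assuming the parms variable from above:

// Sketch: validate the user input instead of the LLM output.
parms.Source = types.GuardrailContentSourceInput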

The complicated answer

Now the response gives you findings, which can be (information taken from aws-samples):

  1. VALID: Question/Answer fully aligns with policy rules
  2. SATISFIABLE: Recognizes responses that could be true or false depending on specific assumptions
  3. INVALID: Question/Answer does not align with policy rules
  4. IMPOSSIBLE: Indicates when no valid claims can be generated due to logical contradictions
  5. NO_TRANSLATIONS: Occurs when content cannot be translated into relevant data for policy evaluation
  6. TRANSLATION_AMBIGUOUS: Identifies when ambiguity prevents definitive translation into logical structures
  7. TOO_COMPLEX: Signals that the content contains too much information to process within limits

Which means that only 1, 2 and 3 help you decide whether the answer is valid with respect to the rules or not.
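
To handle this in code, it helps to collapse the finding types into a small decision. A sketch in Go, with decision labels I made up for illustration:

// Sketch: map an Automated Reasoning finding type to a simple decision.
// Only VALID, SATISFIABLE and INVALID say something about the answer;
// the remaining types mean that no logical check was possible.
func decide(findingType string) string {
	switch findingType {
	case "VALID":
		return "pass"
	case "INVALID":
		return "block"
	case "SATISFIABLE":
		return "review" // true or false depending on assumptions
	default: // IMPOSSIBLE, NO_TRANSLATIONS, TRANSLATION_AMBIGUOUS, TOO_COMPLEX
		return "no-verdict"
	}
}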

Testrun

Insanity

“Doing the same thing over and over again and expecting different results is insane.” Or it's built with GenAI :).

I have tested the rules with these questions: questions and got different results!

Let's look at only the first 4 results:

Good run

task: [run-short] dist/checker --short
Question 1: The client plans to hold 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: VALID

Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID

Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID

Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.
4: IMPOSSIBLE

(Sorry for the inconsistent numbering: filter 1 belongs to question 0.)

The full JSON response is in response-good.

Bad run with the very same data

Question 1: The client plans to held 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: TRANSLATION_AMBIGUOUS

Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID

Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID

Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.

Good, Bad?

It seems that we cannot escape the randomness of an LLM. But when we get results, it looks good. Let's go deeper into one of the question/answer pairs.

Question 2: The client plans to held 51 zombies. Is this covered by basic insurance?
Answer 2: This is not covered by basic insurance
2: VALID

This aligns with Rule 5, the Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.

In the response, we find the supporting rules:

"SupportingRules": [
      {
        "Identifier": "P17QQUUAMUIA",
        "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
      },
      {
            "Identifier": "VA19YUV41M66",
        "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
      }
    ],
The rule texts behind these identifiers:

  • P17QQUUAMUIA: if zombieCount is greater than 50, then isZombieApocalypseEvent is true
  • VA19YUV41M66: if isZombieApocalypseEvent is true, then isPropertyProtected is false

The first rule comes straight from the text “…it becomes classified as a “Zombie Apocalypse Event” rather than a standard insurance claim…” in rule 5.

So this is a great example that the logical check works, if the LLM understands the data.
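
If you store the raw JSON response like response-good, the supporting rule identifiers can be pulled out with a couple of small structs. A sketch based on the excerpt above (the full response has more surrounding structure):

import "encoding/json"

// Sketch: extract the supporting rule identifiers from a finding,
// based on the JSON excerpt shown above.
type SupportingRule struct {
	Identifier       string `json:"Identifier"`
	PolicyVersionArn string `json:"PolicyVersionArn"`
}

type Finding struct {
	SupportingRules []SupportingRule `json:"SupportingRules"`
}

func ruleIDs(raw []byte) ([]string, error) {
	var f Finding
	if err := json.Unmarshal(raw, &f); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(f.SupportingRules))
	for _, r := range f.SupportingRules {
		ids = append(ids, r.Identifier)
	}
	return ids, nil
}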

Tuning the text

Here is an example of how to work your way from “IMPOSSIBLE” to “SATISFIABLE”:

Question 5: I have some vampires and some zombies colocated.The customer has a basic policy. Is this covered with a basic policy?
Answer 5: Yes, it is covered with a basic policy.
5: IMPOSSIBLE

Question 6: The customer has 5 vampires and 10 zombies colocated.The customer has a basic policy. Is this covered with a basic policy?
Answer 6: Yes, it is covered with a basic policy for the customer.
6: SATISFIABLE

So going from “some” to the concrete numbers 5 and 10 seems to be more logical.

What have I learned

  • Guardrails do not work with streaming, as the full answer has to be there before it can be checked
  • Automated Reasoning does not always give the same results with the same parameters; see the sketch after this list
  • The translation from natural text to rules works quite well, but is often not possible for imprecise language. So: precision in, quality out.
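
Because of that non-determinism, it can make sense to retry the check a few times until one of the definitive finding types comes back. A sketch, assuming a helper findingType that extracts the finding from the ApplyGuardrail result (hypothetical, not an SDK function):

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
)

// Sketch: retry the guardrail check until a definitive finding appears.
// findingType() is a hypothetical helper that extracts the finding
// (VALID, INVALID, ...) from the ApplyGuardrail result.
func checkWithRetry(ctx context.Context, client *bedrockruntime.Client,
	parms *bedrockruntime.ApplyGuardrailInput, maxTries int) (string, error) {
	for i := 0; i < maxTries; i++ {
		result, err := client.ApplyGuardrail(ctx, parms)
		if err != nil {
			return "", err
		}
		ft := findingType(result)
		if ft == "VALID" || ft == "INVALID" || ft == "SATISFIABLE" {
			return ft, nil
		}
	}
	return "no-verdict", nil
}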

What’s next?

Feel free to create your own Guardrails with the repository to understand automated reasoning better.

In areas where the correctness of the answer is important, Guardrails with Automated Reasoning checks can be used to verify that the answer follows your rules.

If you need developers and consulting to support your decision in your next GenAI project, don’t hesitate to contact us, tecRacer.

Want to learn GO on AWS? GO here

Enjoy building!
