The cure? - Is Automated Reasoning the solution to treat hallucinations?



AWS has released Automated Reasoning checks for Bedrock as GA. What sounds small could be the actual drug against AI hallucinations. But it turns out: it is complicated! Let me show you what I mean.

Automated reasoning

As Danilo Poccia says in his blog post:

“This approach is fundamentally different from probabilistic reasoning methods which deal with uncertainty by assigning probabilities to outcomes. In fact, Automated Reasoning checks delivers up to 99% verification accuracy, providing provable assurance in detecting AI hallucinations while also assisting with ambiguity detection when the output of a model is open to more than one interpretation.”

In a nutshell, we create logical rules and test the LLM output against those rules. This is far more accurate than probabilistic reasoning methods.

To dive deeper, I have created a repository and a fictional company which sells insurance against mythical creatures.

How to create an Automated Reasoning Guardrail?

In general we have these steps:

  1. Create an Automated Reasoning policy
  2. Let AWS create rules from an uploaded rules.pdf
  3. Attach the Automated Reasoning policy to a Guardrail
  4. Use the Guardrail

So it looks a little bit like “it’s magic and it just works”. Let’s dig deeper.

I have prepared a GitHub repository to work through this example: https://github.com/megaproaktiv/complicated. The readme shows you how to create an automated reasoning guardrail.

But first, meet our playground company: The ZAI


The ZAI Insurance company

You can buy insurance to protect against mythical creatures, but there are rules. So if a salesman talks to you, these rules have to be obeyed. Otherwise… you know.

The Rules

  1. The Vampire-Zombie Incompatibility Clause: Vampires and zombies cannot coexist in the same coverage area.

  2. The Alien-Virus Contamination Protocol: No more than 2 alien encounters per policy term when virus protection is active.

  3. The Triple Threat Exclusion: No single property can experience zombie, vampire, AND alien incidents within a 30-day period.

  4. The Vampire Daylight Dependency Rule: Vampire protection coverage is limited to 4 vampires maximum per household.

  5. The Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.

If we create a knowledge base or a chatbot with an LLM, we can use these rules to verify that the answers are correct and the rules are applied correctly.

Full rule set in repository: rules

So how do we use these rules for Automated Reasoning? You find the service in the Build section of Bedrock in the AWS Console.

The Preparation


With the imported PDF file, Bedrock creates rules, variables and custom types. The logical rules you see at the bottom right are a logical representation of the text rules.


Rule example for the “Vampire-Zombie Incompatibility Clause”

Language: “Vampires and zombies cannot coexist in the same coverage area.”

Rule: if vampirePresent is true and zombiePresent is true, then isCohabitationAllowed is false
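
To make this logical form tangible, here is the same rule written as plain Go code. This is only an illustration of the logic, not what Bedrock generates internally:

// Sketch: the Vampire-Zombie Incompatibility Clause as plain Go logic.
// Cohabitation is only allowed if not both creatures are present.
func isCohabitationAllowed(vampirePresent, zombiePresent bool) bool {
	return !(vampirePresent && zombiePresent)
}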

Variable example

  • isCohabitationAllowed: Indicates whether the cohabitation of different creatures is allowed under the insurance policy
  • alienPresent: Indicates whether aliens are present in the coverage area

Custom Variable Example

  • CreatureType: The type of mythical creature covered by the insurance policy. Possible values: VAMPIRE, ZOMBIE, ALIEN, CREATURE_TYPE_OTHER
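
Such a custom type roughly corresponds to an enumeration. A Go sketch (illustration only, not the Bedrock-internal representation) could look like this:

// Sketch: the CreatureType custom type as a Go enumeration.
type CreatureType string

const (
	Vampire           CreatureType = "VAMPIRE"
	Zombie            CreatureType = "ZOMBIE"
	Alien             CreatureType = "ALIEN"
	CreatureTypeOther CreatureType = "CREATURE_TYPE_OTHER"
)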

Test

See doc for a lean documentation of the tests.

To update your rules, you can test them in the console. You make a statement and declare it to be valid or invalid.

An example:

  • Question: The client plans to hold 51 zombies. Is this covered by basic insurance?
  • Answer: This is not covered by basic insurance
  • Claim: valid

or

  • Question: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
  • Answer: Yes this is covered with basic insurance.
  • Claim: invalid
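
If you want to replay these console tests in code, the question/answer/claim triples can also be kept as data. A Go sketch (the struct and field names are my own, not part of the repository):

// Sketch: console test cases as data, so they can be replayed in code.
type TestCase struct {
	Question string
	Answer   string
	Valid    bool // the claim: is the answer valid according to the rules?
}

var cases = []TestCase{
	{
		Question: "The client plans to hold 51 zombies. Is this covered by basic insurance?",
		Answer:   "This is not covered by basic insurance",
		Valid:    true,
	},
	{
		Question: "The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?",
		Answer:   "Yes this is covered with basic insurance.",
		Valid:    false,
	},
}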

After you have tested the policy, you can invoke it. To dig a little deeper, we look at the whole response.

Apply Guardrail

import (
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime/types"
)

// client is a *bedrockruntime.Client, userQuery and llmAnswer hold the
// text to check, id and version identify the Guardrail.
parms := bedrockruntime.ApplyGuardrailInput{
	Content: []types.GuardrailContentBlock{
		// The question from the user, marked as the query
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &userQuery,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierQuery,
				},
			},
		},
		// The answer from the LLM, marked as the content to guard
		&types.GuardrailContentBlockMemberText{
			Value: types.GuardrailTextBlock{
				Text: &llmAnswer,
				Qualifiers: []types.GuardrailContentQualifier{
					types.GuardrailContentQualifierGuardContent,
				},
			},
		},
	},
	GuardrailIdentifier: &id,
	GuardrailVersion:    &version,
	// Checking the INPUT of the user query
	// or the OUTPUT of the LLM
	Source: types.GuardrailContentSourceOutput,
}
result, err := client.ApplyGuardrail(ctx, &parms)

The API/SDK documentation lacks a bit of detail, but with the bedrock samples I figured out:

You need these parameters:

  • The question from the user, which is GuardrailContentQualifierQuery
  • The answer from the LLM, which is GuardrailContentQualifierGuardContent
  • Whether we check the input or the output: GuardrailContentSourceOutput
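
If you want to check the user input before it even reaches the LLM, the same request works with the source set to INPUT. A minimal sketch, assuming the parms variable from above:

// Sketch: validate the user input instead of the LLM output.
parms.Source = types.GuardrailContentSourceInput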

The complicated answer

Now the response gives you findings, which can be (information taken from aws-samples):

  1. VALID: Question/Answer fully aligns with policy rules
  2. SATISFIABLE: Recognizes responses that could be true or false depending on specific assumptions
  3. INVALID: Question/Answer does not align with policy rules
  4. IMPOSSIBLE: Indicates when no valid claims can be generated due to logical contradictions
  5. NO_TRANSLATIONS: Occurs when content cannot be translated into relevant data for policy evaluation
  6. TRANSLATION_AMBIGUOUS: Identifies when ambiguity prevents definitive translation into logical structures
  7. TOO_COMPLEX: Signals that the content contains too much information to process within limits

Which means that only 1, 2 and 3 help you decide whether the answer is valid with respect to the rules or not.
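
To handle this in code, it helps to collapse the finding types into a small decision. A sketch in Go, with decision labels I made up for illustration:

// Sketch: map an Automated Reasoning finding type to a simple decision.
// Only VALID, SATISFIABLE and INVALID say something about the answer;
// the remaining types mean that no logical check was possible.
func decide(findingType string) string {
	switch findingType {
	case "VALID":
		return "pass"
	case "INVALID":
		return "block"
	case "SATISFIABLE":
		return "review" // true or false depending on assumptions
	default: // IMPOSSIBLE, NO_TRANSLATIONS, TRANSLATION_AMBIGUOUS, TOO_COMPLEX
		return "no-verdict"
	}
}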

Testrun

Insanity

“Doing the same thing over and over again and expecting different results is insane.” Or it's built with GenAI :).

I have tested the rules with these questions: questions and got different results!

Let's look at only the first 4 results:

Good run

task: [run-short] dist/checker --short
Question 1: The client plans to hold 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: VALID

Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID

Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID

Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.
4: IMPOSSIBLE

(Sorry for the inconsistent numbering: filter 1 belongs to question 0.)

The full JSON response is in response-good.

Bad run with the very same data

Question 1: The client plans to held 51 zombies. Is this covered by basic insurance?
Answer 1: This is not covered by basic insurance
1: TRANSLATION_AMBIGUOUS

Question 2: The client plans to held 51 zombies with basic insurance policy. Is this covered by basic insurance?
Answer 2: Yes this is covered with basic insurance.
2: INVALID

Question 3: I have some vampires and some zombies colocated. Is this allowed?
Answer 3: No it is not allowed.
3: VALID

Question 4: I have some vampires and some zombies colocated. Is this covered with a basic policy?
Answer 4: Yes, it is covered with a basic policy.

Good, Bad?

It seems that we cannot escape the randomness of an LLM. But when we get results, it looks good. Let's go deeper into one of the question/answer pairs.

Question 2: The client plans to held 51 zombies. Is this covered by basic insurance?
Answer 2: This is not covered by basic insurance
2: VALID

This aligns with Rule 5, the Zombie Horde Saturation Point: Zombie protection coverage caps at 50 zombies per incident.

In the response, we find the supporting rules:

"SupportingRules": [
      {
        "Identifier": "P17QQUUAMUIA",
        "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
      },
      {
            "Identifier": "VA19YUV41M66",
        "PolicyVersionArn": "arn:aws:bedrock:eu-central-1:795048271754:automated-reasoning-policy/b4mjr9xhwpj0:2"
      }
    ],
The rule texts behind these identifiers:

  • P17QQUUAMUIA: if zombieCount is greater than 50, then isZombieApocalypseEvent is true
  • VA19YUV41M66: if isZombieApocalypseEvent is true, then isPropertyProtected is false

The first rule comes straight from the text “…it becomes classified as a “Zombie Apocalypse Event” rather than a standard insurance claim…” in rule 5.

So this is a great example that the logical check works, if the LLM understands the data.
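
If you store the raw JSON response like response-good, the supporting rule identifiers can be pulled out with a couple of small structs. A sketch based on the excerpt above (the full response has more surrounding structure):

import "encoding/json"

// Sketch: extract the supporting rule identifiers from a finding,
// based on the JSON excerpt shown above.
type SupportingRule struct {
	Identifier       string `json:"Identifier"`
	PolicyVersionArn string `json:"PolicyVersionArn"`
}

type Finding struct {
	SupportingRules []SupportingRule `json:"SupportingRules"`
}

func ruleIDs(raw []byte) ([]string, error) {
	var f Finding
	if err := json.Unmarshal(raw, &f); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(f.SupportingRules))
	for _, r := range f.SupportingRules {
		ids = append(ids, r.Identifier)
	}
	return ids, nil
}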

Tuning the text

Here is an example of how to work your way from “IMPOSSIBLE” to “SATISFIABLE”:

Question 5: I have some vampires and some zombies colocated.The customer has a basic policy. Is this covered with a basic policy?
Answer 5: Yes, it is covered with a basic policy.
5: IMPOSSIBLE

Question 6: The customer has 5 vampires and 10 zombies colocated.The customer has a basic policy. Is this covered with a basic policy?
Answer 6: Yes, it is covered with a basic policy for the customer.
6: SATISFIABLE

So going from “some” to the concrete numbers 5 and 10 seems to be more logical.

What have I learned

  • Guardrails do not work with streaming, as the full answer has to be there before it can be checked
  • Automated Reasoning does not always give the same results with the same parameters; see the sketch after this list
  • The translation from natural text to rules works quite well, but is often not possible for imprecise language. So: precision in, quality out.
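
Because of that non-determinism, it can make sense to retry the check a few times until one of the definitive finding types comes back. A sketch, assuming a helper findingType that extracts the finding from the ApplyGuardrail result (hypothetical, not an SDK function):

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
)

// Sketch: retry the guardrail check until a definitive finding appears.
// findingType() is a hypothetical helper that extracts the finding
// (VALID, INVALID, ...) from the ApplyGuardrail result.
func checkWithRetry(ctx context.Context, client *bedrockruntime.Client,
	parms *bedrockruntime.ApplyGuardrailInput, maxTries int) (string, error) {
	for i := 0; i < maxTries; i++ {
		result, err := client.ApplyGuardrail(ctx, parms)
		if err != nil {
			return "", err
		}
		ft := findingType(result)
		if ft == "VALID" || ft == "INVALID" || ft == "SATISFIABLE" {
			return ft, nil
		}
	}
	return "no-verdict", nil
}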

What’s next?

Feel free to create your own Guardrails with the repository to understand automated reasoning better.

In areas where the correctness of the answer is important, Guardrails with Automated Reasoning checks can be used to verify that the answer follows your rules.

If you need developers and consulting to support your decision in your next GenAI project, don’t hesitate to contact us, tecRacer.

Want to learn GO on AWS? GO here

Enjoy building!
