AutoGen group chat failure detection

AutoGen group chats are the highest-failure-rate orchestration pattern in production. Two-agent loops, premature termination, and role bleed account for more than 60% of failed runs in the TRAIL benchmark. Pisama detects all three structurally, without an LLM judge.

The AutoGen adapter instruments `GroupChatManager.run()` and `ConversableAgent.initiate_chat()`. It hashes the state after each turn for loop detection and tracks speaker selection for the termination and coordination detectors.
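The instrumentation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Pisama's adapter: `ToyChat`, `instrument`, `state_hash`, and `record` are hypothetical names, and the stand-in class is used so that the example runs without AutoGen installed.

```python
import functools
import hashlib
import json


def state_hash(state: dict) -> str:
    """Stable digest of a JSON-serializable state snapshot."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def instrument(cls, method_name, on_turn):
    """Wrap cls.method_name so on_turn(result) runs after every call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        on_turn(result)
        return result

    setattr(cls, method_name, wrapper)


class ToyChat:
    """Hypothetical stand-in for an orchestrator; not an AutoGen class."""

    def __init__(self):
        self.turns = 0

    def step(self, speaker: str, text: str) -> dict:
        self.turns += 1
        return {"speaker": speaker, "text": text}


seen_hashes = []  # one digest per turn, input for loop detection
speakers = []     # speaker order, input for termination/coordination checks


def record(turn: dict) -> None:
    seen_hashes.append(state_hash(turn))
    speakers.append(turn["speaker"])


instrument(ToyChat, "step", record)

chat = ToyChat()
chat.step("planner", "split the task")
chat.step("coder", "done")
chat.step("planner", "split the task")  # same state as the first turn
```

Wrapping the method instead of subclassing keeps the calling code unchanged, which is presumably why a single `instrument_autogen()` call is enough in the install snippet.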

Detectors specific to AutoGen

  • Loop detection (F1 0.830): same state recurring across turns
  • Coordination failure (F1 0.746): speaker never addresses prior speaker
  • Persona drift (F1 0.794): agent ignores assigned role/expertise
  • Communication breakdown (F1 0.769): back-and-forth without progress
  • Information withholding (F1 0.867): agent withholds known answer

Install

# Shell: install the packages
pip install pisama pisama-auto

# Python: instrument AutoGen, then run the group chat as usual
from pisama.auto import instrument_autogen
from autogen import GroupChat, GroupChatManager

instrument_autogen()
manager = GroupChatManager(groupchat=GroupChat([...]))
manager.run(message="...")  # detectors run on every turn

FAQ

Does this work with AutoGen 0.4 (the rewrite)?
Yes. The adapter targets the `autogen-agentchat` and `autogen-core` packages. The legacy `pyautogen` package is also supported via the `pisama-auto` shim.

How do you detect a two-agent loop without an LLM?
Pisama hashes the (sender, receiver, content-fingerprint) tuple for each turn. If the same tuple recurs within a configurable window, the turn is flagged as a loop. Subsequence matching catches longer cycles such as A, B, C, A, B, C. No LLM call is needed.
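The two checks in this answer can be sketched as follows. This is an illustrative sketch under assumptions, not Pisama's implementation: `fingerprint`, `LoopDetector`, and `trailing_cycle` are hypothetical names, the content fingerprint is assumed to be a hash of case- and whitespace-normalized text, and the window size of 6 is an arbitrary default.

```python
import hashlib
from collections import deque


def fingerprint(content: str) -> str:
    """Digest of whitespace- and case-normalized content (assumed scheme)."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class LoopDetector:
    """Flags a turn whose (sender, receiver, fingerprint) key recurs."""

    def __init__(self, window: int = 6):
        # Only the last `window` turn keys are kept.
        self.recent = deque(maxlen=window)

    def observe(self, sender: str, receiver: str, content: str) -> bool:
        """Return True if this turn's key already occurred within the window."""
        key = (sender, receiver, fingerprint(content))
        repeated = key in self.recent
        self.recent.append(key)
        return repeated


def trailing_cycle(items, min_repeats: int = 2):
    """Return the shortest period p such that the last p items occur
    min_repeats times back to back at the end of items, else None."""
    n = len(items)
    for period in range(1, n // min_repeats + 1):
        tail = items[n - period:]
        if all(
            items[n - (k + 1) * period : n - k * period] == tail
            for k in range(min_repeats)
        ):
            return period
    return None


detector = LoopDetector(window=6)
first = detector.observe("planner", "coder", "Please retry the build")
second = detector.observe("coder", "planner", "Retrying now")
third = detector.observe("planner", "coder", "please  RETRY the build")

# Speaker order A, B, C, A, B, C repeats with period 3.
cycle = trailing_cycle(["A", "B", "C", "A", "B", "C"])
```

The windowed check is cheap and catches exact repeats. The trailing-cycle check runs on the sequence of turn keys or speakers and catches repetition that spans several turns.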

See the full detector taxonomy at /taxonomy, benchmark numbers at /benchmarks, or compare against other observability stacks at /vs.