We need:
Complex input: a multi-turn conversation with tool calls (maybe we can just use Kimi CLI to run it?)
Dump request logs to check: for bugs like #37, Kimi CLI "seems to work correctly", so the only way to validate is to dump all raw request logs. (We can add a rule-based or LLM-based correctness check on top.)
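
A rule-based check over the dumped requests could look roughly like the sketch below. It assumes each dumped request body carries an OpenAI-style `messages` list (`role`, `tool_calls[].id`, `tool_call_id`); that format, the function name `check_tool_call_pairing`, and the single rule it checks are illustrative assumptions, not what Kimi CLI actually logs or what #37 was about.

```python
def check_tool_call_pairing(messages):
    """Rule: every assistant tool call must be answered by a later tool message.

    `messages` is assumed to be an OpenAI-style chat message list taken from
    one dumped request body (hypothetical log format).
    Returns a list of human-readable problems; an empty list means the rule passed.
    """
    problems = []
    pending = {}  # tool call id -> index of the assistant message that issued it

    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                call_id = call.get("id")
                if not call_id:
                    problems.append(f"message {i}: tool call without an id")
                elif call_id in pending:
                    problems.append(f"message {i}: duplicate tool call id {call_id!r}")
                else:
                    pending[call_id] = i
        elif role == "tool":
            call_id = msg.get("tool_call_id")
            if call_id in pending:
                del pending[call_id]
            else:
                problems.append(f"message {i}: tool result for unknown id {call_id!r}")

    # Anything still pending was sent to the model without its tool result.
    for call_id, i in pending.items():
        problems.append(f"message {i}: tool call {call_id!r} never got a result")
    return problems
```

Running this over the `messages` of every dumped request would flag malformed conversations even when the CLI output looks fine; an LLM-based check could then be reserved for cases simple rules cannot express.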