Agent Evaluation Workbench
Benchmarking multi-step tool-use reliability with reproducible test suites.
Next.js, TypeScript, PostgreSQL, OpenAI API
A project portfolio of active investigations and implementation experiments.
Benchmarking multi-step tool-use reliability with reproducible test suites.
Comparing context window, vector retrieval, and hybrid memory pipelines.
Tracing task latency and failure modes across agent orchestration layers.
Designing a bilingual publishing workflow for research notes and technical essays.