SkillsBench's First-Week Data: Skills Boost AI Agent Performance by Up to 27%; Community Grows to Over 440 Members in Two Weeks
SkillsBench, the first benchmark for AI agent skills, has announced its initial results: skills significantly improve model performance, with Codex GPT-5.2 and Claude Code Opus 4.5 improving by 13% and 27%, respectively. In just two weeks the community has grown to over 440 members, and 52 real-world tasks have entered the pipeline.

SkillsBench is building the first benchmark designed specifically to measure the effectiveness of AI agent skills. The project evaluates not only the quality of the skills themselves but also how well agents use them.
First-week data shows that skills significantly enhance agent performance. With skills enabled, Codex GPT-5.2 improved from 0.645 to 0.729, a 13% relative increase; Claude Code Opus 4.5 improved even more, jumping from 0.395 to 0.500, a 27% relative increase.
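The percentages are relative gains over each agent's baseline score. As a quick sanity check, this minimal Python snippet reproduces them from the figures quoted above:

```python
# Relative improvement computed from the scores reported in the article.
scores = {
    "Codex GPT-5.2": (0.645, 0.729),
    "Claude Code Opus 4.5": (0.395, 0.500),
}

for agent, (baseline, with_skills) in scores.items():
    gain = (with_skills - baseline) / baseline
    print(f"{agent}: {baseline:.3f} -> {with_skills:.3f} (+{gain:.0%})")

# Output:
# Codex GPT-5.2: 0.645 -> 0.729 (+13%)
# Claude Code Opus 4.5: 0.395 -> 0.500 (+27%)
```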
The community's growth has exceeded expectations. In just two weeks, SkillsBench has gathered over 440 members, more than 120 of whom have registered as contributors; roughly 70% of those contributors hold doctorates or are doctoral candidates. So far, 8 tasks have been merged and 44 are in progress.
All tasks are written by humans and reflect real-world scenarios. The project team includes core authors of well-known projects such as ScreenSpot-Pro, MCP-Universe, and BigCodeBench.
SkillsBench founder Xiangyi Li highlighted the importance of the Harbor environment: 'Had we not used Harbor as our testing environment from day one, progress wouldn't have been this fast.' As a contributor to Harbor and Terminal-Bench, she also hopes to see more benchmarks built on Harbor.

In related developments, researchers have proposed an AI research engineering skills library. This open-source library contains 74 specialized skills covering 18 categories, including model architecture, fine-tuning, distributed training, and inference serving. Each skill provides expert-level guidance, real code examples, and production-ready workflows.
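The article does not show the library's skill format, but Claude Code's Agent Skills convention packages each skill as a directory containing a SKILL.md file: YAML frontmatter with a name and description, followed by markdown instructions. Below is a hypothetical sketch of one such skill; the vllm-inference name and its contents are illustrative, not taken from the library:

```python
# Hypothetical example of packaging one skill in the SKILL.md format
# that Claude Code reads. The skill name and body below are invented
# for illustration; the library's actual skills are not shown here.
from pathlib import Path

skill_dir = Path("skills/vllm-inference")  # hypothetical skill name
skill_dir.mkdir(parents=True, exist_ok=True)

(skill_dir / "SKILL.md").write_text(
    """---
name: vllm-inference
description: Serve a fine-tuned LLM with vLLM, covering engine
  configuration, batching, and OpenAI-compatible endpoints.
---

## When to use
Use this skill when deploying a trained checkpoint for inference.

## Workflow
1. Install vLLM and verify GPU availability.
2. Launch the server against the checkpoint.
3. Smoke-test the endpoint before load testing.
"""
)
```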
The design philosophy of the skills library is clear: enable coding agents to autonomously carry out every stage of an AI research experiment, from data preparation and model training to deployment and scientific hypothesis validation. Modern AI research demands mastery of dozens of specialized tools and frameworks, and researchers often spend more time debugging infrastructure than validating hypotheses, which slows the pace of scientific discovery.
Specific tools covered by the library include more than 20 LLM implementations from LitGPT, the Mamba state-space model, the RWKV architecture, the Axolotl fine-tuning framework, and vLLM inference serving. Installation is straightforward: individual skills can be installed directly via the Claude Code CLI.
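The article does not give the library's exact install command, so as a hedged sketch, here is a manual equivalent: Claude Code discovers personal skills placed under ~/.claude/skills/, so copying a skill directory there makes it available to the agent. The repository checkout path below is hypothetical:

```python
# Manual installation sketch: Claude Code looks for personal skills
# under ~/.claude/skills/. The library's CLI command is not given in
# the article, so this copies one skill directory by hand; the source
# path below is a hypothetical local checkout.
import shutil
from pathlib import Path

src = Path("ai-research-skills/skills/vllm-inference")  # hypothetical
dst = Path.home() / ".claude" / "skills" / src.name

dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(src, dst, dirs_exist_ok=True)
print(f"Installed skill to {dst}")
```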
SkillsBench is currently recruiting contributors ahead of the ICML and CAIS 2026 conferences. Participants who contribute 1-3 tasks will receive co-authorship, depending on task complexity. Contributions made after the ICML deadline will roll into future publications.
This skills-based approach is changing what AI agents can do. As more real-world tasks are added and the community grows, we may soon see AI agents make significant leaps on complex tasks.
Published: 2026-01-13 04:06