Results (Sortable)
Click the column headers to sort. The first header row indicates the task categories: Syntax, Semantic/Static, and Dynamic.
Tasks & Metrics
Nine tasks span the Syntax, Semantic/Static, and Dynamic dimensions, each evaluated with unified metrics and diagnosis rules; a sketch of the metric computations follows the list below.
- Pass rate: AST_passes / AST_cases (shown in Results as AST)
- Hit@5 / Hit@10 / Hit@20 (shown in Results as Expr@k)
- Pass rate: CFG_passes / CFG_cases (shown in Results as CFG)
- Pass rate: CG_passes / CG_cases (shown in Results as CG)
- F1 score (shown in Results as DP F1)
- F1 score (shown in Results as Taint F1)
- Accuracy (shown in Results as Pointer)
- Few‑shot success rate
- Zero‑shot success rate
- Summary accuracy
- Concept accuracy
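As a rough illustration of how these unified metrics are computed, the sketch below derives a pass rate, Hit@k, and F1 from per-case records. The record layout and field names are assumptions made for this example, not the repository's actual data format.

```python
# Minimal sketch of the metric formulas above; the per-case record layout
# ("passed", "rank", and the tp/fp/fn counts) is assumed for illustration
# and is not the repository's actual schema.

def pass_rate(cases):
    """Pass rate, e.g. AST_passes / AST_cases (likewise for CFG and CG)."""
    return sum(1 for c in cases if c["passed"]) / len(cases)

def hit_at_k(cases, k):
    """Hit@k: fraction of cases whose correct expression is ranked in the top k."""
    return sum(1 for c in cases if c["rank"] is not None and c["rank"] <= k) / len(cases)

def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives (DP F1, Taint F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy numbers only:
print(pass_rate([{"passed": True}, {"passed": False}, {"passed": True}]))  # ~0.667
print(hit_at_k([{"rank": 1}, {"rank": 7}, {"rank": None}], k=5))           # ~0.333
print(f1_score(tp=8, fp=2, fn=4))                                          # ~0.727
```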
Artifacts & Plots
Snapshot numbers are available via the JSON artifact and Results page. The repository includes scripts to regenerate figures and tables.
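For a quick look at the snapshot numbers without rerunning anything, the aggregated JSON can be read directly. The snippet below is a minimal sketch that assumes a {model: {metric: value}} layout, which may not match the artifact's actual schema.

```python
import json

# Hypothetical reader for the artifact written by evaluation/evaluate_multi_models.py;
# the {model: {metric: value}} layout is an assumption, not a documented schema.
with open("results/aggregated_summary.json") as f:
    summary = json.load(f)

for model, metrics in summary.items():
    if not isinstance(metrics, dict):
        continue  # skip any top-level metadata entries
    row = ", ".join(f"{name}={value}" for name, value in sorted(metrics.items()))
    print(f"{model}: {row}")
```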
Get Started
Minimal steps to run the evaluation and reproduce the metrics (a standalone plotting sketch follows the commands):
# setup
bash scripts/setup_venv.sh -e -x
source .venv/bin/activate
cp .env.example .env # fill in provider API keys
# evaluate and aggregate
python evaluation/evaluate_multi_models.py --out results/aggregated_summary.json
# render figures (optional)
python evaluation/render_graphs.py
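If the bundled rendering script does not fit your needs, a figure can also be produced directly from the aggregated JSON. The snippet below is a generic matplotlib sketch under the same assumed {model: {metric: value}} layout; it does not describe what evaluation/render_graphs.py does, and the "AST" metric key is hypothetical.

```python
import json
import matplotlib.pyplot as plt

# Generic sketch: bar chart of one metric across models, read from the
# aggregated JSON. The "AST" key and the {model: {metric: value}} layout
# are assumptions about the artifact, not its documented schema.
with open("results/aggregated_summary.json") as f:
    summary = json.load(f)

metric = "AST"  # hypothetical key; substitute any column from the Results table
models = [m for m, v in summary.items() if isinstance(v, dict) and metric in v]
values = [summary[m][metric] for m in models]

plt.figure(figsize=(8, 4))
plt.bar(models, values)
plt.ylabel(f"{metric} pass rate")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig(f"results/{metric.lower()}_pass_rate.png")
```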
See the README and README_zh for datasets, prompts, and script options.
Supplementary Material
See the Results page and the JSON artifact for detailed notes and aggregated metrics.
Citation
@article{ma2023exploring,
  title={Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs},
  author={Ma, Wei and Lin, Zhihao and Liu, Shangqing and Hu, Qiang and Liu, Ye and Wang, Wenhan and Zhang, Cen and Nie, Liming and Li, Li and Liu, Yang and Jiang, Lingxiao},
  journal={arXiv preprint arXiv:2305.12138},
  year={2023}
}
Acknowledgements
We thank Ximing Xing for providing the webpage template.