Exploring Code Analysis: Zero‑Shot Insights on Syntax and Semantics with LLMs


Wei Ma1 Zhihao Lin2 Shangqing Liu3 Qiang Hu4 Ye Liu1 Wenhan Wang5 Cen Zhang6 Liming Nie6 Li Li2 Yang Liu6 Lingxiao Jiang1
equal contribution
1 Singapore Management University, Singapore 2 Beihang University, China 3 State Key Laboratory of Novel Software Technology, Nanjing University, China 4 Tianjin University, China 5 University of Alberta, Canada 6 Nanyang Technological University, Singapore

arXiv preprint (2023)


Overview

LLM Code Analysis is a lightweight, reproducible evaluation framework that measures how well LLMs perform program‑analysis tasks.

  • Standardized datasets, prompts, parsing, and scoring
  • Unified metrics and JSON artifacts for auditability
  • Consistent diagnosis rules and error categorization
  • Compact review UI to browse model‑ and task‑wise results
  • Coverage: Syntax (AST, Expression), Semantic/Static (CFG/CG, DP/Taint, Pointer), Dynamic (Mutant, Flaky)

Use this site to compare models, inspect aggregate numbers, and regenerate figures/tables with the provided scripts.
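As a quick illustration, the aggregated JSON artifact can be inspected directly. The sketch below assumes a simple mapping from model names to per-task metric dictionaries; the actual schema in results/aggregated_summary.json may differ.

# minimal sketch (Python); field names are illustrative, not the repository's exact schema
import json

with open("results/aggregated_summary.json") as f:
    summary = json.load(f)

for model, tasks in summary.items():
    # tasks might look like {"AST": {"passes": 38, "cases": 50}, ...} (illustrative)
    print(model)
    for task, metrics in tasks.items():
        print(f"  {task}: {metrics}")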


Figure: Software Engineering (SE) task overview (from the paper, figures/se_tasks_refined.pdf).

Results (Sortable)

Click headers to sort. The first header row indicates task categories: Syntax, Semantic/Static, Dynamic.


Tasks & Metrics


Nine tasks across Syntax, Semantic/Static, and Dynamic dimensions with unified metrics and diagnosis rules; the metric definitions are sketched in code after the list.

AST (Syntax Tree)
  • Pass rate: AST_passes / AST_cases
  • Shown in Results as AST
Expression
  • Hit@5 / Hit@10 / Hit@20
  • Shown in Results as Expr@k
CFG (Control‑Flow Graph)
  • Pass rate: CFG_passes / CFG_cases
  • Shown in Results as CFG
CG (Call Graph)
  • Pass rate: CG_passes / CG_cases
  • Shown in Results as CG
DP (Data‑flow)
  • F1 score
  • Shown in Results as DP F1
Taint
  • F1 score
  • Shown in Results as Taint F1
Pointer
  • Accuracy
  • Shown in Results as Pointer
Mutant
  • Few‑shot success rate
  • Zero‑shot success rate
Flaky
  • Summary accuracy
  • Concept accuracy
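For reference, the metric definitions above can be written compactly. The functions below are an illustrative sketch, not the repository's API.

# minimal sketch of the metric formulas (illustrative function names)
def pass_rate(passes: int, cases: int) -> float:
    # AST/CFG/CG pass rate: passes / cases
    return passes / cases if cases else 0.0

def hit_at_k(ranked: list, reference: str, k: int) -> bool:
    # Expression Hit@k: the reference expression appears in the top-k predictions
    return reference in ranked[:k]

def f1_score(tp: int, fp: int, fn: int) -> float:
    # DP/Taint F1: harmonic mean of precision and recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0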

Artifacts & Plots


Snapshot numbers are available via the JSON artifact and the Results section. The repository includes scripts to regenerate figures and tables; an illustrative plotting sketch follows the figure list below.

Figure: AST pass/fail bars (latest outputs).
Figure: CFG pass/fail bars (latest outputs).
Figure: CG pass/fail bars (latest outputs).
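Such bars can be regenerated from the JSON artifact. The sketch below uses placeholder model names and counts and is not the repository's evaluation/render_graphs.py.

# minimal matplotlib sketch with placeholder counts (not the repository script)
import matplotlib.pyplot as plt

counts = {"model-a": (38, 12), "model-b": (27, 23)}  # (passes, fails), illustrative
models = list(counts)
passes = [counts[m][0] for m in models]
fails = [counts[m][1] for m in models]

plt.bar(models, passes, label="pass")
plt.bar(models, fails, bottom=passes, label="fail")
plt.ylabel("cases")
plt.title("AST pass/fail by model (placeholder data)")
plt.legend()
plt.savefig("ast_pass_fail.png")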

Get Started


Minimal steps to run and reproduce metrics:

# setup
bash scripts/setup_venv.sh -e -x
source .venv/bin/activate
cp .env.example .env  # fill in provider API keys

# evaluate and aggregate
python evaluation/evaluate_multi_models.py --out results/aggregated_summary.json

# render figures (optional)
python evaluation/render_graphs.py

See README and README_zh for datasets, prompts, and script options.

Supplementary Material


See the Results section and the aggregated JSON artifact for detailed notes and metrics.


Citation

@article{ma2023exploring,
  title={Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs},
  author={Ma, Wei and Lin, Zhihao and Liu, Shangqing and Hu, Qiang and Liu, Ye and Wang, Wenhan and Zhang, Cen and Nie, Liming and Li, Li and Liu, Yang and Jiang, Lingxiao},
  journal={arXiv preprint arXiv:2305.12138},
  year={2023}
}


Acknowledgements

We thank Ximing Xing for providing the webpage template.