LLM evaluation

Pros

  • measures meaningful skills (coding)
  • automated evaluation (using unit tests)
  • mitigates problems of older approaches, e.g.:
    • BLEU (match-based, scored against reference solutions) penalizes candidates that are shorter than the reference (brevity penalty), even when they are functionally correct.
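The contrast between the two scoring styles can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the helper names (passes_tests, brevity_penalty), the check(...) test convention, and the toy add problem are all assumptions made for the example. The brevity penalty follows the standard BLEU definition, exp(1 - r/c) when the candidate length c is below the reference length r.

```python
import math


def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate function passes every assertion in the tests.

    Hypothetical helper: assumes test_src defines check(fn), which raises on failure.
    WARNING: exec runs arbitrary code; a real harness would sandbox this step.
    """
    env: dict = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        exec(test_src, env)             # define check(fn)
        env["check"](env[entry_point])  # run the unit tests
    except Exception:
        return False
    return True


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if c >= r, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Toy problem (made up for illustration).
CANDIDATE = "def add(a, b):\n    return a + b\n"
TESTS = (
    "def check(fn):\n"
    "    assert fn(1, 2) == 3\n"
    "    assert fn(-1, 1) == 0\n"
)

if __name__ == "__main__":
    # Functional correctness: the short solution passes.
    print(passes_tests(CANDIDATE, TESTS, "add"))
    # Match-based view: a 5-token candidate against a 20-token reference
    # is scaled down by the brevity penalty regardless of correctness.
    print(brevity_penalty(5, 20))
```

Under these assumptions the short candidate passes its tests, while the brevity penalty alone would scale a 5-token candidate's BLEU score against a 20-token reference by exp(-3), about 0.05.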

Cons

  • small size, prone to overfitting
  • limited scope (Python only, single-function problems)
  • potential contamination: the dataset is publicly available on GitHub, so models trained on newer GitHub scrapes may have seen it