llm-benchmarking

Here are 29 public repositories matching this topic...

robertvacareanu / llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates

linear-regression sklearn regression regression-models large-language-models llm llms llm-inference llm-benchmarking

Updated Sep 10, 2024
Python

lechmazur / confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

benchmark leaderboard gemini llama language-model claude rag o1 hallucinations ai-evaluation llm gemini-pro llm-benchmarking confabulations deepseek-r1 o3-mini

Updated Apr 12, 2025
HTML

alopatenko / LLMEvaluation

A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

evaluation llm llm-evaluation llm-benchmarking generative-ai-benchmarking

Updated Apr 9, 2025
HTML

lakeraai / pint-benchmark

A benchmark for prompt injection detection systems.

benchmark llm prompt-injection llm-security llm-benchmarking

Updated Feb 6, 2025
Jupyter Notebook

asimsinan / LLM-Research

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks

arxiv-papers large-language-models llm llms llm-datasets llm-tools buyuk-dil-modelleri llm-research llm-theses llm-benchmarking llm-frameworks

Updated Oct 8, 2024
Python

MJ-Bench / MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

reward-models multimodal-foundation-model llm-benchmarking llm-as-a-judge multimodal-judge

Updated Feb 23, 2025
Jupyter Notebook

nl4opt / ORQA

[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.

optimization linear-programming operations-research mathematical-modelling mixed-integer-programming multi-choice llm llm-reasoning llm-benchmarking llm4math aaai2025 ai4or llm4or llm4opt

Updated Mar 20, 2025

AKSW / LLM-KG-Bench

LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.

sparql rdf knowledge-graph large-language-models llm llm-benchmarking

Updated Apr 12, 2025
Python

lechmazur / deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

nlp machine-learning gemini llama language-model model-evaluation ai-safety mistral claude disinformation ai-security ai-benchmarks ai-evaluation llm llm-benchmarking gpt4o

Updated Mar 20, 2025

damianomarsili / VADAR

Program synthesis for 3D spatial reasoning

program-synthesis 3d spatial-reasoning llms llm-benchmarking

Updated Feb 20, 2025
Jupyter Notebook

aws-samples / fm-leaderboarder

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts

llm-evaluation llm-evaluation-framework llm-benchmarking

Updated Oct 31, 2024
Python

DMLinc / txt-2-sql-benchmark

An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions.

benchmark text-to-sql llm-benchmarking

Updated Aug 29, 2024
Jupyter Notebook

AUCOHL / RTL-Repo

RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24

verilog rtl-design llm llm-benchmarking

Updated Jun 5, 2024
Python

tongye98 / Awesome-Code-Benchmark

A comprehensive review of code-domain benchmarks for LLM research.

data-science awesome benchmarks code-generation code-completion bug-fixing reasoning multimodal codellm code-efficiency codellms llm-benchmarking

Updated Apr 13, 2025

Cristian-Curaba / CryptoFormalEval

We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.

cryptography communication-protocol evaluation vulnerability-detection llm-reasoning llm-benchmarking llm-based-agents

Updated Mar 26, 2025
Haskell

ronniross / coreAGIprotocol

The Core AGI Protocol provides a framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.

machine-learning agi dataset artificial-neural-networks artificial-general-intelligence machine-learning-library datasets quantum-field-theory machine-learning-projects artificial-gene-regulatory-networks llm llms llm-datasets quantum-fields llms-benchmarking llm-benchmarking artificial-general-super-intelligence agi-development

Updated Apr 13, 2025

mrigankpawagi / HinglishEval

Evaluating the Effectiveness of Code-generation Models on Hinglish Prompts

code-generation hinglish-dataset llm-benchmarking

Updated Apr 4, 2025
Python

cburst / LLMscripting

This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting

education llm llms llm-apps llm-evaluation llm-agents llm-benchmarking

Updated Mar 20, 2025
Python

ssai-trento / LLM-zero-shot-NL

Python code for the paper "LLMs are zero-shot next-location predictors" by Beneduce et al.

human-mobility llm llms next-location-prediction llm-benchmarking next-location

Updated Sep 1, 2024
Python

levitation-opensource / bioblue

Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with a simplified observation format. The benchmark themes include multi-objective homeostasis, (multi-objective) diminishing returns, complementary goods, sustainability, and multi-agent resource sharing.

python benchmarking sustainability multi-agent multi-objective ai-safety homeostasis ai-alignment llm-benchmarking diminishing-returns complementary-goods

Updated Apr 6, 2025
Python

