Usage
This document provides detailed instructions for using the factly command-line tool.
Command Line Interface
The primary entrypoint for Factly is the factly command-line interface, which provides tools to evaluate the factuality of LLMs on the MMLU benchmark.
Basic Usage
# Run factuality evaluation with default settings
factly mmlu
# Run evaluation and generate plots
factly mmlu --plot
# Get help on all available options
factly mmlu --help
# List available MMLU tasks
factly list-tasks
Command Structure
Factly provides the following commands:
factly [OPTIONS] COMMAND [ARGS]...
Main Commands:
- mmlu: Run factuality evaluation on MMLU benchmark
- list-tasks: List all available MMLU tasks
Common Global Options:
- --help: Show help message and exit
- --version: Show version and exit
Command Line Options for mmlu
- --instructions file
Path to YAML file with system instruction variants. Default: ./instructions.yaml.
- -m name, --model name
Model name to use for evaluation. Default: OPENAI_MODEL from environment variables, the .env file, or gpt-4o as a fallback.
- -u host, --url host
Model API URL to use for evaluation. Default: OPENAI_API_BASE from environment variables, the .env file, or unset.
- -a key, --api-key key
Model API key to use for evaluation. Default: OPENAI_API_KEY from environment variables, the .env file, or unset.
- -t float, --temperature float
Controls randomness in token selection. For MMLU benchmarking, 0.0 (deterministic) is the canonical setting to ensure reproducible results and measure raw model knowledge. Higher values introduce randomness, which isn’t suitable for standard benchmarking. Default: 0.0.
- --top-p float
Nucleus sampling parameter that controls how much of the probability mass the model samples from. For benchmarking, keep at 1.0 to disable nucleus sampling. Only modify when exploring controlled randomness in outputs. Default: 1.0.
- --max-tokens int
Maximum tokens per response. For standard MMLU with single-letter answers (A/B/C/D), use 1. For structured outputs or when using system prompts that encourage reasoning, use higher values (256 or more) and post-process to extract final answers. Default: 256.
- --n-shots int
Number of examples for few-shot learning. Increasing this value provides more demonstration examples in prompts to help the model understand the task format. Default: 0 (zero-shot evaluation).
- --tasks name
MMLU task categories to evaluate (can be repeated). Default: All tasks.
- -j int, --workers int
Maximum number of concurrent question evaluations. Default: Auto-detected based on system resources.
- --plot
Generate visualization plots. Default: False.
- --plot-path path
Path to save the plot. Default: ./outputs/factuality-<model>-t<count>.png.
- --verbose
Show detailed progress information during evaluation. Default: False.
- --help
Show help message and exit.
Note
- For canonical MMLU evaluation (comparable to published benchmarks), use --n-shots 0 with --max-tokens 1.
- When using few-shot learning (--n-shots >0), consider setting --max-tokens to a higher value (≥256) to allow the model to follow reasoning patterns from the examples.
- If you must use --n-shots >0 with --max-tokens 1, ensure your few-shot examples only demonstrate single-token answers without reasoning.
- For structured output (like JSON) or when using system prompts that encourage reasoning, always use --max-tokens >1 regardless of the --n-shots value.
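When --max-tokens is greater than 1, a response may contain reasoning before the answer, so the final letter has to be extracted afterwards. The following is a minimal sketch of that kind of post-processing, not Factly's actual parser: the function name extract_answer and both regular expressions are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper, not part of Factly's API.
# Matches phrases such as "answer is (C)" or "Answer: D"; the letter itself
# must be an uppercase A-D so ordinary words are not mistaken for a choice.
_EXPLICIT = re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([ABCD])\b")
# Fallback: any uppercase A-D that stands alone as a word.
_STANDALONE = re.compile(r"\b([ABCD])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the last A/B/C/D choice found in a model response, or None."""
    explicit = _EXPLICIT.findall(response)
    if explicit:
        return explicit[-1]
    standalone = _STANDALONE.findall(response)
    return standalone[-1] if standalone else None
```

Taking the last match rather than the first is a design choice: responses that reason step by step tend to state the final answer at the end. Responses with no recognizable letter return None so they can be counted as unanswered instead of being guessed.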
Command Line Options for list-tasks
- --help
Show help message and exit.
Advanced Usage
Task Selection
You can select specific MMLU tasks to evaluate:
# Evaluate a specific model on selected MMLU tasks
factly mmlu --model gpt-4o --tasks mathematics --tasks high_school_us_history
# Evaluate on STEM tasks only
factly mmlu --tasks STEM
# Evaluate on business-related tasks
factly mmlu --tasks BUSINESS
Few-Shot Learning
Configure the number of examples provided for few-shot learning:
# Zero-shot evaluation (default)
factly mmlu --n-shots 0
# 3-shot evaluation
factly mmlu --n-shots 3
# 5-shot evaluation
factly mmlu --n-shots 5
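To illustrate what --n-shots changes, the sketch below assembles a k-shot prompt by placing k solved examples before the target question. The prompt layout, the Example tuple, and the names format_question and build_prompt are all assumptions; Factly's actual prompt template may differ.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical data layout: (question, choices, correct_letter)
Example = Tuple[str, Sequence[str], str]

LETTERS = "ABCD"


def format_question(
    question: str, choices: Sequence[str], answer: Optional[str] = None
) -> str:
    """Render one multiple-choice question, with the answer filled in if given."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(
    examples: Sequence[Example],
    question: str,
    choices: Sequence[str],
    n_shots: int = 0,
) -> str:
    """Prepend the first n_shots solved examples to the unanswered question."""
    shots = [format_question(q, c, a) for q, c, a in examples[:n_shots]]
    return "\n\n".join(shots + [format_question(question, choices)])
```

With n_shots=0 the prompt contains only the target question. In this sketch each solved example shows a single-letter answer, which is the pattern a model would need to see if responses are limited to one token.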
Performance Optimization
Factly uses asynchronous concurrent processing to maximize evaluation throughput.
It evaluates multiple questions concurrently for each model, significantly reducing
total evaluation time. You can control the concurrency level with the --workers
parameter:
# Auto-determine optimal concurrency (default)
factly mmlu --tasks STEM
# Set concurrency level explicitly (process 20 questions in parallel)
factly mmlu --tasks STEM --workers 20
The implementation uses asyncio and semaphores for controlled concurrency with automatic
resource detection for optimal performance across different environments.
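The general pattern of semaphore-bounded concurrency with asyncio can be sketched as follows. This is an illustration of the technique, not Factly's source code: evaluate_all and the evaluate callback are hypothetical names, and automatic resource detection is left out, so the worker count is passed in explicitly.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def evaluate_all(
    questions: Sequence[str],
    evaluate: Callable[[str], Awaitable[str]],
    workers: int,
) -> List[str]:
    """Evaluate every question with at most `workers` calls in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(question: str) -> str:
        # Each task waits here until one of the `workers` slots is free.
        async with semaphore:
            return await evaluate(question)

    # gather returns results in the same order as the input questions.
    return list(await asyncio.gather(*(bounded(q) for q in questions)))
```

All tasks are created up front, but the semaphore keeps the number of simultaneous API calls at or below the limit, which is the role the --workers value plays on the command line.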
System Instructions
Factly supports different system instructions for prompt engineering experiments:
# Use the default instruction from instructions.yaml in current directory
factly mmlu
# Use custom instructions defined in ~/path/to/instructions.yaml
factly mmlu --instructions ~/path/to/instructions.yaml
By default, instructions are read from the instructions.yaml file in the current directory.
Each instruction should provide a different way to guide the model’s behavior when responding to questions.
Examples
Basic Evaluation
# Run basic evaluation with default settings
factly mmlu
# Run evaluation and generate plots
factly mmlu --plot
# Run verbose evaluation with plots
factly mmlu --verbose --plot
Subject-Specific Evaluation
# Evaluate mathematics knowledge
factly mmlu --tasks mathematics --n-shots 3 --plot
# Evaluate humanities subjects
factly mmlu --tasks high_school_european_history --tasks high_school_us_history --plot
# Evaluate computer science knowledge
factly mmlu --tasks computer_science --verbose --plot
Customized Evaluation
# Customize API settings and system instruction
factly mmlu \
-m gpt-4o-mini \
-u https://your-proxy.example.com/v1 \
-a your_api_key_here \
--instructions ~/path/to/instructions.yaml
# Customize model inference parameters
factly mmlu \
--model gpt-4o \
--temperature 0.7 \
--top-p 0.95 \
--max-tokens 512 \
--tasks mathematics \
--plot
Environment Variables
Instead of specifying command-line arguments each time, you can set environment variables in a .env file:
# API Configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-4o
OPENAI_API_BASE=your_api_base_url # Optional
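The option defaults described earlier draw on environment variables, the .env file, and a fallback value. The sketch below shows one way such a lookup could work. The precedence used here (command-line value, then environment variable, then .env entry, then fallback) is an assumption, as are the names parse_dotenv and resolve; Factly's actual loader may behave differently.

```python
import os
from typing import Dict, Mapping, Optional


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comment lines are skipped."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment", then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def resolve(
    cli_value: Optional[str],
    name: str,
    dotenv: Mapping[str, str],
    fallback: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty value: CLI, environment, .env, then fallback."""
    env = os.environ if environ is None else environ
    for candidate in (cli_value, env.get(name), dotenv.get(name)):
        if candidate:
            return candidate
    return fallback
```

For example, resolve(None, "OPENAI_MODEL", {}, fallback="gpt-4o", environ={}) returns "gpt-4o", while passing a non-empty cli_value always takes priority in this sketch.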