Usage
This document provides detailed instructions for using the factly command-line tool.
Command Line Interface
The primary entrypoint for Factly is the factly command-line interface, which provides tools to evaluate the factuality of Large Language Models (LLMs) on the MMLU benchmark.
Basic Usage
# Run factuality evaluation with default settings
factly evaluate
# Run evaluation and generate plots
factly evaluate --plot
# Get help on all available options
factly evaluate --help
# List available MMLU tasks
factly list-tasks
Command Structure
Every Factly invocation follows this general form:
factly [OPTIONS] COMMAND [ARGS]...
Main Commands:
evaluate: Run factuality evaluation on MMLU benchmark
list-tasks: List all available MMLU tasks
Common Global Options:
--help: Show help message and exit
--version: Show version and exit
Command Line Options for evaluate
| Option | Description | Default |
|---|---|---|
| --model | The OpenAI model to use for evaluation | From |
| --tasks | MMLU task categories to evaluate (can be repeated) | All tasks |
| --n-shots | Number of examples for few-shot learning | 0 |
| --workers | Maximum number of concurrent API requests | Auto-detected based on system resources |
| --instructions | Path to YAML file with system instruction variants | instructions.yaml in the current directory |
| --plot | Generate visualization plots | |
| | Path to save the plot | |
| --verbose | Enable verbose output | |
| --help | Show help message and exit | |
Command Line Options for list-tasks
| Option | Description | Default |
|---|---|---|
| --help | Show help message and exit | |
Advanced Usage
Task Selection
You can select specific MMLU tasks to evaluate:
# Evaluate a specific model on selected MMLU tasks
factly evaluate --model gpt-4o --tasks mathematics --tasks high_school_us_history
# Evaluate on STEM tasks only
factly evaluate --tasks STEM
# Evaluate on business-related tasks
factly evaluate --tasks BUSINESS
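When driving evaluations from a script, it can help to assemble the argument list programmatically. The sketch below is a minimal illustration, not part of Factly: `build_evaluate_command` is a hypothetical helper, and it only emits flags that appear in this document. It shows that `--tasks` is passed once per task.

```python
import shlex


def build_evaluate_command(model=None, tasks=(), n_shots=None, plot=False):
    """Build an argv list for `factly evaluate` (hypothetical helper)."""
    argv = ["factly", "evaluate"]
    if model is not None:
        argv += ["--model", model]
    for task in tasks:
        # --tasks is repeated once per task, as in the examples above
        argv += ["--tasks", task]
    if n_shots is not None:
        argv += ["--n-shots", str(n_shots)]
    if plot:
        argv.append("--plot")
    return argv


if __name__ == "__main__":
    argv = build_evaluate_command(
        model="gpt-4o",
        tasks=["mathematics", "high_school_us_history"],
    )
    # Print a shell-safe command line; pass argv to subprocess.run to execute it
    print(shlex.join(argv))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems when task names come from user input.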
Few-Shot Learning
Configure the number of examples provided for few-shot learning:
# Zero-shot evaluation (default)
factly evaluate --n-shots 0
# 3-shot evaluation
factly evaluate --n-shots 3
# 5-shot evaluation
factly evaluate --n-shots 5
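To make the effect of `--n-shots` concrete, here is a rough sketch of how solved examples are typically prepended to a multiple-choice question. The prompt layout and the `build_prompt` helper are assumptions for illustration; Factly's actual prompt format may differ.

```python
def build_prompt(question, choices, examples, n_shots=0):
    """Format a 4-choice prompt preceded by up to n_shots solved examples.

    `examples` is a list of (question, choices, answer_letter) tuples.
    This layout is illustrative only, not Factly's real template.
    """
    letters = "ABCD"

    def render(q, opts, answer=None):
        lines = [q] + [f"{letters[i]}. {opt}" for i, opt in enumerate(opts)]
        # Solved examples show their answer; the final question leaves it open
        lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
        return "\n".join(lines)

    shots = [render(q, opts, ans) for q, opts, ans in examples[:n_shots]]
    return "\n\n".join(shots + [render(question, choices)])


if __name__ == "__main__":
    solved = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
    print(build_prompt("1 + 1 = ?", ["1", "2", "3", "4"], solved, n_shots=1))
```

With `n_shots=0` the prompt contains only the target question, which corresponds to the zero-shot default.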
Performance Optimization
Factly uses asynchronous concurrent processing to maximize evaluation throughput.
It evaluates multiple questions concurrently for each model, significantly reducing
total evaluation time. You can control the concurrency level with the --workers
parameter:
# Auto-determine optimal concurrency (default)
factly evaluate --tasks STEM
# Set concurrency level explicitly (process 20 questions in parallel)
factly evaluate --tasks STEM --workers 20
The implementation uses asyncio and semaphores for controlled concurrency with automatic
resource detection for optimal performance across different environments.
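The general pattern of bounding concurrency with a semaphore can be sketched as follows. This is a minimal illustration of the technique, not Factly's source code: `evaluate_all` and `fake_answer` are hypothetical names, and `fake_answer` stands in for a real API request.

```python
import asyncio


async def evaluate_all(questions, answer_fn, workers=20):
    """Run answer_fn over all questions with at most `workers` calls in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(question):
        # Each task waits here until one of the `workers` slots is free
        async with semaphore:
            return await answer_fn(question)

    # gather returns results in the same order as the input questions
    return await asyncio.gather(*(bounded(q) for q in questions))


async def fake_answer(question):
    # Stand-in for a network call to the model API
    await asyncio.sleep(0.01)
    return question.upper()


if __name__ == "__main__":
    print(asyncio.run(evaluate_all(["a", "b", "c"], fake_answer, workers=2)))
```

A higher `workers` value raises throughput until API rate limits or local resources become the bottleneck, which is presumably why the default is auto-detected.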
System Instructions
Factly supports different system instructions for prompt engineering experiments:
# Use the default instructions from instructions.yaml in the current directory
factly evaluate
# Use custom instructions defined in ~/path/to/instructions.yaml
factly evaluate --instructions ~/path/to/instructions.yaml
By default, instructions are read from the instructions.yaml file in the current directory.
Each instruction should provide a different way to guide the model’s behavior when responding to questions.
Examples
Basic Evaluation
# Run basic evaluation with default settings
factly evaluate
# Run evaluation and generate plots
factly evaluate --plot
# Run verbose evaluation with plots
factly evaluate --verbose --plot
Subject-Specific Evaluation
# Evaluate mathematics knowledge
factly evaluate --tasks mathematics --n-shots 3 --plot
# Evaluate humanities subjects
factly evaluate --tasks high_school_european_history --tasks high_school_us_history --plot
# Evaluate computer science knowledge
factly evaluate --tasks computer_science --verbose --plot
Customized Evaluation
# Customize API settings and system instruction
export OPENAI_API_BASE=https://your-proxy.example.com/v1
factly evaluate --model gpt-4o-mini --instructions ~/path/to/instructions.yaml
Environment Variables
Instead of specifying command-line arguments each time, you can set environment variables in a .env file:
# API Configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-4o
OPENAI_API_BASE=your_api_base_url # Optional
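For readers who want to see how such a file might be consumed, the sketch below parses `KEY=VALUE` lines with the standard library only. It is an assumption-laden illustration: `parse_env_file` and `resolve_model` are hypothetical helpers, the inline-comment handling is a simplification, and the precedence shown (an explicit `--model` value wins over `OPENAI_MODEL`) is assumed rather than confirmed by this document.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "# Optional" (simplified rule)
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def resolve_model(cli_model, env):
    """Assumed precedence: explicit CLI value first, then OPENAI_MODEL."""
    return cli_model or env.get("OPENAI_MODEL")


if __name__ == "__main__":
    sample = "OPENAI_MODEL=gpt-4o\nOPENAI_API_BASE=your_api_base_url # Optional\n"
    env = parse_env_file(sample)
    print(env)
    print(resolve_model(None, env))
```

Real .env loaders differ in how they treat quotes and inline comments, so values containing `#` or spaces are best quoted and tested against the loader actually in use.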