API Reference
The factly module
CLI tool to evaluate ChatGPT factuality on the MMLU benchmark.
The cli module
Factly CLI entrypoint.
- class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)
Custom Click group that displays a banner before the help text.
The __main__ module
- factly.__main__.init() → None
Run factly.cli.main() when the current file is executed by an interpreter.
This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.
The sys.exit() function is called with the return value of factly.cli.main(), following standard UNIX program conventions for exit codes.
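The guard described above can be sketched as follows. This is a minimal illustration, not Factly's actual source: the `main()` function here is a stand-in for factly.cli.main(), and the `module_name` parameter is an assumption added only so the behaviour can be exercised without exiting the interpreter.

```python
import sys


def main() -> int:
    """Stand-in for factly.cli.main(); returns a UNIX-style exit status."""
    return 0


def init(module_name: str = __name__) -> None:
    """Run main() only when the module is executed directly.

    The module_name parameter is hypothetical; the documented init()
    takes no arguments.
    """
    if module_name == "__main__":
        # Hand main()'s return value to the shell as the exit code.
        sys.exit(main())


# A real __main__.py would typically end with a bare call: init()
```

When the file is imported, `__name__` holds the module's dotted name, so the guard is false and nothing runs.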
The benchmarks module
- class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, verbose_mode: bool = False, confinement_instructions: str | None = None, **kwargs)
- async a_evaluate(model: FactlyGptModel, workers: int | None = None) → float
Evaluate a model on the MMLU benchmark with progress tracking.
Overrides the base MMLU evaluate method to provide a cleaner evaluation process with parallel question processing for better performance.
- Parameters:
model – The model to evaluate
workers – Number of concurrent question evaluations (default: auto-determined)
- Returns:
The overall accuracy score
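One common way to bound the number of concurrent question evaluations is an asyncio.Semaphore. The sketch below illustrates that general technique only; whether a_evaluate works this way is not stated here, and `answer_question` and `evaluate_concurrently` are hypothetical names with a dummy scoring rule in place of a real model call.

```python
import asyncio


async def answer_question(question: str) -> bool:
    """Hypothetical stand-in for one model call; True means a correct answer."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return len(question) % 2 == 0  # dummy rule, for illustration only


async def evaluate_concurrently(questions: list[str], workers: int = 4) -> float:
    """Score all questions with at most `workers` evaluations in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(question: str) -> bool:
        async with semaphore:
            return await answer_question(question)

    results = await asyncio.gather(*(bounded(q) for q in questions))
    # Accuracy is the fraction of correct answers.
    return sum(results) / len(results) if results else 0.0


accuracy = asyncio.run(
    evaluate_concurrently(["ab", "abc", "abcd", "abcde"], workers=2)
)
print(accuracy)  # 0.5 with the dummy rule above
```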
- factly.benchmarks.evaluate(instructions: Path, model: str, tasks: list[MMLUTask] | None = None, n_shots: int = 0, workers: int | None = None, verbose: bool = False, plot: bool = False, plot_path: Path | None = None)
Evaluate models with different prompts on the MMLU benchmark.
- Parameters:
instructions – Path to YAML file with system instructions
model – The LLM model to use
tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)
n_shots – Number of shots for few-shot learning (default: 0)
workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)
verbose – Whether to print detailed progress information (default: False)
plot – Whether to generate a plot of the results (default: False)
plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)
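The default plot_path pattern above could be built along these lines. This is a sketch under assumptions: `default_plot_path` is a hypothetical helper, `<count>` is taken to mean the number of tasks evaluated, and the model name is illustrative.

```python
from pathlib import Path


def default_plot_path(model: str, n_tasks: int, base_dir: Path = Path("outputs")) -> Path:
    """Build a path of the form outputs/factuality-<model>-t<count>.png.

    Hypothetical helper; assumes <count> is the number of tasks.
    """
    return base_dir / f"factuality-{model}-t{n_tasks}.png"


path = default_plot_path("gpt-4o", 2)
print(path.name)  # factuality-gpt-4o-t2.png
```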
The models module
The plots module
Plotting utilities for Factly benchmarks.
Add a metadata footer to the plot with date, model, and tasks information.
- Parameters:
fig – The matplotlib figure to add footer to
model_name – Name of the model used for evaluation
tasks – List of task names used in the evaluation
- factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) → Path
Generate a bar chart comparing factuality scores of different prompts.
- Parameters:
results – List of tuples containing (score, prompt_name)
model_name – Name of the LLM model used for the benchmark
output_path – Path to save the plot (default: creates outputs dir in cwd)
tasks – List of MMLU task names used in the benchmark
- Returns:
Path to the saved plot file
The resources module
The tasks module
MMLU task registry and management for Factly.
- factly.tasks.get_all_tasks() → list[MMLUTask]
Get all supported MMLU tasks.
- Returns:
List of all MMLU tasks supported by Factly
- factly.tasks.get_task_by_name(name: str) → MMLUTask | None
Get an MMLU task by its name (case-insensitive).
- Parameters:
name – The name of the task; a partial match is accepted
- Returns:
The matching MMLU task or None if not found
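A case-insensitive lookup with partial matching might look like the sketch below. The `Task` enum is a small stand-in for MMLUTask and `find_task` is a hypothetical name. How Factly resolves several partial matches is not stated here; this sketch prefers an exact match and otherwise returns the first partial match in definition order.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    """Stand-in for MMLUTask with a few illustrative members."""
    ASTRONOMY = "astronomy"
    COLLEGE_COMPUTER_SCIENCE = "college_computer_science"
    HIGH_SCHOOL_COMPUTER_SCIENCE = "high_school_computer_science"


def find_task(name: str) -> Optional[Task]:
    """Return the task matching `name`, ignoring case, or None."""
    needle = name.strip().lower()
    if not needle:
        return None
    # Prefer an exact match so a full name is never shadowed by a longer one.
    for task in Task:
        if task.value.lower() == needle:
            return task
    # Fall back to the first task whose name contains the query.
    for task in Task:
        if needle in task.value.lower():
            return task
    return None


print(find_task("ASTRO"))    # Task.ASTRONOMY
print(find_task("biology"))  # None
```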
- factly.tasks.get_tasks_by_category(category: TaskCategory) → list[MMLUTask]
Get all tasks belonging to a specific category.
- Parameters:
category – The category to filter by
- Returns:
List of MMLU tasks in the specified category
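Filtering by category can be pictured as a lookup over a task-to-category mapping. In the sketch below, the `Category` enum, the `TASK_CATEGORIES` table, and `tasks_in` are all hypothetical stand-ins; the real TaskCategory members and registry layout are not shown in this reference.

```python
from enum import Enum


class Category(Enum):
    """Stand-in for TaskCategory; the member names are assumptions."""
    STEM = "stem"
    HUMANITIES = "humanities"


# Hypothetical registry mapping task names to their category.
TASK_CATEGORIES: dict[str, Category] = {
    "astronomy": Category.STEM,
    "college_computer_science": Category.STEM,
    "philosophy": Category.HUMANITIES,
}


def tasks_in(category: Category) -> list[str]:
    """Return the names of all registered tasks in `category`."""
    return [name for name, cat in TASK_CATEGORIES.items() if cat is category]


print(tasks_in(Category.STEM))  # ['astronomy', 'college_computer_science']
```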