API Reference

The `factly` module

CLI tool to evaluate ChatGPT factuality on the MMLU benchmark.

The `cli` module

Factly CLI entrypoint.

class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)

Custom Click group that displays a banner before the help text.

format_help(ctx, formatter)

Writes the help into the formatter if it exists.

This method is called by Click when the help text is requested.

The `__main__` module

factly.__main__.init() → None

Run factly.cli.main() when the current file is executed by an interpreter.

This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.

The sys.exit() function is called with the return value of factly.cli.main(), following standard UNIX program conventions for exit codes.
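The behaviour described above follows the common `__main__` guard idiom. The following is a minimal sketch, not the package's actual code: `main()` here is a stand-in that simply returns an exit code, and the `module_name` parameter is a hypothetical addition so the guard can be exercised without terminating the interpreter.

```python
import sys


def main() -> int:
    """Stand-in for the CLI entry point (assumption); returns an exit code."""
    return 0


def init(module_name: str) -> None:
    """Call main() and exit with its return value only when run directly.

    `module_name` is a hypothetical parameter for this sketch; a module
    would normally pass its own `__name__`.
    """
    if module_name == "__main__":
        # sys.exit() raises SystemExit carrying main()'s return value.
        sys.exit(main())
```

In a `__main__.py` file the call would be `init(__name__)`. When the package is imported rather than executed, the guard is false and nothing runs.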

The `benchmarks` module

class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, verbose_mode: bool = False, confinement_instructions: str | None = None, **kwargs)

async a_evaluate(model: FactlyGptModel, workers: int | None = None) → float

Evaluate a model on the MMLU benchmark with progress tracking.

Overrides the base MMLU evaluate method and processes questions in parallel to reduce evaluation time.

Parameters:

model – The model to evaluate
workers – Number of concurrent question evaluations (default: auto-determined)

Returns:

The overall accuracy score
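One common way to cap the number of concurrent question evaluations is an `asyncio.Semaphore` around each call. The sketch below illustrates that general technique under stated assumptions. It is not taken from the `factly` source: `answer_question` is a hypothetical stand-in for a model call, and `evaluate_concurrently` is an illustrative name.

```python
import asyncio


async def answer_question(question: str, expected: str) -> bool:
    """Stand-in for one model call (assumption): 'answers' by uppercasing."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return question.upper() == expected


async def evaluate_concurrently(items: list[tuple[str, str]], workers: int) -> float:
    """Score all items with at most `workers` evaluations in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(question: str, expected: str) -> bool:
        # Only `workers` coroutines can hold the semaphore at once.
        async with semaphore:
            return await answer_question(question, expected)

    results = await asyncio.gather(*(bounded(q, e) for q, e in items))
    # Accuracy is the fraction of correct answers (0.0 for an empty input).
    return sum(results) / len(results) if results else 0.0
```

Calling `asyncio.run(evaluate_concurrently(items, workers=2))` returns the accuracy as a float between 0 and 1, matching the return type documented above.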

set_concurrency(max_concurrent: int | None = None)

Set the maximum number of concurrent question evaluations.

Parameters:

max_concurrent – Maximum number of concurrent question evaluations

factly.benchmarks.evaluate(instructions: Path, model: str, tasks: list[MMLUTask] | None = None, n_shots: int = 0, workers: int | None = None, verbose: bool = False, plot: bool = False, plot_path: Path | None = None)

Evaluate models with different prompts on the MMLU benchmark.

Parameters:

instructions – Path to YAML file with system instructions
model – The LLM model to use
tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)
n_shots – Number of shots for few-shot learning (default: 0)
workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)
verbose – Whether to print detailed progress information (default: False)
plot – Whether to generate a plot of the results (default: False)
plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)
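The default `plot_path` template can be filled in with `pathlib`. This is a small sketch only: it assumes that `<count>` stands for the number of tasks evaluated, which the reference does not state, and `default_plot_path` is a hypothetical helper name.

```python
from pathlib import Path


def default_plot_path(model: str, task_count: int) -> Path:
    """Fill the documented template ./outputs/factuality-<model>-t<count>.png.

    Assumption: <count> is the number of tasks in the run.
    """
    return Path("outputs") / f"factuality-{model}-t{task_count}.png"
```

For an example model name such as `"gpt-4o"` and two tasks, this gives `outputs/factuality-gpt-4o-t2.png` relative to the current working directory.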

factly.benchmarks.load_instructions(path: Path) → list[dict]

Load system instructions from a YAML file.

The `models` module

class factly.models.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, *args, **kwargs)

Factly GPT model.

async ainvoke(prompt: str, schema: BaseModel | None = None) → str | dict | BaseModel

Generate a response from the model asynchronously.

get_display_model_name() → str

Get the display model name.

Returns:

The display model name

load_model(async_mode: bool = False) → OpenAI | AsyncOpenAI

Load the OpenAI client in sync or async mode.

Parameters:

async_mode – Whether to load the async client

Returns:

OpenAI client instance

The `plots` module

Plotting utilities for Factly benchmarks.

factly.plots.add_metadata_footer(fig: Figure, model_name: str, tasks: list[str] | None = None) → None

Add a metadata footer to the plot with date, model, and tasks information.

Parameters:

fig – The matplotlib figure to add footer to
model_name – Name of the model used for evaluation
tasks – List of task names used in the evaluation
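The footer combines a date, the model name, and the task list. How the real function formats that text is not documented here, so the sketch below only illustrates one plausible way to assemble such a string. The `build_footer_text` name, the field labels, and the `" | "` separator are all assumptions.

```python
from datetime import date
from typing import Optional


def build_footer_text(
    model_name: str,
    tasks: Optional[list[str]] = None,
    today: Optional[date] = None,
) -> str:
    """Assemble a one-line footer; format and labels are illustrative only."""
    today = today or date.today()
    parts = [f"Date: {today.isoformat()}", f"Model: {model_name}"]
    if tasks:
        # Tasks are optional, mirroring the `tasks=None` default above.
        parts.append("Tasks: " + ", ".join(tasks))
    return " | ".join(parts)
```

With matplotlib, a string like this could be placed at the bottom of a figure using `fig.text(x, y, text)`. Whether `add_metadata_footer` does exactly that is not confirmed by this reference.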

factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) → Path

Generate a bar chart comparing factuality scores of different prompts.

Parameters:

results – List of tuples containing (score, prompt_name)
model_name – Name of the LLM model used for the benchmark
output_path – Path to save the plot (default: an outputs directory created in the current working directory)
tasks – List of MMLU task names used in the benchmark

Returns:

Path to the saved plot file

The `resources` module

The `tasks` module

MMLU task registry and management for Factly.

class factly.tasks.TaskCategory(*values)

Categories for organizing MMLU tasks.

factly.tasks.get_all_tasks() → list[MMLUTask]

Get all supported MMLU tasks.

Returns:

List of all MMLU tasks supported by Factly

factly.tasks.get_task_by_name(name: str) → MMLUTask | None

Get an MMLU task by its name (case-insensitive).

Parameters:

name – The name of the task; may be a partial match

Returns:

The matching MMLU task, or None if not found
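A case-insensitive lookup that also accepts partial names can be sketched as follows. `DemoTask` is a hypothetical stand-in for `MMLUTask`, and preferring an exact match before falling back to the first substring match is an assumption. The reference does not say how the real function resolves ambiguous partial names.

```python
from enum import Enum
from typing import Optional


class DemoTask(Enum):
    """Hypothetical stand-in for MMLUTask, for illustration only."""

    ASTRONOMY = "astronomy"
    COLLEGE_COMPUTER_SCIENCE = "college_computer_science"
    HIGH_SCHOOL_COMPUTER_SCIENCE = "high_school_computer_science"


def find_task(name: str) -> Optional[DemoTask]:
    """Return the exact match if any, else the first task containing `name`."""
    needle = name.strip().lower()
    if not needle:
        return None
    # First pass: exact, case-insensitive match on the task value.
    for task in DemoTask:
        if task.value.lower() == needle:
            return task
    # Second pass: first task (in definition order) whose value contains it.
    for task in DemoTask:
        if needle in task.value.lower():
            return task
    return None
```

Here `find_task("ASTRONOMY")` and `find_task("astro")` both resolve to `DemoTask.ASTRONOMY`, while an unknown name yields `None`.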

factly.tasks.get_tasks_by_category(category: TaskCategory) → list[MMLUTask]

Get all tasks belonging to a specific category.

Parameters:

category – The category to filter by

Returns:

List of MMLU tasks in the specified category

factly.tasks.list_available_tasks() → str

Generate a formatted string listing all available tasks.

Returns:

Formatted string with all available tasks grouped by category
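Filtering by category and producing a grouped listing can be sketched with a small registry. Everything named here (`DemoCategory`, `DEMO_REGISTRY`, `tasks_in_category`, `format_task_listing`) is hypothetical, as is the output layout. The actual categories and formatting used by Factly are not shown in this reference.

```python
from enum import Enum


class DemoCategory(Enum):
    """Hypothetical stand-in for TaskCategory."""

    STEM = "STEM"
    HUMANITIES = "Humanities"


# Hypothetical registry mapping each category to task names.
DEMO_REGISTRY = {
    DemoCategory.STEM: ["astronomy", "college_computer_science"],
    DemoCategory.HUMANITIES: ["philosophy"],
}


def tasks_in_category(category: DemoCategory) -> list[str]:
    """Return a copy of the task names registered under `category`."""
    return list(DEMO_REGISTRY.get(category, []))


def format_task_listing() -> str:
    """Build a plain-text listing with one heading per category."""
    lines: list[str] = []
    for category in DemoCategory:
        lines.append(f"{category.value}:")
        lines.extend(f"  - {name}" for name in tasks_in_category(category))
    return "\n".join(lines)
```

Keeping the category filter separate from the formatter means the same registry backs both the per-category query and the grouped listing.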

factly.tasks.resolve_tasks(task_names: list[str]) → list[MMLUTask]

Resolve a list of task names to actual MMLU tasks.

Parameters:

task_names – List of task names provided by the user

Returns:

List of resolved MMLU tasks

Raises:

ValueError – If any task name cannot be resolved
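The resolve-or-fail contract can be sketched as follows: every name must map to a known task, otherwise a `ValueError` is raised. `KNOWN_TASKS` and `resolve_names` are hypothetical, and both the substring matching and the decision to report all unknown names in one error are assumptions rather than documented behaviour.

```python
# Hypothetical set of known task names, for illustration only.
KNOWN_TASKS = ("astronomy", "college_computer_science", "philosophy")


def resolve_names(task_names: list[str]) -> list[str]:
    """Map user-supplied names to known tasks, failing on any unknown name."""
    resolved: list[str] = []
    unknown: list[str] = []
    for raw in task_names:
        needle = raw.strip().lower()
        # First known task containing the (non-empty) needle, else None.
        match = next((t for t in KNOWN_TASKS if needle and needle in t), None)
        if match is None:
            unknown.append(raw)
        else:
            resolved.append(match)
    if unknown:
        raise ValueError(f"Unknown task name(s): {', '.join(unknown)}")
    return resolved
```

Collecting all unresolved names before raising lets a CLI user correct every typo in one pass instead of one error at a time.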
