API Reference

The factly module

CLI tool to evaluate ChatGPT factuality on the MMLU benchmark.

The cli module

Factly CLI entrypoint.

class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)

Custom Click group that displays a banner before the help text.

format_help(ctx, formatter)

Writes the help into the formatter if it exists.

This method is called by Click when the help text is requested.

The __main__ module

factly.__main__.init() → None

Run factly.cli.main() when the current file is executed by an interpreter.

This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.

The sys.exit() function is called with the return value of factly.cli.main(), following standard UNIX program conventions for exit codes.
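The behaviour described above matches a common Python entrypoint pattern. The following is a minimal sketch rather than Factly's actual code: main() is a stand-in that is assumed to return an integer exit code, and the name parameter is a hypothetical addition that makes the guard easy to exercise.

```python
import sys


def main() -> int:
    """Stand-in for factly.cli.main(); assumed to return an exit code."""
    return 0


def init(name: str = __name__) -> None:
    """Call main() and exit with its return value, but only when run directly."""
    if name == "__main__":
        # sys.exit() raises SystemExit carrying the code returned by main().
        sys.exit(main())
```

In a real __main__.py the module would end with a bare init() call, so that running the package with the interpreter's -m option triggers the exit while a plain import does nothing.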

The benchmarks module

class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, verbose_mode: bool = False, confinement_instructions: str | None = None, **kwargs)

async a_evaluate(model: FactlyGptModel, workers: int | None = None) → float

Evaluate a model on the MMLU benchmark with progress tracking.

Overrides the base MMLU evaluate method so that questions are evaluated concurrently rather than one at a time, which improves performance.

Parameters:
  • model – The model to evaluate

  • workers – Number of concurrent question evaluations (default: auto-determined)

Returns:

The overall accuracy score

set_concurrency(max_concurrent: int | None = None)

Set the maximum number of concurrent question evaluations.

Parameters:

max_concurrent – Maximum number of concurrent question evaluations
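One common way to cap the number of concurrent question evaluations is an asyncio.Semaphore. The sketch below illustrates that idea only; it is not taken from Factly's source. answer_question() and evaluate_questions() are hypothetical names, and the "model call" is faked so the example runs without network access.

```python
import asyncio


async def answer_question(question: str) -> bool:
    """Hypothetical stand-in for one model call; True means a correct answer."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return len(question) % 2 == 0  # fake grading rule, for illustration only


async def evaluate_questions(questions: list[str], max_concurrent: int = 4) -> float:
    """Return accuracy while keeping at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(question: str) -> bool:
        async with semaphore:  # blocks while max_concurrent calls are running
            return await answer_question(question)

    results = await asyncio.gather(*(bounded(q) for q in questions))
    return sum(results) / len(results) if results else 0.0


accuracy = asyncio.run(
    evaluate_questions(["ab", "abc", "abcd", "abcde"], max_concurrent=2)
)
```

The semaphore keeps the limit independent of how many questions are scheduled, which is why a single max_concurrent value is enough to tune throughput.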

factly.benchmarks.evaluate(instructions: Path, model: str, tasks: list[MMLUTask] | None = None, n_shots: int = 0, workers: int | None = None, verbose: bool = False, plot: bool = False, plot_path: Path | None = None)

Evaluate a model with different system instructions (prompts) on the MMLU benchmark.

Parameters:
  • instructions – Path to YAML file with system instructions

  • model – The LLM model to use

  • tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)

  • n_shots – Number of shots for few-shot learning (default: 0)

  • workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)

  • verbose – Whether to print detailed progress information (default: False)

  • plot – Whether to generate a plot of the results (default: False)

  • plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)

factly.benchmarks.load_instructions(path: Path) → list[dict]

Load system instructions from a YAML file.

The models module

class factly.models.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, *args, **kwargs)

Factly GPT model.

async ainvoke(prompt: str, schema: BaseModel | None = None) → str | dict | BaseModel

Generate a response from the model asynchronously.

get_display_model_name() → str

Get the display model name.

Returns:

The display model name

load_model(async_mode: bool = False) → OpenAI | AsyncOpenAI

Load the OpenAI client in sync or async mode.

Parameters:

async_mode – Whether to load the async client

Returns:

OpenAI client instance

The plots module

Plotting utilities for Factly benchmarks.

Add a metadata footer to the plot with date, model, and tasks information.

Parameters:
  • fig – The matplotlib figure to add footer to

  • model_name – Name of the model used for evaluation

  • tasks – List of task names used in the evaluation

factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) → Path

Generate a bar chart comparing factuality scores of different prompts.

Parameters:
  • results – List of tuples containing (score, prompt_name)

  • model_name – Name of the LLM model used for the benchmark

  • output_path – Path to save the plot (default: creates outputs dir in cwd)

  • tasks – List of MMLU task names used in the benchmark

Returns:

Path to the saved plot file

The resources module

The tasks module

MMLU task registry and management for Factly.

class factly.tasks.TaskCategory(*values)

Categories for organizing MMLU tasks.

factly.tasks.get_all_tasks() → list[MMLUTask]

Get all supported MMLU tasks.

Returns:

List of all MMLU tasks supported by Factly

factly.tasks.get_task_by_name(name: str) → MMLUTask | None

Get an MMLU task by its name (case-insensitive).

Parameters:

name – The name of the task; a partial match is accepted

Returns:

The matching MMLU task or None if not found
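A case-insensitive lookup that also accepts partial names can be sketched as follows. This is an illustration under stated assumptions, not Factly's implementation: SampleTask is a hypothetical three-member stand-in for MMLUTask, find_task() is a hypothetical name, and the rule of preferring an exact match and then the first partial match is a choice made here, since the source does not say how ties are resolved.

```python
from enum import Enum
from typing import Optional


class SampleTask(Enum):
    """Hypothetical stand-in for MMLUTask with only three members."""

    ASTRONOMY = "astronomy"
    COLLEGE_COMPUTER_SCIENCE = "college_computer_science"
    HIGH_SCHOOL_COMPUTER_SCIENCE = "high_school_computer_science"


def find_task(name: str) -> Optional[SampleTask]:
    """Return an exact match if any, else the first partial match, else None."""
    needle = name.strip().lower()
    if not needle:
        return None
    # First pass: exact, case-insensitive match on the task value.
    for task in SampleTask:
        if task.value == needle:
            return task
    # Second pass: substring match, in definition order (assumed tie-break).
    for task in SampleTask:
        if needle in task.value:
            return task
    return None
```

Returning None instead of raising leaves the decision about error handling to the caller.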

factly.tasks.get_tasks_by_category(category: TaskCategory) → list[MMLUTask]

Get all tasks belonging to a specific category.

Parameters:

category – The category to filter by

Returns:

List of MMLU tasks in the specified category

factly.tasks.list_available_tasks() → str

Generate a formatted string listing all available tasks.

Returns:

Formatted string with all available tasks grouped by category

factly.tasks.resolve_tasks(task_names: list[str]) → list[MMLUTask]

Resolve a list of task names to actual MMLU tasks.

Parameters:

task_names – List of task names provided by the user

Returns:

List of resolved MMLU tasks

Raises:

ValueError – If any task name cannot be resolved
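The contract above (resolve every name or raise ValueError) can be sketched as below. This is a hedged illustration, not Factly's code: DemoTask is a hypothetical two-member stand-in for MMLUTask, resolve() is a hypothetical name, matching is exact after lowercasing, and collecting all unknown names into one error message is an assumption made here.

```python
from enum import Enum


class DemoTask(Enum):
    """Hypothetical stand-in for MMLUTask."""

    ASTRONOMY = "astronomy"
    VIROLOGY = "virology"


def resolve(task_names: list[str]) -> list[DemoTask]:
    """Map each name to a DemoTask; raise ValueError naming every unknown one."""
    by_name = {task.value: task for task in DemoTask}
    resolved: list[DemoTask] = []
    unknown: list[str] = []
    for raw in task_names:
        task = by_name.get(raw.strip().lower())
        if task is None:
            unknown.append(raw)
        else:
            resolved.append(task)
    if unknown:
        # Report all bad names at once so the user can fix them in one pass.
        raise ValueError(f"Unknown task name(s): {', '.join(unknown)}")
    return resolved
```

A caller such as a CLI command would typically catch the ValueError and print its message next to the list of available tasks.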