API Reference

The factly module

CLI tool to evaluate ChatGPT factuality on MMLU benchmark.

The __main__ module

factly.__main__.init() None[source]

Run factly.cli.main() when current file is executed by an interpreter.

This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.

The sys.exit() function is called with the return value of factly.cli.main(), following standard UNIX program conventions for exit codes.

The factly.llms module

class factly.llms.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Factly GPT model.

__init__(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Initialize the Factly GPT model.

Parameters:
  • model – Model identifier, can be “<provider>/<model>” for LiteLLM or “<model>” for direct provider models

  • system_prompt – System prompt to use for generating responses

  • prompt_name – Display name for this model configuration in reports

  • temperature – Sampling temperature between 0.0 and 2.0

  • top_p – Nucleus sampling parameter between 0.0 and 1.0

  • max_tokens – Maximum number of tokens to generate

async ainvoke(prompt: str, schema: BaseModel | None = None) str | dict | BaseModel[source]

Generate a response from the model asynchronously.

load_model(async_mode: bool = False) OpenAI | AsyncOpenAI[source]

Load the OpenAI client in sync or async mode.

Parameters:

async_mode – Whether to load the async client

Returns:

OpenAI client instance

class factly.llms.FactlyModelMixin[source]

Mixin providing common functionality for Factly LLM models.

This mixin standardizes common operations across different model implementations such as message formatting and model name handling.

__weakref__

list of weak references to the object

create_messages(prompt: str)[source]

Format prompt into chat completion messages.

Parameters:

prompt – User input prompt

Returns:

List of message objects with appropriate roles

static get_actual_model_name(model_name: str) str[source]

Extract base model name from provider-prefixed format.

Parameters:

model_name – Original model identifier, potentially in “<provider>/<model>” format

Returns:

The model name without provider prefix

get_display_model_name() str[source]

Get model name for display in UI and reports.

Returns:

Model name suitable for display

The factly.llms.base_model module

Common mixin for Factly LLM model implementations.

class factly.llms.base_model.FactlyModelMixin[source]

Mixin providing common functionality for Factly LLM models.

This mixin standardizes common operations across different model implementations such as message formatting and model name handling.

create_messages(prompt: str)[source]

Format prompt into chat completion messages.

Parameters:

prompt – User input prompt

Returns:

List of message objects with appropriate roles

static get_actual_model_name(model_name: str) str[source]

Extract base model name from provider-prefixed format.

Parameters:

model_name – Original model identifier, potentially in “<provider>/<model>” format

Returns:

The model name without provider prefix

get_display_model_name() str[source]

Get model name for display in UI and reports.

Returns:

Model name suitable for display

The factly.llms.openai_model module

class factly.llms.openai_model.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Factly GPT model.

async ainvoke(prompt: str, schema: BaseModel | None = None) str | dict | BaseModel[source]

Generate a response from the model asynchronously.

load_model(async_mode: bool = False) OpenAI | AsyncOpenAI[source]

Load the OpenAI client in sync or async mode.

Parameters:

async_mode – Whether to load the async client

Returns:

OpenAI client instance

The cli module

Factly CLI entrypoint.

class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)[source]

Custom Click group that displays a banner before the help text.

format_help(ctx, formatter)[source]

Writes the help into the formatter if it exists.

This method is called by Click when the help text is requested.

Get copyright info.

factly.cli.get_version() str[source]

Get version info.

The benchmarks module

class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, **kwargs)[source]
async a_evaluate(model: FactlyGptModel, workers: int | None = None) float[source]

Evaluate a model on the MMLU benchmark with progress tracking.

Overrides the base MMLU evaluate method to provide a cleaner evaluation process with parallel question processing for better performance.

Parameters:
  • model – The model to evaluate

  • workers – Number of concurrent question evaluations (default: auto-determined)

Returns:

The overall accuracy score

set_concurrency(max_concurrent: int | None = None)[source]

Set the maximum number of concurrent question evaluations.

Parameters:

max_concurrent – Maximum number of concurrent question evaluations

factly.benchmarks.evaluate(instructions: Path, settings: FactlySettings, tasks: list[MMLUTask] | None = None, workers: int | None = None, plot: bool = False, plot_path: Path | None = None)[source]

Evaluate models with different prompts on the MMLU benchmark.

Parameters:
  • instructions – Path to YAML file with system instructions

  • settings – FactlySettings object containing model and inference settings

  • tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)

  • workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)

  • plot – Whether to generate a plot of the results (default: False)

  • plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)

factly.benchmarks.load_instructions(path: Path) list[dict][source]

Load system instructions from a YAML file.

The plots module

Plotting utilities for Factly benchmarks.

Add a metadata footer to the plot with date, model, and tasks information.

Parameters:
  • fig – The matplotlib figure to add footer to

  • model_name – Name of the model used for evaluation

  • tasks – List of task names used in the evaluation

factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) Path[source]

Generate a bar chart comparing factuality scores of different prompts.

Parameters:
  • results – List of tuples containing (score, prompt_name)

  • model_name – Name of the LLM model used for the benchmark

  • output_path – Path to save the plot (default: creates outputs dir in cwd)

  • tasks – List of MMLU task names used in the benchmark

Returns:

Path to the saved plot file

The resources module

The settings module

Settings module for Factly CLI.

Defines configuration models for API, inference, and overall application settings.

class factly.settings.FactlySettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, model: ~factly.settings.ModelSettings = <factory>, inference: ~factly.settings.InferenceSettings = <factory>)[source]

Aggregated settings for the Factly CLI, including model and inference configuration.

model

Model API and authentication settings.

Type:

ModelSettings

inference

Inference-time decoding parameters.

Type:

InferenceSettings

classmethod create(**kwargs) FactlySettings[source]

Create FactlySettings with optional overrides.

This factory method handles nested configuration with dictionaries.

Parameters:

**kwargs – Configuration overrides including nested dictionaries for model and inference settings.

Returns:

A settings instance.

Return type:

FactlySettings

Example

>>> # Create with API key and custom temperature
>>> settings = FactlySettings.create(
...     model={"api_key": "sk-abc123", "model": "gpt-4o"},
...     inference={"temperature": 0.1}
... )
>>>
>>> # Alternatively, update settings after creation:
>>> settings = FactlySettings()
>>> settings.model.api_key = "sk-abc123"
>>> settings.inference.temperature = 0.1
classmethod from_cli(model: str | None = None, api_key: str | None = None, api_base: str | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, n_shots: int | None = None) FactlySettings[source]

Create settings by combining CLI arguments with environment variables.

CLI arguments take precedence over environment variables and defaults. Only non-None CLI values will override settings from the environment.

Parameters:
  • model – Model name (e.g., “gpt-4o”)

  • api_key – API key for the model provider

  • api_base – Base URL for the API

  • temperature – Sampling temperature

  • top_p – Nucleus sampling parameter

  • max_tokens – Maximum tokens to generate

  • n_shots – Number of examples for few-shot learning

Returns:

Combined settings with proper priority

Return type:

FactlySettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class factly.settings.InferenceSettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, temperature: Annotated[float, Ge(ge=0.0), Le(le=2.0)] = 0.0, top_p: Annotated[float, Gt(gt=0.0), Le(le=1.0)] = 1.0, max_tokens: Annotated[int, Gt(gt=0)] = 256, n_shots: Annotated[int, Ge(ge=0)] = 0)[source]

Inference-time parameters for LLM decoding, following MMLU best practices.

temperature

Sampling temperature. Default set to 0.0 to ensure deterministic, reproducible outputs by disabling sampling randomness.

Type:

float

top_p

Nucleus sampling parameter. Controls how much of the probability mass the model is allowed to sample from. Default set to 1.0 to disable nucleus sampling, guaranteeing the model always selects the most probable token.

Type:

float

max_tokens

Maximum tokens to generate. Default set to 256 to allow sufficient space for model reasoning. For standard MMLU, you typically want just 1 token (A/B/C/D answers), but setting max_tokens: 1 will break benchmarks if your prompts expect structured outputs (e.g., JSON) or encourage reasoning before answering. With higher max_tokens, you may need to post-process results to extract final answers.

Type:

int

n_shots

Number of examples for few-shot learning. Default set to 0 for zero-shot evaluation. Increasing this value provides more demonstration examples in prompts to help the model understand the task format.

Type:

int

Note

When using n_shots > 0, consider setting max_tokens > 1 to allow the model to follow the reasoning patterns demonstrated in few-shot examples. Setting max_tokens=1 with n_shots > 0 may cause the model to ignore the reasoning pattern in examples and only output a token.

classmethod create(**kwargs) InferenceSettings[source]

Create an InferenceSettings instance with optional overrides.

Parameters:

**kwargs – Override default settings values.

Returns:

A settings instance.

Return type:

InferenceSettings

classmethod for_mmlu(n_shots: int = 0) InferenceSettings[source]

Create inference settings configured for traditional MMLU benchmarking.

Uses max_tokens=1 for single-letter answers, which is the canonical setup for standard MMLU evaluation where only a single token (A/B/C/D) is expected.

Returns:

MMLU-optimized settings (temperature=0, top_p=1, max_tokens=1).

Return type:

InferenceSettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class factly.settings.ModelSettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, api_base: str = 'https://api.openai.com/v1', model: str = 'gpt-4o', api_key: str | None = None)[source]

Configuration for the LLM API connection and model selection.

api_base

Base URL for the model API endpoint.

Type:

str

model

Model name or identifier (e.g., ‘gpt-4o’).

Type:

str

api_key

API key for authenticating with the model provider. Set to None for local models that don’t require authentication.

Type:

Optional[str]

classmethod create(**kwargs) ModelSettings[source]

Create a ModelSettings instance with optional overrides.

Parameters:

**kwargs – Override default settings values.

Returns:

A settings instance.

Return type:

ModelSettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'OPENAI_', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

The tasks module

MMLU task registry and management for Factly.

class factly.tasks.TaskCategory(*values)[source]

Categories for organizing MMLU tasks.

factly.tasks.get_all_tasks() list[MMLUTask][source]

Get all supported MMLU tasks.

Returns:

List of all MMLU tasks supported by Factly

factly.tasks.get_task_by_name(name: str) MMLUTask | None[source]

Get an MMLU task by its name (case-insensitive).

Parameters:

name – The name of the task, can be partial match

Returns:

The matching MMLU task or None if not found

factly.tasks.get_tasks_by_category(category: TaskCategory) list[MMLUTask][source]

Get all tasks belonging to a specific category.

Parameters:

category – The category to filter by

Returns:

List of MMLU tasks in the specified category

factly.tasks.list_available_tasks() str[source]

Generate a formatted string listing all available tasks.

Returns:

Formatted string with all available tasks grouped by category

factly.tasks.resolve_tasks(task_names: list[str]) list[MMLUTask][source]

Resolve a list of task names to actual MMLU tasks.

Parameters:

task_names – List of task names provided by the user

Returns:

List of resolved MMLU tasks

Raises:

ValueError – If any task name cannot be resolved