API Reference

The `factly` module

CLI tool to evaluate ChatGPT factuality on MMLU benchmark.

The `main` module

factly.__main__.init() → None[source]

Run factly.cli.main() when current file is executed by an interpreter.

This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.

The sys.exit() function is called with the return value of factly.cli.main(), following standard UNIX program conventions for exit codes.

The `factly.llms` module

class factly.llms.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Factly GPT model.

__init__(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Initialize the Factly GPT model.

Parameters:

model – Model identifier, can be “<provider>/<model>” for LiteLLM or “<model>” for direct provider models
system_prompt – System prompt to use for generating responses
prompt_name – Display name for this model configuration in reports
temperature – Sampling temperature between 0.0 and 2.0
top_p – Nucleus sampling parameter between 0.0 and 1.0
max_tokens – Maximum number of tokens to generate

async ainvoke(prompt: str, schema: BaseModel | None = None) → str | dict | BaseModel[source]: Generate a response from the model asynchronously.

load_model(async_mode: bool = False) → OpenAI | AsyncOpenAI[source]

Load the OpenAI client in sync or async mode.

Parameters:: async_mode – Whether to load the async client
Returns:: OpenAI client instance

class factly.llms.FactlyModelMixin[source]

Mixin providing common functionality for Factly LLM models.

This mixin standardizes common operations across different model implementations such as message formatting and model name handling.

__weakref__: list of weak references to the object

create_messages(prompt: str)[source]

Format prompt into chat completion messages.

Parameters:: prompt – User input prompt
Returns:: List of message objects with appropriate roles

static get_actual_model_name(model_name: str) → str[source]

Extract base model name from provider-prefixed format.

Parameters:: model_name – Original model identifier, potentially in “<provider>/<model>” format
Returns:: The model name without provider prefix

get_display_model_name() → str[source]

Get model name for display in UI and reports.

Returns:: Model name suitable for display

The `factly.llms.base_model` module

Common mixin for Factly LLM model implementations.

class factly.llms.base_model.FactlyModelMixin[source]

Mixin providing common functionality for Factly LLM models.

This mixin standardizes common operations across different model implementations such as message formatting and model name handling.

create_messages(prompt: str)[source]

Format prompt into chat completion messages.

Parameters:: prompt – User input prompt
Returns:: List of message objects with appropriate roles

static get_actual_model_name(model_name: str) → str[source]

Extract base model name from provider-prefixed format.

Parameters:: model_name – Original model identifier, potentially in “<provider>/<model>” format
Returns:: The model name without provider prefix

get_display_model_name() → str[source]

Get model name for display in UI and reports.

Returns:: Model name suitable for display

The `factly.llms.openai_model` module

class factly.llms.openai_model.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]

Factly GPT model.

async ainvoke(prompt: str, schema: BaseModel | None = None) → str | dict | BaseModel[source]: Generate a response from the model asynchronously.

load_model(async_mode: bool = False) → OpenAI | AsyncOpenAI[source]

Load the OpenAI client in sync or async mode.

Parameters:: async_mode – Whether to load the async client
Returns:: OpenAI client instance

The `cli` module

Factly CLI entrypoint.

class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)[source]

Custom Click group that displays a banner before the help text.

format_help(ctx, formatter)[source]

Writes the help into the formatter if it exists.

This method is called by Click when the help text is requested.

factly.cli.get_copyright() → str[source]: Get copyright info.

factly.cli.get_version() → str[source]: Get version info.

The `benchmarks` module

class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, **kwargs)[source]

async a_evaluate(model: FactlyGptModel, workers: int | None = None) → float[source]

Evaluate a model on the MMLU benchmark with progress tracking.

Overrides the base MMLU evaluate method to provide a cleaner evaluation process with parallel question processing for better performance.

Parameters:

model – The model to evaluate
workers – Number of concurrent question evaluations (default: auto-determined)

Returns:

The overall accuracy score

set_concurrency(max_concurrent: int | None = None)[source]

Set the maximum number of concurrent question evaluations.

Parameters:: max_concurrent – Maximum number of concurrent question evaluations

factly.benchmarks.evaluate(instructions: Path, settings: FactlySettings, tasks: list[MMLUTask] | None = None, workers: int | None = None, plot: bool = False, plot_path: Path | None = None)[source]

Evaluate models with different prompts on the MMLU benchmark.

Parameters:

instructions – Path to YAML file with system instructions
settings – FactlySettings object containing model and inference settings
tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)
workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)
plot – Whether to generate a plot of the results (default: False)
plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)

factly.benchmarks.load_instructions(path: Path) → list[dict][source]: Load system instructions from a YAML file.

The `plots` module

Plotting utilities for Factly benchmarks.

factly.plots.add_metadata_footer(fig: Figure, model_name: str, tasks: list[str] | None = None) → None[source]

Add a metadata footer to the plot with date, model, and tasks information.

Parameters:

fig – The matplotlib figure to add footer to
model_name – Name of the model used for evaluation
tasks – List of task names used in the evaluation

factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) → Path[source]

Generate a bar chart comparing factuality scores of different prompts.

Parameters:

results – List of tuples containing (score, prompt_name)
model_name – Name of the LLM model used for the benchmark
output_path – Path to save the plot (default: creates outputs dir in cwd)
tasks – List of MMLU task names used in the benchmark

Returns:

Path to the saved plot file

The `resources` module

The `settings` module

Settings module for Factly CLI.

Defines configuration models for API, inference, and overall application settings.

class factly.settings.FactlySettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, model: ~factly.settings.ModelSettings = <factory>, inference: ~factly.settings.InferenceSettings = <factory>)[source]

Aggregated settings for the Factly CLI, including model and inference configuration.

model

Model API and authentication settings.

Type:: ModelSettings

inference

Inference-time decoding parameters.

Type:: InferenceSettings

classmethod create(**kwargs) → FactlySettings[source]

Create FactlySettings with optional overrides.

This factory method handles nested configuration with dictionaries.

Parameters:: **kwargs – Configuration overrides including nested dictionaries for model and inference settings.
Returns:: A settings instance.
Return type:: FactlySettings

Example

>>> # Create with API key and custom temperature
>>> settings = FactlySettings.create(
...     model={"api_key": "sk-abc123", "model": "gpt-4o"},
...     inference={"temperature": 0.1}
... )
>>>
>>> # Alternatively, update settings after creation:
>>> settings = FactlySettings()
>>> settings.model.api_key = "sk-abc123"
>>> settings.inference.temperature = 0.1

Create settings by combining CLI arguments with environment variables.

CLI arguments take precedence over environment variables and defaults. Only non-None CLI values will override settings from the environment.

Parameters:

model – Model name (e.g., “gpt-4o”)
api_key – API key for the model provider
api_base – Base URL for the API
temperature – Sampling temperature
top_p – Nucleus sampling parameter
max_tokens – Maximum tokens to generate
n_shots – Number of examples for few-shot learning

Returns:

Combined settings with proper priority

Return type:

FactlySettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class factly.settings.InferenceSettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, temperature: Annotated[float, Ge(ge=0.0), Le(le=2.0)] = 0.0, top_p: Annotated[float, Gt(gt=0.0), Le(le=1.0)] = 1.0, max_tokens: Annotated[int, Gt(gt=0)] = 256, n_shots: Annotated[int, Ge(ge=0)] = 0)[source]

Inference-time parameters for LLM decoding, following MMLU best practices.

temperature

Sampling temperature. Default set to 0.0 to ensure deterministic, reproducible outputs by disabling sampling randomness.

Type:: float

top_p

Nucleus sampling parameter. Controls how much of the probability mass the model is allowed to sample from. Default set to 1.0 to disable nucleus sampling, guaranteeing the model always selects the most probable token.

Type:: float

max_tokens

Maximum tokens to generate. Default set to 256 to allow sufficient space for model reasoning. For standard MMLU, you typically want just 1 token (A/B/C/D answers), but setting max_tokens: 1 will break benchmarks if your prompts expect structured outputs (e.g., JSON) or encourage reasoning before answering. With higher max_tokens, you may need to post-process results to extract final answers.

Type:: int

n_shots

Number of examples for few-shot learning. Default set to 0 for zero-shot evaluation. Increasing this value provides more demonstration examples in prompts to help the model understand the task format.

Type:: int

Note

When using n_shots > 0, consider setting max_tokens > 1 to allow the model to follow the reasoning patterns demonstrated in few-shot examples. Setting max_tokens=1 with n_shots > 0 may cause the model to ignore the reasoning pattern in examples and only output a token.

classmethod create(**kwargs) → InferenceSettings[source]

Create an InferenceSettings instance with optional overrides.

Parameters:: **kwargs – Override default settings values.
Returns:: A settings instance.
Return type:: InferenceSettings

classmethod for_mmlu(n_shots: int = 0) → InferenceSettings[source]

Create inference settings configured for traditional MMLU benchmarking.

Uses max_tokens=1 for single-letter answers, which is the canonical setup for standard MMLU evaluation where only a single token (A/B/C/D) is expected.

Returns:: MMLU-optimized settings (temperature=0, top_p=1, max_tokens=1).
Return type:: InferenceSettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

Configuration for the LLM API connection and model selection.

api_base

Base URL for the model API endpoint.

Type:: str

model

Model name or identifier (e.g., ‘gpt-4o’).

Type:: str

api_key

API key for authenticating with the model provider. Set to None for local models that don’t require authentication.

Type:: Optional[str]

classmethod create(**kwargs) → ModelSettings[source]

Create a ModelSettings instance with optional overrides.

Parameters:: **kwargs – Override default settings values.
Returns:: A settings instance.
Return type:: ModelSettings

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'OPENAI_', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

The `tasks` module

MMLU task registry and management for Factly.

class factly.tasks.TaskCategory(*values)[source]: Categories for organizing MMLU tasks.

factly.tasks.get_all_tasks() → list[MMLUTask][source]

Get all supported MMLU tasks.

Returns:: List of all MMLU tasks supported by Factly

factly.tasks.get_task_by_name(name: str) → MMLUTask | None[source]

Get an MMLU task by its name (case-insensitive).

Parameters:: name – The name of the task, can be partial match
Returns:: The matching MMLU task or None if not found

factly.tasks.get_tasks_by_category(category: TaskCategory) → list[MMLUTask][source]

Get all tasks belonging to a specific category.

Parameters:: category – The category to filter by
Returns:: List of MMLU tasks in the specified category

factly.tasks.list_available_tasks() → str[source]

Generate a formatted string listing all available tasks.

Returns:: Formatted string with all available tasks grouped by category

factly.tasks.resolve_tasks(task_names: list[str]) → list[MMLUTask][source]

Resolve a list of task names to actual MMLU tasks.

Parameters:: task_names – List of task names provided by the user
Returns:: List of resolved MMLU tasks
Raises:: ValueError – If any task name cannot be resolved

API Reference

The factly module

The __main__ module

The factly.llms module

The factly.llms.base_model module

The factly.llms.openai_model module

The cli module

The benchmarks module

The plots module

The resources module

The settings module

The tasks module

The `factly` module

The `main` module

The `factly.llms` module

The `factly.llms.base_model` module

The `factly.llms.openai_model` module

The `cli` module

The `benchmarks` module

The `plots` module

The `resources` module

The `settings` module

The `tasks` module