API Reference
The factly module
CLI tool to evaluate ChatGPT factuality on MMLU benchmark.
The __main__ module
- factly.__main__.init() None[source]
Run factly.cli.main() when current file is executed by an interpreter.
This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.
The
sys.exit()function is called with the return value offactly.cli.main(), following standard UNIX program conventions for exit codes.
The factly.llms module
- class factly.llms.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]
Factly GPT model.
- __init__(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]
Initialize the Factly GPT model.
- Parameters:
model – Model identifier, can be “<provider>/<model>” for LiteLLM or “<model>” for direct provider models
system_prompt – System prompt to use for generating responses
prompt_name – Display name for this model configuration in reports
temperature – Sampling temperature between 0.0 and 2.0
top_p – Nucleus sampling parameter between 0.0 and 1.0
max_tokens – Maximum number of tokens to generate
- class factly.llms.FactlyModelMixin[source]
Mixin providing common functionality for Factly LLM models.
This mixin standardizes common operations across different model implementations such as message formatting and model name handling.
- __weakref__
list of weak references to the object
- create_messages(prompt: str)[source]
Format prompt into chat completion messages.
- Parameters:
prompt – User input prompt
- Returns:
List of message objects with appropriate roles
The factly.llms.base_model module
Common mixin for Factly LLM model implementations.
- class factly.llms.base_model.FactlyModelMixin[source]
Mixin providing common functionality for Factly LLM models.
This mixin standardizes common operations across different model implementations such as message formatting and model name handling.
- create_messages(prompt: str)[source]
Format prompt into chat completion messages.
- Parameters:
prompt – User input prompt
- Returns:
List of message objects with appropriate roles
The factly.llms.openai_model module
- class factly.llms.openai_model.FactlyGptModel(model: str, system_prompt: str, prompt_name: str, temperature: float = 0.0, top_p: float = 1.0, max_tokens: int = 1, *args, **kwargs)[source]
Factly GPT model.
The cli module
Factly CLI entrypoint.
- class factly.cli.RichGroup(name: str | None = None, commands: MutableMapping[str, Command] | Sequence[Command] | None = None, invoke_without_command: bool = False, no_args_is_help: bool | None = None, subcommand_metavar: str | None = None, chain: bool = False, result_callback: Callable[[...], Any] | None = None, **kwargs: Any)[source]
Custom Click group that displays a banner before the help text.
The benchmarks module
- class factly.benchmarks.MMLUBenchmark(tasks: list[MMLUTask] | None = None, n_shots: int = 0, n_problems_per_task: int | None = None, **kwargs)[source]
- async a_evaluate(model: FactlyGptModel, workers: int | None = None) float[source]
Evaluate a model on the MMLU benchmark with progress tracking.
Overrides the base MMLU evaluate method to provide a cleaner evaluation process with parallel question processing for better performance.
- Parameters:
model – The model to evaluate
workers – Number of concurrent question evaluations (default: auto-determined)
- Returns:
The overall accuracy score
- factly.benchmarks.evaluate(instructions: Path, settings: FactlySettings, tasks: list[MMLUTask] | None = None, workers: int | None = None, plot: bool = False, plot_path: Path | None = None)[source]
Evaluate models with different prompts on the MMLU benchmark.
- Parameters:
instructions – Path to YAML file with system instructions
settings – FactlySettings object containing model and inference settings
tasks – List of MMLU tasks to evaluate (defaults to CS and Astronomy)
workers – Number of concurrent workers for model evaluations (default: auto-determined based on system resources)
plot – Whether to generate a plot of the results (default: False)
plot_path – Path to save the plot (default: ./outputs/factuality-<model>-t<count>.png)
The plots module
Plotting utilities for Factly benchmarks.
Add a metadata footer to the plot with date, model, and tasks information.
- Parameters:
fig – The matplotlib figure to add footer to
model_name – Name of the model used for evaluation
tasks – List of task names used in the evaluation
- factly.plots.generate_factuality_comparison_plot(results: list[tuple[float, str]], model_name: str, output_path: Path | None = None, tasks: list[str] | None = None) Path[source]
Generate a bar chart comparing factuality scores of different prompts.
- Parameters:
results – List of tuples containing (score, prompt_name)
model_name – Name of the LLM model used for the benchmark
output_path – Path to save the plot (default: creates outputs dir in cwd)
tasks – List of MMLU task names used in the benchmark
- Returns:
Path to the saved plot file
The resources module
The settings module
Settings module for Factly CLI.
Defines configuration models for API, inference, and overall application settings.
- class factly.settings.FactlySettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, model: ~factly.settings.ModelSettings = <factory>, inference: ~factly.settings.InferenceSettings = <factory>)[source]
Aggregated settings for the Factly CLI, including model and inference configuration.
- model
Model API and authentication settings.
- Type:
- inference
Inference-time decoding parameters.
- Type:
- classmethod create(**kwargs) FactlySettings[source]
Create FactlySettings with optional overrides.
This factory method handles nested configuration with dictionaries.
- Parameters:
**kwargs – Configuration overrides including nested dictionaries for model and inference settings.
- Returns:
A settings instance.
- Return type:
Example
>>> # Create with API key and custom temperature >>> settings = FactlySettings.create( ... model={"api_key": "sk-abc123", "model": "gpt-4o"}, ... inference={"temperature": 0.1} ... ) >>> >>> # Alternatively, update settings after creation: >>> settings = FactlySettings() >>> settings.model.api_key = "sk-abc123" >>> settings.inference.temperature = 0.1
- classmethod from_cli(model: str | None = None, api_key: str | None = None, api_base: str | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, n_shots: int | None = None) FactlySettings[source]
Create settings by combining CLI arguments with environment variables.
CLI arguments take precedence over environment variables and defaults. Only non-None CLI values will override settings from the environment.
- Parameters:
model – Model name (e.g., “gpt-4o”)
api_key – API key for the model provider
api_base – Base URL for the API
temperature – Sampling temperature
top_p – Nucleus sampling parameter
max_tokens – Maximum tokens to generate
n_shots – Number of examples for few-shot learning
- Returns:
Combined settings with proper priority
- Return type:
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class factly.settings.InferenceSettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, temperature: Annotated[float, Ge(ge=0.0), Le(le=2.0)] = 0.0, top_p: Annotated[float, Gt(gt=0.0), Le(le=1.0)] = 1.0, max_tokens: Annotated[int, Gt(gt=0)] = 256, n_shots: Annotated[int, Ge(ge=0)] = 0)[source]
Inference-time parameters for LLM decoding, following MMLU best practices.
- temperature
Sampling temperature. Default set to 0.0 to ensure deterministic, reproducible outputs by disabling sampling randomness.
- Type:
- top_p
Nucleus sampling parameter. Controls how much of the probability mass the model is allowed to sample from. Default set to 1.0 to disable nucleus sampling, guaranteeing the model always selects the most probable token.
- Type:
- max_tokens
Maximum tokens to generate. Default set to 256 to allow sufficient space for model reasoning. For standard MMLU, you typically want just 1 token (A/B/C/D answers), but setting max_tokens: 1 will break benchmarks if your prompts expect structured outputs (e.g., JSON) or encourage reasoning before answering. With higher max_tokens, you may need to post-process results to extract final answers.
- Type:
- n_shots
Number of examples for few-shot learning. Default set to 0 for zero-shot evaluation. Increasing this value provides more demonstration examples in prompts to help the model understand the task format.
- Type:
Note
When using n_shots > 0, consider setting max_tokens > 1 to allow the model to follow the reasoning patterns demonstrated in few-shot examples. Setting max_tokens=1 with n_shots > 0 may cause the model to ignore the reasoning pattern in examples and only output a token.
- classmethod create(**kwargs) InferenceSettings[source]
Create an InferenceSettings instance with optional overrides.
- Parameters:
**kwargs – Override default settings values.
- Returns:
A settings instance.
- Return type:
- classmethod for_mmlu(n_shots: int = 0) InferenceSettings[source]
Create inference settings configured for traditional MMLU benchmarking.
Uses max_tokens=1 for single-letter answers, which is the canonical setup for standard MMLU evaluation where only a single token (A/B/C/D) is expected.
- Returns:
MMLU-optimized settings (temperature=0, top_p=1, max_tokens=1).
- Return type:
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': None, 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': '', 'extra': 'forbid', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class factly.settings.ModelSettings(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, *, api_base: str = 'https://api.openai.com/v1', model: str = 'gpt-4o', api_key: str | None = None)[source]
Configuration for the LLM API connection and model selection.
- api_key
API key for authenticating with the model provider. Set to None for local models that don’t require authentication.
- Type:
Optional[str]
- classmethod create(**kwargs) ModelSettings[source]
Create a ModelSettings instance with optional overrides.
- Parameters:
**kwargs – Override default settings values.
- Returns:
A settings instance.
- Return type:
- model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'OPENAI_', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
The tasks module
MMLU task registry and management for Factly.
- factly.tasks.get_all_tasks() list[MMLUTask][source]
Get all supported MMLU tasks.
- Returns:
List of all MMLU tasks supported by Factly
- factly.tasks.get_task_by_name(name: str) MMLUTask | None[source]
Get an MMLU task by its name (case-insensitive).
- Parameters:
name – The name of the task, can be partial match
- Returns:
The matching MMLU task or None if not found
- factly.tasks.get_tasks_by_category(category: TaskCategory) list[MMLUTask][source]
Get all tasks belonging to a specific category.
- Parameters:
category – The category to filter by
- Returns:
List of MMLU tasks in the specified category