===== Usage ===== This document provides detailed instructions for using the ``factly`` command-line tool. Command Line Interface ====================== The primary entrypoint for Factly is the ``factly`` command-line interface, which provides tools to evaluate the factuality of LLMs on the MMLU benchmark. Basic Usage ----------- .. code-block:: bash # Run factuality evaluation with default settings factly mmlu # Run evaluation and generate plots factly mmlu --plot # Get help on all available options factly mmlu --help # List available MMLU tasks factly list-tasks Command Structure ----------------- Factly provides the following commands: .. code-block:: text factly [OPTIONS] COMMAND [ARGS]... Main Commands: * ``mmlu``: Run factuality evaluation on MMLU benchmark * ``list-tasks``: List all available MMLU tasks Common Global Options: * ``--help``: Show help message and exit * ``--version``: Show version and exit Command Line Options for ``mmlu`` ------------------------------------- --instructions file Path to YAML file with system instruction variants. Default: ``./instructions.yaml`` -m name, --model name Model name to use for evaluation. Default: ``OPENAI_MODEL`` from environment variables, ``.env`` file, or ``gpt-4o`` as a fallback. and long descriptions -u host, --url host Model API URL to use for evaluation. Default: ``OPENAI_API_BASE`` from environment variables, ``.env`` file, or unset. -a key, --api-key key Model API key to use for evaluation. Default: ``OPENAI_API_KEY`` from environment variables, ``.env`` file, or unset. -t float, --temperature float Controls randomness in token selection. For MMLU benchmarking, ``0.0`` (deterministic) is the canonical setting to ensure reproducible results and measure raw model knowledge. Higher values introduce randomness, which isn't suitable for standard benchmarking. Default: ``0.0``. --top-p float, --top-p float Nucleus sampling parameter that controls how much of the probability mass the model samples from. For benchmarking, keep at ``1.0`` to disable nucleus sampling. Only modify when exploring controlled randomness in outputs. Default: ``1.0``. --max-tokens int, --max-tokens int Maximum tokens per response. For standard MMLU with single-letter answers (A/B/C/D), use ``1``. For structured outputs or when using system prompts that encourage reasoning, use higher values (``256`` or more) and post-process to extract final answers. Default: ``256``. --n-shots int, --n-shots int Number of examples for few-shot learning. Default set to 0 for zero-shot evaluation. Increasing this value provides more demonstration examples in prompts to help the model understand the task format. Default: ``0``. --tasks name, --tasks name MMLU task categories to evaluate (can be repeated). Default: All tasks. -j int, --workers int Maximum number of concurrent question evaluations. Default: Auto-detected based on system resources. --plot Generate visualization plots. Default: ``False``. --plot-path path Path to save the plot. Default: ``./outputs/factuality--t.png``. --verbose Show detailed progress information during evaluation. Default: ``False``. --help Show help message and exit. .. note:: * For canonical MMLU evaluation (comparable to published benchmarks), use ``--n-shots 0`` with ``--max-tokens 1`` * When using few-shot learning (``--n-shots >0``), consider setting ``--max-tokens`` to a higher value (≥256) to allow the model to follow reasoning patterns from the examples * If you must use ``--n-shots >0`` with ``--max-tokens 1``, ensure your few-shot examples only demonstrate single-token answers without reasoning * For structured output (like JSON) or when using system prompts that encourage reasoning, always use ``--max-tokens >1`` regardless of the ``--n-shots`` value Command Line Options for ``list-tasks`` --------------------------------------- --help Show help message and exit. Advanced Usage ============== Task Selection -------------- You can select specific MMLU tasks to evaluate: .. code-block:: bash # Evaluate specific model on selected MMLU tasks factly mmlu --model gpt-4o --tasks mathematics --tasks high_school_us_history # Evaluate on STEM tasks only factly mmlu --tasks STEM # Evaluate on business-related tasks factly mmlu --tasks BUSINESS Few-Shot Learning ----------------- Configure the number of examples provided for few-shot learning: .. code-block:: bash # Zero-shot evaluation (default) factly mmlu --n-shots 0 # 3-shot evaluation factly mmlu --n-shots 3 # 5-shot evaluation factly mmlu --n-shots 5 Performance Optimization ------------------------ Factly uses asynchronous concurrent processing to maximize evaluation throughput. It evaluates multiple questions concurrently for each model, significantly reducing total evaluation time. You can control the concurrency level with the ``--workers`` parameter: .. code-block:: bash # Auto-determine optimal concurrency (default) factly mmlu --tasks STEM # Set concurrency level explicitly (process 20 questions in parallel) factly mmlu --tasks STEM --workers 20 The implementation uses ``asyncio`` and semaphores for controlled concurrency with automatic resource detection for optimal performance across different environments. System Instructions ------------------- Factly supports different system instructions for prompt engineering experiments: .. code-block:: bash # Use the default instruction from instructions.yaml in current directory factly mmlu # Use a custom instructions defined in ~/path/to/instructions.yaml file factly mmlu --instructions ~/path/to/instructions.yaml By default instructions should be defined in the ``instructions.yaml`` file in current directory. Each instruction should provide a different way to guide the model's behavior when responding to questions. Examples ======== Basic Evaluation ---------------- .. code-block:: bash # Run basic evaluation with default settings factly mmlu # Run evaluation and generate plots factly mmlu --plot # Run verbose evaluation with plots factly mmlu --verbose --plot Subject-Specific Evaluation --------------------------- .. code-block:: bash # Evaluate mathematics knowledge factly mmlu --tasks mathematics --n-shots 3 --plot # Evaluate humanities subjects factly mmlu --tasks high_school_european_history --tasks high_school_us_history --plot # Evaluate computer science knowledge factly mmlu --tasks computer_science --verbose --plot Customized Evaluation --------------------- .. code-block:: bash # Customize API settings and system instruction factly mmlu \ -m gpt-4o-mini \ -u https://your-proxy.example.com/v1 \ -a your_api_key_here \ --instructions ~/path/to/instructions.yaml # Customize model inference parameters factly mmlu \ --model gpt-4o \ --temperature 0.7 \ --top-p 0.95 \ --max-tokens 512 \ --tasks mathematics \ --plot Environment Variables ===================== Instead of specifying command-line arguments each time, you can set environment variables in the ``.env`` file: .. code-block:: bash # API Configuration OPENAI_API_KEY=your_api_key_here OPENAI_MODEL=gpt-4o OPENAI_API_BASE=your_api_base_url # Optional