Overview
- Glossary
- LLM Settings
- Prompt Engineering: process of carefully designing and optimizing instructions (prompts) to elicit the best possible output from generative AI models, especially Large Language Models (LLMs). By providing clear, specific, and well-structured prompts, you can guide the AI to generate relevant, accurate, and high-quality responses
- Prompt: input you provide to a generative AI model to request a specific output. It can be a simple question, a set of instructions, or even a creative writing example
- Large Language Model (LLM): AI model designed to understand and generate human-like text. LLMs are trained on vast amounts of data and can perform tasks like translation, summarization, and even creative writing
- Prompt Template: a pre-defined structure or format for a prompt that can be customized with specific details or variables to generate dynamic prompts (a minimal sketch follows this glossary)
- Prompt Tuning: process of adapting pre-trained LLMs to specific tasks or domains by optimizing the prompt itself, rather than updating model weights as in traditional fine-tuning
- Prompt Injection: a security vulnerability where an attacker manipulates the input prompt to influence the AI model's behavior in unintended ways, potentially leading to unauthorized actions or disclosures
- Prompt Leakage: situation where sensitive information from the prompt is inadvertently included in the generated output, posing privacy or security risks
- Prompt Bias: tendency of an AI model to generate responses that reflect the biases present in its training data, leading to unfair or inaccurate outcomes
- Prompt Hallucination: when an AI model generates information that is not supported by the input prompt or its training data, leading to false or misleading outputs
- Prompt Testing: process of evaluating and validating prompts to ensure they produce the desired output, meet quality standards, and comply with ethical and regulatory requirements
- Prompt Optimization: continuous process of refining prompts to improve their performance, based on feedback, testing results, and changes in the AI model or its training data
- Context Window: max number of tokens the model can process at once, including input and output. Often a model-specific architectural limit
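
A prompt template is easiest to see in code: a reusable string with named placeholders that are filled in per request. The sketch below is a minimal, library-free illustration in Python; the template text and the field names (product, audience) are hypothetical examples, not part of any particular framework.

```python
# Minimal prompt-template sketch: a fixed structure with placeholders
# that are filled in with request-specific values. Field names are
# hypothetical examples.
PROMPT_TEMPLATE = (
    "You are a marketing copywriter.\n"
    "Write a short product description for {product}, aimed at {audience}. "
    "Keep it under 50 words."
)

def build_prompt(product: str, audience: str) -> str:
    """Fill the template's variables to produce a concrete prompt."""
    return PROMPT_TEMPLATE.format(product=product, audience=audience)

if __name__ == "__main__":
    print(build_prompt(product="a solar-powered lantern",
                       audience="weekend campers"))
```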
Category | Setting Parameter | Description | Low Value: Effect / Use Cases | High Value: Effect / Use Cases |
---|---|---|---|---|
Sampling | Temperature | controls the randomness or "creativity" of the output. Higher values lead to more diverse and imaginative responses, while lower values make the output more deterministic and focused | factual Q&A, summarization | story generation, poetry, brainstorming |
Sampling | Top-P (Nucleus Sampling) | selects tokens from the smallest possible set whose cumulative probability exceeds the threshold p | precise, focused answers | varied and imaginative text |
Sampling | Top-K Sampling | limits token selection to the K most probable tokens at each step | restricts choices to the few most likely tokens, producing focused, predictable output | expands token options for greater diversity and creativity, but may include less relevant choices |
Advanced Sampling | Logit Bias | allows you to modify the probability of specific tokens appearing or not appearing in the generated output. You can increase or decrease the likelihood of certain words | reduces the likelihood of tokens with negative bias, prompting the model to avoid specific words | increases the likelihood of tokens with positive bias, encouraging the model to include specific words or phrases |
Output Control | Max Length / Max Tokens | sets the maximum number of tokens the model will generate in its response. This includes both the input prompt and the generated output in some APIs | summarization, quick answers: concise, cost-effective responses, cutting off if necessary | essay generation, code generation, detailed explanations: more detailed responses, but manage to avoid irrelevance and high costs |
Output Control | Stop Sequences | string or list of strings that, when encountered in the generated output, will stop the model from generating further tokens | stops generating text at specified sequences, ensuring structured outputs and preventing run-ons | continues generating until reaching max tokens or an end-of-text token |
Output Control | N (Number of Completions) | specifies how many independent completions (responses) the model should generate for a single prompt | produces one response, typical for direct answers | creates several distinct responses for selection or variation, potentially increasing cost |
Repetition Control | Frequency Penalty | applies a penalty to new tokens based on how many times that token has already appeared in the text (prompt + generated response) | allows repetition with less penalty, increasing the likelihood of repeated words or phrases | imposes a higher penalty on repetition, promoting new vocabulary and discouraging repeated tokens |
Repetition Control | Presence Penalty | imposes a uniform penalty on new tokens that have appeared in the text at least once, regardless of their frequency | reduces penalties on previously mentioned tokens to maintain focus on a specific topic | increases penalties on previously used tokens to encourage diverse and distinct ideas |
Reproducibility | Seed | setting a seed makes the model's output deterministic (or near-deterministic) for a given set of parameters | fixed seed: guarantees consistent results for repeated calls with the same prompt and settings, aiding debugging and reproducibility | no seed (or a changing seed): each call with the same prompt and settings may yield a different output, while still adhering to other parameters |
Input Processing | Context Window (Max Context Length) | maximum number of tokens (input prompt + generated output) that the model can process and consider at one time. This is often a model-specific architectural limit | short prompts limit the model's memory of prior conversation, causing context loss in longer interactions | long conversations and large document analysis allow the model to maintain context, enhancing coherence and relevance in extended interactions |
Model Selection | Model Name/ID | specifies the particular LLM variant or version to be used. Different models have varying capabilities, sizes, and training data | smaller models may produce lower quality, less nuanced responses and have limited capabilities | larger models generally provide higher quality, more nuanced responses, but may incur higher costs and slower inference |
Generation Strategy | Decoding Type | refers to the algorithm used to select the next token. Common types include greedy decoding, beam search, and sampling (which involves temperature, top-p, top-k) | "Greedy" decoding yields deterministic but potentially less creative output by always choosing the highest-probability token | "Sampling" adds variability, while "beam search" explores multiple candidate sequences to identify more globally optimal outputs |
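
Most hosted LLM APIs expose the sampling, output-control, repetition, and reproducibility settings above as keyword arguments on a single request. The sketch below assumes the OpenAI Python SDK's Chat Completions endpoint purely as an illustration; parameter names and availability vary across providers, and the model name and values shown are placeholders, not recommendations.

```python
# Sketch: passing the table's settings to a hosted LLM API.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY
# environment variable; other providers expose similar but not identical knobs.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",      # Model Name/ID (placeholder choice)
    messages=[
        {"role": "user", "content": "Summarize the benefits of unit testing."}
    ],
    temperature=0.2,          # Sampling: low value for factual, focused output
    top_p=0.9,                # Nucleus sampling cumulative-probability threshold
    max_tokens=150,           # Output Control: cap on generated tokens
    stop=["\n\n"],            # Stop Sequences: halt generation at a blank line
    n=1,                      # Number of Completions per prompt
    frequency_penalty=0.3,    # Repetition Control: penalize frequently repeated tokens
    presence_penalty=0.0,     # Repetition Control: uniform penalty once a token appears
    seed=42,                  # Reproducibility: best-effort deterministic sampling
)

print(response.choices[0].message.content)
```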
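
For the Decoding Type row, the difference between greedy decoding, sampling, and beam search is easiest to see with a local model. The sketch below assumes the Hugging Face transformers library with gpt2 as a small stand-in model; it illustrates the three strategies rather than a tuned configuration.

```python
# Sketch: greedy decoding vs. sampling vs. beam search with Hugging Face
# transformers, using gpt2 only as a small stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The key idea of prompt engineering is", return_tensors="pt")

# Greedy: always pick the highest-probability token (deterministic, can be bland).
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Sampling: temperature / top-p / top-k introduce controlled randomness.
sampled = model.generate(
    **inputs, max_new_tokens=30, do_sample=True,
    temperature=0.8, top_p=0.9, top_k=50,
)

# Beam search: keep several candidate sequences and return the best-scoring one.
beam = model.generate(**inputs, max_new_tokens=30, num_beams=4, do_sample=False)

for name, output in [("greedy", greedy), ("sampling", sampled), ("beam search", beam)]:
    print(f"{name}: {tokenizer.decode(output[0], skip_special_tokens=True)}")
```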