Overview
- Glossary
- LLM Settings
- Prompt Engineering: process of carefully designing and optimizing instructions (prompts) to elicit the best possible output from generative AI models, especially Large Language Models (LLMs). By providing clear, specific, and well-structured prompts, you can guide the AI to generate relevant, accurate, and high-quality responses
- Prompt: input you provide to a generative AI model to request a specific output. It can be a simple question, a set of instructions, or even a creative writing example
- Large Language Model (LLM): AI model designed to understand and generate human-like text. LLMs are trained on vast amounts of data and can perform tasks like translation, summarization, and even creative writing
- Prompt Template: a pre-defined structure or format for a prompt that can be customized with specific details or variables to generate dynamic prompts (a minimal sketch follows this glossary)
- Prompt Tuning: process of adapting pre-trained LLMs to specific tasks or domains by optimizing the prompt itself, rather than updating model weights as in traditional fine-tuning
- Prompt Injection: a security vulnerability where an attacker manipulates the input prompt to influence the AI model's behavior in unintended ways, potentially leading to unauthorized actions or disclosures
- Prompt Leakage: situation where sensitive information from the prompt is inadvertently included in the generated output, posing privacy or security risks
- Prompt Bias: tendency of an AI model to generate responses that reflect the biases present in its training data, leading to unfair or inaccurate outcomes
- Prompt Hallucination: when an AI model generates information that is not supported by the input prompt or its training data, leading to false or misleading outputs
- Prompt Testing: process of evaluating and validating prompts to ensure they produce the desired output, meet quality standards, and comply with ethical and regulatory requirements
- Prompt Optimization: continuous process of refining prompts to improve their performance, based on feedback, testing results, and changes in the AI model or its training data
- Context Window: max number of tokens the model can process at once, including input and output. Often a model-specific architectural limit
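
A prompt template is easiest to see in code: a reusable string with named placeholders that are filled in per request. The sketch below is a minimal, library-free illustration in Python; the template text and the field names (product, audience) are hypothetical examples, not part of any particular framework.

```python
# Minimal prompt-template sketch: a fixed structure with placeholders
# that are filled in with request-specific values. Field names are
# hypothetical examples.
PROMPT_TEMPLATE = (
    "You are a marketing copywriter.\n"
    "Write a short product description for {product}, aimed at {audience}. "
    "Keep it under 50 words."
)

def build_prompt(product: str, audience: str) -> str:
    """Fill the template's variables to produce a concrete prompt."""
    return PROMPT_TEMPLATE.format(product=product, audience=audience)

if __name__ == "__main__":
    print(build_prompt(product="a solar-powered lantern",
                       audience="weekend campers"))
```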
Category | Setting Parameter | Description | Low Value: Effect / Use Cases | High Value: Effect / Use Cases |
---|---|---|---|---|
Sampling | Temperature | controls the randomness or "creativity" of the output. Higher values lead to more diverse and imaginative responses, while lower values make the output more deterministic and focused | factual Q&A, summarization | story generation, poetry, brainstorming |
Sampling | Top-P (Nucleus Sampling) | selects tokens from the smallest possible set whose cumulative probability exceeds the threshold p | precise, focused answers | varied and imaginative text |
Sampling | Top-K Sampling | limits token selection to the K most probable tokens at each step | restricts choices to the few most likely tokens, producing focused, predictable output | expands token options for greater diversity and creativity, but may include less relevant choices |
Advanced Sampling | Logit Bias | allows you to modify the probability of specific tokens appearing or not appearing in the generated output. You can increase or decrease the likelihood of certain words | reduces the likelihood of tokens with negative bias, prompting the model to avoid specific words | increases the likelihood of tokens with positive bias, encouraging the model to include specific words or phrases |
Output Control | Max Length / Max Tokens | sets the maximum number of tokens the model will generate in its response. This includes both the input prompt and the generated output in some APIs | summarization, quick answers: concise, cost-effective responses, cutting off if necessary | essay generation, code generation, detailed explanations: more detailed responses, but manage to avoid irrelevance and high costs |
Output Control | Stop Sequences | string or list of strings that, when encountered in the generated output, will stop the model from generating further tokens | stops generating text at specified sequences, ensuring structured outputs and preventing run-ons | continues generating until reaching max tokens or an end-of-text token |
Output Control | N (Number of Completions) | specifies how many independent completions (responses) the model should generate for a single prompt | produces one response, typical for direct answers | creates several distinct responses for selection or variation, potentially increasing cost |
Repetition Control | Frequency Penalty | applies a penalty to new tokens based on how many times that token has already appeared in the text (prompt + generated response) | allows repetition with less penalty, increasing the likelihood of repeated words or phrases | imposes a higher penalty on repetition, promoting new vocabulary and discouraging repeated tokens |
Repetition Control | Presence Penalty | imposes a uniform penalty on new tokens that have appeared in the text at least once, regardless of their frequency | reduces penalties on previously mentioned tokens to maintain focus on a specific topic | increases penalties on previously used tokens to encourage diverse and distinct ideas |
Reproducibility | Seed | setting a seed makes the model's output deterministic (or near-deterministic) for a given set of parameters | fixed seed: guarantees consistent results for repeated calls with the same prompt and settings, aiding debugging and reproducibility | no seed (or a changing seed): each call with the same prompt and settings may yield a different output, while still adhering to other parameters |
Input Processing | Context Window (Max Context Length) | maximum number of tokens (input prompt + generated output) that the model can process and consider at one time. This is often a model-specific architectural limit | short prompts limit the model's memory of prior conversation, causing context loss in longer interactions | long conversations and large document analysis allow the model to maintain context, enhancing coherence and relevance in extended interactions |
Model Selection | Model Name/ID | specifies the particular LLM variant or version to be used. Different models have varying capabilities, sizes, and training data | smaller models may produce lower quality, less nuanced responses and have limited capabilities | larger models generally provide higher quality, more nuanced responses, but may incur higher costs and slower inference |
Generation Strategy | Decoding Type | refers to the algorithm used to select the next token. Common types include greedy decoding, beam search, and sampling (which involves temperature, top-p, top-k) | "Greedy" decoding yields deterministic but potentially less creative output by always choosing the highest-probability token | "Sampling" adds variability, while "beam search" explores multiple candidate sequences to identify more globally optimal outputs |
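
Most hosted LLM APIs expose the sampling, output-control, repetition, and reproducibility settings above as keyword arguments on a single request. The sketch below assumes the OpenAI Python SDK's Chat Completions endpoint purely as an illustration; parameter names and availability vary across providers, and the model name and values shown are placeholders, not recommendations.

```python
# Sketch: passing the table's settings to a hosted LLM API.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY
# environment variable; other providers expose similar but not identical knobs.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",      # Model Name/ID (placeholder choice)
    messages=[
        {"role": "user", "content": "Summarize the benefits of unit testing."}
    ],
    temperature=0.2,          # Sampling: low value for factual, focused output
    top_p=0.9,                # Nucleus sampling cumulative-probability threshold
    max_tokens=150,           # Output Control: cap on generated tokens
    stop=["\n\n"],            # Stop Sequences: halt generation at a blank line
    n=1,                      # Number of Completions per prompt
    frequency_penalty=0.3,    # Repetition Control: penalize frequently repeated tokens
    presence_penalty=0.0,     # Repetition Control: uniform penalty once a token appears
    seed=42,                  # Reproducibility: best-effort deterministic sampling
)

print(response.choices[0].message.content)
```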
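
For the Decoding Type row, the difference between greedy decoding, sampling, and beam search is easiest to see with a local model. The sketch below assumes the Hugging Face transformers library with gpt2 as a small stand-in model; it illustrates the three strategies rather than a tuned configuration.

```python
# Sketch: greedy decoding vs. sampling vs. beam search with Hugging Face
# transformers, using gpt2 only as a small stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The key idea of prompt engineering is", return_tensors="pt")

# Greedy: always pick the highest-probability token (deterministic, can be bland).
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Sampling: temperature / top-p / top-k introduce controlled randomness.
sampled = model.generate(
    **inputs, max_new_tokens=30, do_sample=True,
    temperature=0.8, top_p=0.9, top_k=50,
)

# Beam search: keep several candidate sequences and return the best-scoring one.
beam = model.generate(**inputs, max_new_tokens=30, num_beams=4, do_sample=False)

for name, output in [("greedy", greedy), ("sampling", sampled), ("beam search", beam)]:
    print(f"{name}: {tokenizer.decode(output[0], skip_special_tokens=True)}")
```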