Detailed Explanation of CLIP Score
CLIP Score is a specialized metric that measures how faithfully a generated image adheres to its text prompt, and together with FID, it is one of the most commonly used tools in evaluating image generation models.
Calculation principle: The CLIP model embeds both text and images into the same vector space. The CLIP text embedding is calculated for a given prompt, and the CLIP image embedding is calculated for the generated image. The cosine similarity between the two vectors is reported as the CLIP score. Cosine similarity can in principle range from -1 to 1, but negative values are usually clipped to zero, so the score is reported between 0 and 1 (equivalently, as an angle between 0° and 90°); some implementations also rescale it, for example to a 0-100 range. The higher the value, the semantically closer the prompt and image are.
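The calculation can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the text and image embeddings have already been produced by a CLIP model and are passed in as plain lists of floats. The function names, the clipping of negative values to zero, and the optional `scale` factor are assumptions made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(text_emb: Sequence[float],
               image_emb: Sequence[float],
               scale: float = 1.0) -> float:
    """Cosine similarity clipped at zero, optionally rescaled (e.g. scale=100)."""
    return scale * max(cosine_similarity(text_emb, image_emb), 0.0)


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
text_emb = [1.0, 0.0, 0.0]
aligned_image = [2.0, 0.0, 0.0]    # same direction as the text embedding
unrelated_image = [0.0, 3.0, 0.0]  # orthogonal to the text embedding
opposed_image = [-1.0, 0.0, 0.0]   # opposite direction, clipped to zero

print(clip_score(text_emb, aligned_image))    # 1.0
print(clip_score(text_emb, unrelated_image))  # 0.0
print(clip_score(text_emb, opposed_image))    # 0.0
```

Only the direction of the vectors matters, not their length, which is why the aligned image scores 1.0 even though its embedding is twice as long as the text embedding.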
Why does it matter? FID measures quality and diversity but cannot measure whether the prompt is correctly applied. CLIP score fills this gap. Because a lower FID is better, a model that produces visually perfect but prompt-irrelevant images shows up as a good (low) FID combined with a low CLIP score.
Practical use cases:
1. Model comparison: Images that different image generation models produce for the same benchmark prompt set are compared using CLIP score. Benchmark sets like PartiPrompts and DrawBench are used for this purpose.
2. Automatic image selection: When many images are generated from a prompt and CLIP scores are calculated, the image with the highest score can be automatically selected. Some Stable Diffusion interfaces offer this feature in 'best of N' mode.
3. Fine-tuning evaluation: After fine-tuning a model to a specific style or concept, CLIP score is used to track how prompt alignment changes.
4. CFG scale optimization: CLIP score can be tracked across different CFG (classifier-free guidance) values. Generally, very low CFG results in a low CLIP score (low prompt adherence), while very high CFG can create artifacts. The optimal CFG value is typically near the CLIP score maximum.
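Both 'best of N' selection and a CFG sweep come down to picking the entry with the highest CLIP score. The sketch below assumes the scores have already been computed. The file names, CFG values, and score values are made up for illustration, and `best_of` and `rank_by_score` are hypothetical helper names, not part of any particular tool.

```python
from typing import Dict, Hashable, List, Tuple


def rank_by_score(scores: Dict[Hashable, float]) -> List[Tuple[Hashable, float]]:
    """Return (key, score) pairs sorted from highest to lowest CLIP score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def best_of(scores: Dict[Hashable, float]) -> Hashable:
    """Return the key (image name, CFG value, ...) with the highest score."""
    if not scores:
        raise ValueError("no scores to compare")
    return max(scores, key=scores.get)


# Illustrative, made-up CLIP scores for three images from one prompt.
candidate_scores = {"img_0.png": 0.24, "img_1.png": 0.31, "img_2.png": 0.28}
print(best_of(candidate_scores))  # img_1.png

# Illustrative, made-up average CLIP scores at different CFG values.
cfg_sweep = {1.5: 0.22, 4.0: 0.29, 7.5: 0.32, 12.0: 0.31, 20.0: 0.30}
print(best_of(cfg_sweep))         # 7.5
print(rank_by_score(cfg_sweep)[:2])
```

Since CLIP score does not capture the artifacts that very high CFG can introduce, the top-ranked CFG value is best treated as a starting point and checked visually.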
Limitations of CLIP score: CLIP's training data is primarily English, so CLIP scores calculated for Turkish prompts may be less reliable than for English prompts. Also, CLIP's text encoder accepts at most 77 tokens; anything beyond that limit is truncated, so the end of a very long and complex prompt may not be evaluated at all.
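To see what the 77-token limit means in practice, the following sketch splits a prompt into the part that fits and the part that would be dropped. It is only a rough illustration: whitespace splitting is a crude stand-in for CLIP's real subword tokenizer, which usually produces more tokens than there are words, and the sketch assumes two of the 77 positions are reserved for the start-of-text and end-of-text markers. `split_at_limit` is a hypothetical helper name.

```python
from typing import List, Tuple

CONTEXT_LENGTH = 77  # CLIP's text context length
SPECIAL_TOKENS = 2   # assumed: start-of-text and end-of-text markers


def split_at_limit(tokens: List[str],
                   context_length: int = CONTEXT_LENGTH) -> Tuple[List[str], List[str]]:
    """Split tokens into the part that fits and the part that is dropped."""
    budget = context_length - SPECIAL_TOKENS
    return tokens[:budget], tokens[budget:]


# Whitespace splitting is only a stand-in for the real subword tokenizer.
prompt = " ".join(f"word{i}" for i in range(90))
kept, dropped = split_at_limit(prompt.split())

print(len(kept), len(dropped))  # 75 15
print(dropped[0])               # word75, the first token that is ignored
```

Any detail that lands in the dropped part has no influence on the score, so the most important parts of a long prompt are better placed near the beginning.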
Derivatives like PICA and Aesthetic CLIP are versions of CLIP score combined with aesthetic evaluation. These hybrid metrics simultaneously measure both prompt alignment and visual quality.
CLIP score is actively used in the background evaluation processes of tools on tasarım.ai. This metric is the primary reference point especially in model quality reports and comparative evaluations.
Tip for beginners: When you've generated many images and aren't sure which to choose, you can use CLIP score to answer the question "which image best reflects my prompt?" Some tools offer a 'best match' option that does this calculation for you.