1. Introduction
1.1 The practical problem
Founders building early-stage products routinely use large language models to evaluate and rewrite their landing pages. The workflow is simple: paste the page content, ask for feedback. The output varies from a bullet list of stylistic suggestions to a partial rewrite with no stated criteria. There is no standard for what a high-quality copy audit looks like, no rubric for evaluating the auditor, and no mechanism for knowing whether following the recommendations improves conversion.
This is not a niche problem. Landing page copy is one of the highest-leverage variables in early customer acquisition. A 20% improvement in conversion rate on a landing page has the same effect on pipeline as a 20% increase in ad spend, at zero marginal cost. Yet the tools used to optimize that copy operate without an evaluation framework.
1.2 The measurement problem
LLM benchmarks have evolved across four generations. From MMLU and SuperGLUE through HELM and BIG-Bench to Humanity's Last Exam, and now toward domain-specific instruments like SalesLLM, AD-Bench, and HealthBench, a consistent pattern emerges: general benchmarks saturate quickly and fail to predict performance on specialized tasks. Writing quality benchmarks measure fluency, coherence, and human preference - none of which are proven proxies for conversion impact.
Three specific gaps characterize the current state. First, reproducibility: a copy audit produced by a generic LLM varies between runs because the model changes without a public changelog and the prompt is not versioned. Second, traceability: when a generic LLM flags a headline as weak, it provides no reference to a conversion research finding. Third, outcome validity: no existing benchmark measures whether following an LLM's copy recommendations produces a measurable improvement in cost per lead, signup rate, or any other conversion metric.
1.3 The contribution of this paper
We make three contributions. First, we map the existing landscape of LLM writing and marketing benchmarks against five properties we argue are necessary for a valid copy audit framework. Second, we propose CLEF, a framework defined by those five properties and grounded in a verified corpus of empirical conversion research. Third, we describe the conditions a CLEF-conformant implementation must satisfy and the protocol required to validate the framework at scale.
1.4 From CLEF to Kicker
Kicker operationalizes CLEF as a product workflow. The framework defines what a valid conversion-grounded landing page audit must evaluate; Kicker turns that framework into an executable process: landing page extraction, rule-based diagnosis, dimension-level scoring, rewrite generation, quality-gated recommendations, micro-budget traffic testing, and post-test interpretation.
This distinction matters. CLEF is the methodological layer. Kicker is the applied layer: it uses these principles to help early-stage founders move from subjective copy feedback to measurable demand signals such as signup rate, cost per lead, and audience-level traction.
2. State of the art
2.1 Four generations of LLM benchmarks
The history of LLM evaluation reflects a recurring dynamic: each generation of benchmarks is created to address the limitations of the previous one, and saturates as models improve faster than benchmark designers anticipated.
- Generation 1 (2019–2022) - General knowledge. MMLU, SuperGLUE, BIG-Bench. Frontier models now score above 90%, leaving these benchmarks unable to differentiate leading models.
- Generation 2 (2023–2024) - Advanced reasoning. MMLU-Pro, GPQA, MATH-500, HumanEval. Top models cluster around 90% by early 2026.
- Generation 3 (2025) - Anti-saturation. Humanity's Last Exam released by CAIS and Scale AI: 2,500 expert-level questions. Best 2025 models scored only 30–35%.
- Generation 4 (2025–2026) - Domain-specific and agentic. SalesLLM, AD-Bench, HealthBench. Direct responses to the failure of general benchmarks to predict domain-specific performance.
2.2 Existing benchmarks for writing and marketing copy
EQ-Bench Creative Writing evaluates narrative quality, emotional depth, and prose style. Claude Opus 4.7 leads with an Elo score of 2216. It measures literary quality, not conversion-specific effectiveness.
WritingBench (NeurIPS 2025) covers six writing domains including Advertising and Marketing. It evaluates 15 models across 1,239 queries using five instance-specific criteria dynamically generated by the evaluation model itself - precisely what CLEF argues against.
LMArena Text measures crowd-sourced human preference. AD-Bench (2026) evaluates analytical task performance in advertising contexts, not copy quality. SalesLLM benchmarks persuasive capability in multi-turn dialogues, not static copy audit.
2.3 The reproducibility problem
Closed model benchmarks face a structural reproducibility challenge. A preregistered longitudinal study tracking three major model families over ten weekly waves found divergent stability trajectories: one model stable, one improving, one degrading mid-study. This model drift, combined with the absence of prompt versioning in generic LLM interfaces, means that a copy audit produced today is not reproducible tomorrow.
2.4 The empirical anchoring problem
The empirical literature on conversion copy has documented specific, quantified effects. Benchmarks that do not encode these findings cannot evaluate whether an LLM audit applies them.
| Source | Effect |
|---|---|
| NN/g, 1997 | +124% |
| Remote Marketers, 2025 | +29% |
| Unbounce, 57M conversions | −24% |
| YC / VC Corner, 2025 | −24% |

Quantified conversion effects from the CLEF empirical corpus.
2.5 Gap analysis: existing benchmarks vs CLEF properties
The following matrix maps five necessary properties of a valid copy audit framework against leading existing benchmarks and CLEF.
| Property | EQ-Bench / LMArena | WritingBench | AD-Bench | CLEF |
|---|---|---|---|---|
| Reproducibility across runs | No | No | Partial | Yes |
| Traceability to empirical sources | No | No | Partial | Yes |
| Conversion-specific dimensions | No | Partial | Partial | Yes |
| Quality gate on rewrites | No | No | No | Yes |
| Outcome validity (CPL / signup) | No | No | No | Partial |
| Practical applicability | Yes | Partial | Yes | Yes |
The gap is structural, not incidental. Existing benchmarks were not designed to evaluate conversion copy audit. CLEF is designed for exactly that purpose.
3. Theoretical foundations of CLEF
3.1 The empirical corpus
CLEF is grounded in 12+ primary sources with documented empirical findings on conversion copy. Each source must present a verifiable finding with a measurable effect, must be publicly accessible for independent verification, and must be from a recognized practitioner or research institution.
| Source | Key finding | Effect | CLEF dimension(s) |
|---|---|---|---|
| Morkes & Nielsen, NN/g, 1997 | Objective copy outperforms promotional copy on usability | +124% | Clarity, Specificity |
| Unbounce, 57M conversions | High reading-level copy converts less than simpler copy | −24% | Clarity |
| Remote Marketers, 2025 | Replacing buzzwords with specific outcomes (fintech case) | +29% | Specificity, Differentiation |
| Peep Laja, CXL, 2022–2025 | Clarity trumps persuasion. One goal, one CTA. | Framework | Clarity, Conversion Cues |
| Julian Shapiro, Demand Curve | CR = Desire − (Confusion + Friction). Mom test required. | Formula | Clarity, Attractiveness |
| Joanna Wiebe, Copyhackers | Every line must support the conversion argument. | Rule | Specificity, Conversion Cues |
| Shanelle Mullin, CXL, 2023 | Specific outcome testimonials near CTA lift conversion. | Placement | Conversion Cues, Specificity |
| Henneke Duistermaat, Unbounce | Banning buzzword categories forces specificity. | Anti-rules | Differentiation, Clarity |
| NN/g, Estes, 2013 | User-centric language over maker-centric. | Rule | Attractiveness, Clarity |
| NN/g, Wang, 2024 | Purpose must be communicated at a glance. | Rule | Clarity, Conversion Cues |
| NN/g, Harley, 2016 | Up-front disclosure builds credibility. | Rule | Specificity, Conversion Cues |
| YC / Dominguez, VC Corner, 2025 | 1:1 attention ratio. Complex copy hurts CVR. | Multiple | Clarity, Conversion Cues |
| Joel Gascoigne, Buffer, 2013 | Landing page must make a falsifiable promise. | Principle | Clarity, Specificity |
3.2 The five CLEF dimensions
Each dimension is defined at the level of behavioral mechanism. The specific criteria, formulation, weighting, and interaction logic constitute the proprietary implementation and are not published in this paper.
Clarity
Clarity measures the speed and accuracy with which a first-time visitor grasps what the product does, who it serves, and what they will get. Users who do not understand an offer within the first few seconds abandon the page. Clarity is the precondition for all other dimensions.
Attractiveness
Attractiveness measures the degree to which the copy engages the visitor's emotional drivers - desires, fears, aspirations, frustrations - and creates a sense that the offer is relevant and valuable to them. Users who understand an offer but feel no pull toward it do not convert.
Specificity
Specificity measures the degree to which claims are concrete, verifiable, and anchored in evidence rather than assertion. Specific claims are more credible than general claims, and credibility is a precondition for trust, which is a precondition for conversion. Replacing vague claims with specific outcome statements produced a 29% conversion increase in a documented case.
Conversion Cues
Conversion Cues measures how effectively the page architecture and micro-copy guide the visitor toward the primary action: CTA visibility and wording, friction reduction, placement of social proof relative to moments of hesitation, and the absence of competing CTAs.
Differentiation
Differentiation measures the degree to which the page communicates what is distinctively valuable about this offer relative to alternatives - direct competitors, substitutes, and the status quo. A visitor who does not see why this offer rather than another defaults to inaction.
3.3 Justification of the five-dimension structure
The five dimensions were selected by a single criterion: inclusion requires documented empirical evidence of a measurable effect on conversion outcomes. Dimensions based solely on aesthetic quality, stylistic preference, or writing craft were excluded regardless of editorial merit. This constraint is what distinguishes CLEF from a style guide and grounds it in the empirical literature rather than practitioner opinion.
4. The five properties of CLEF
CLEF is defined by five verifiable properties. A framework claiming to evaluate conversion copy must satisfy all five to be valid.
4.1 Reproducibility
Definition. Two independent executions of a CLEF-conformant audit on the same page, using the same prompt version, must produce scores within a defined tolerance interval. We propose a tolerance of ±3 points per dimension on a 20-point scale.
Why existing benchmarks fail this property. Generic LLM interfaces do not version prompts. The underlying model changes without announcement. Evaluation criteria are generated dynamically (WritingBench) or through crowd votes that shift over time (LMArena). A framework built on generic LLM calls without prompt stability guarantees cannot satisfy reproducibility.
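The tolerance criterion reduces to a pairwise check across runs. A minimal sketch, assuming a flat per-dimension score dictionary (illustrative names, not the Kicker schema):

```python
from itertools import combinations

DIMENSIONS = ["clarity", "attractiveness", "specificity",
              "conversion_cues", "differentiation"]
TOLERANCE = 3  # ±3 points per dimension on the 20-point scale

def within_tolerance(runs: list[dict[str, int]]) -> bool:
    """True if every pair of runs agrees within TOLERANCE on every dimension."""
    return all(abs(a[dim] - b[dim]) <= TOLERANCE
               for a, b in combinations(runs, 2)
               for dim in DIMENSIONS)

# Three audits of the same page under the same prompt version.
runs = [
    {"clarity": 14, "attractiveness": 11, "specificity": 16,
     "conversion_cues": 12, "differentiation": 9},
    {"clarity": 15, "attractiveness": 13, "specificity": 15,
     "conversion_cues": 12, "differentiation": 10},
    {"clarity": 13, "attractiveness": 12, "specificity": 17,
     "conversion_cues": 13, "differentiation": 9},
]
assert within_tolerance(runs)
```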
4.2 Traceability
Definition. Each diagnostic signal produced by a CLEF-conformant audit must be linkable to a specific rule, and each rule must be linkable to a specific source in the empirical corpus. The chain is: output signal → rule → source → measured effect.
Why existing benchmarks fail. Generic LLM outputs produce suggestions without stating criteria. WritingBench generates criteria dynamically per query. EQ-Bench produces narrative quality scores without linking them to conversion research. None provides a traceable chain.
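One way to make the chain auditable is to encode it directly in the audit's data model, so that a signal cannot be emitted without a rule and a rule cannot exist without a corpus source. A minimal sketch with hypothetical type and field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    citation: str         # e.g. "Morkes & Nielsen, NN/g, 1997"
    url: str              # publicly accessible for independent verification
    measured_effect: str  # e.g. "+124% usability"

@dataclass(frozen=True)
class Rule:
    rule_id: str          # stable, versioned identifier
    dimension: str        # exactly one primary CLEF dimension
    condition: str        # detectable condition, not an editorial preference
    source: Source        # the corpus entry that motivated the rule

@dataclass(frozen=True)
class Signal:
    page_section: str     # where the issue was detected
    finding: str          # what the audit flagged
    rule: Rule            # closes the chain: signal -> rule -> source -> effect
```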
4.3 Content validity
Definition. The dimensions evaluated must collectively cover the conversion-relevant dimensions documented in the empirical literature, without including dimensions not empirically grounded in conversion outcomes.
Why existing benchmarks fail. EQ-Bench and WritingBench include dimensions like emotional depth and narrative arc that have no documented relationship to conversion rate. Including them in a conversion copy audit is a validity failure: the framework measures something other than what it claims to measure.
4.4 Signal sensitivity
Definition. A CLEF-conformant framework must produce meaningfully different scores for pages that differ in conversion-relevant quality. A framework that assigns similar scores to a high-performing and a low-performing landing page provides no useful signal.
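This property can be operationalized as a separation test: the score gap between pages with known strong and known weak conversion outcomes must exceed the run-to-run tolerance from Section 4.1, so the difference cannot be attributed to noise. A minimal sketch, assuming dimension-level scores on the 20-point scale (the exact test statistic is an illustrative choice, not part of the framework):

```python
from statistics import mean

TOLERANCE = 3  # run-to-run tolerance per dimension (Section 4.1)

def is_sensitive(high_scores: list[int], low_scores: list[int]) -> bool:
    """True if the mean score gap between known-good and known-poor pages
    exceeds the reproducibility tolerance, i.e. the framework separates
    quality levels by more than its own noise floor."""
    return mean(high_scores) - mean(low_scores) > TOLERANCE

# Illustrative Clarity scores for pages with known conversion outcomes.
assert is_sensitive(high_scores=[16, 17, 15], low_scores=[8, 10, 9])
```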
4.5 Practical applicability
Definition. A CLEF-conformant audit must be executable without external human judges, must produce output actionable by a non-expert practitioner, and must run within a time and cost envelope compatible with early-stage product development. Verification: the audit completes in under 5 minutes, costs under €1 per page in API fees, and produces output actionable for a founder with no conversion-optimization background.
5. Conditions for a CLEF-conformant implementation
5.1 Required architecture
A CLEF-conformant implementation requires three sequential phases. Each is necessary; removing any one produces a framework that fails one or more of the five properties.
- Phase 1 - Diagnosis. The system must analyze the page against each of the five dimensions and produce a structured score per dimension with a diagnostic explanation per section. The output must be schema-constrained: a free-text response that mentions the five dimensions does not satisfy this condition.
- Phase 2 - Rewriting. The system must produce rewrite candidates for sections scoring below threshold. Rewrite candidates must be constrained in length relative to the original and anchored in the same rule set as the diagnosis.
- Phase 3 - Quality gate. Each rewrite candidate must be evaluated by an independent validation step before it surfaces in the output. Candidates that do not satisfy the relevant criteria must be discarded, not downgraded. The user must see only validated rewrites.
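Taken together, the three phases form a strict pipeline in which only gated rewrites ever reach the user. A minimal sketch of the control flow with stubbed LLM steps; the rewrite threshold and all function names are illustrative assumptions, not the proprietary Kicker pipeline (the ±30% length guard matches the value cited in Section 7):

```python
DIMENSIONS = ["clarity", "attractiveness", "specificity",
              "conversion_cues", "differentiation"]
REWRITE_THRESHOLD = 12   # assumed: rewrite sections scoring below this (/20)
LENGTH_GUARD = 0.30      # rewrite must stay within ±30% of original length

def diagnose(text: str) -> dict[str, int]:
    # Stub for Phase 1: a real implementation calls a schema-constrained
    # model with a versioned prompt and returns one score per dimension.
    return {dim: 10 for dim in DIMENSIONS}

def rewrite(text: str, dimension: str) -> str:
    # Stub for Phase 2: a rule-anchored rewrite candidate for one dimension.
    return text

def passes_gate(original: str, candidate: str) -> bool:
    # Phase 3, one detectable gate condition: the length guard.
    ratio = len(candidate) / max(len(original), 1)
    return (1 - LENGTH_GUARD) <= ratio <= (1 + LENGTH_GUARD)

def audit_page(sections: dict[str, str]) -> dict:
    report = {}
    for name, text in sections.items():
        scores = diagnose(text)                    # Phase 1: diagnosis
        rewrites = []
        for dim, score in scores.items():
            if score < REWRITE_THRESHOLD:
                candidate = rewrite(text, dim)     # Phase 2: rewriting
                if passes_gate(text, candidate):   # Phase 3: quality gate
                    rewrites.append((dim, candidate))
                # failed candidates are discarded, never downgraded
        report[name] = {"scores": scores, "rewrites": rewrites}
    return report
```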
5.2 Required properties of the rule set
- Rules must be stable across runs (not generated dynamically per request).
- Rules must be versioned and traceable to the corpus source that motivated them.
- Rules must be organized by dimension, with each rule assigned to exactly one primary dimension.
- Rules must be formulated as detectable conditions, not editorial preferences.
- Any change to the rule set must be reflected in the version identifier attached to all audits produced under that version.
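Versioning can be enforced mechanically by deriving the version identifier from the rule set's content, so any change to any rule necessarily changes the identifier stamped on subsequent audits. A minimal sketch with invented example rules (the actual rulebook is proprietary):

```python
import hashlib
import json

# Invented rule records for illustration; fields mirror the requirements above.
RULES = [
    {"rule_id": "CLA-001", "dimension": "clarity",
     "condition": "headline states what the product does and for whom",
     "source": "NN/g, Wang, 2024"},
    {"rule_id": "SPE-004", "dimension": "specificity",
     "condition": "no buzzword-category claim without a number or proof point",
     "source": "Duistermaat, Unbounce, 2016"},
]

def ruleset_version(rules: list[dict]) -> str:
    """Deterministic version identifier: a content hash of the canonical
    serialization, so every audit can be stamped with the exact rule set
    that produced it."""
    canonical = json.dumps(sorted(rules, key=lambda r: r["rule_id"]),
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

print(ruleset_version(RULES))  # changes whenever any rule changes
```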
5.3 What a CLEF-conformant implementation must not do
- Generate evaluation criteria dynamically per page (violates reproducibility and traceability).
- Present rewrite candidates without a validation step (violates reproducibility).
- Score dimensions without anchoring the score to the rule set (violates traceability).
- Produce a single holistic quality score without dimension-level breakdown (violates signal sensitivity).
- Change the rule set between runs of the same page version without changing the version identifier.
5.4 Public framework vs proprietary implementation
CLEF separates the public methodological standard from the proprietary implementation used by Kicker. The public framework consists of the five evaluation dimensions, the five validity properties, the minimum conditions for a conformant audit, and the benchmark protocol. The proprietary layer consists of the detailed rulebook, the exact scoring logic, the weighting and interaction rules, the prompt and schema design, the rewrite-validation pipeline, and the operational data produced by Kicker campaigns.
6. Preliminary validation
6.1 Reproducibility across runs
The reference implementation was tested for reproducibility across five independent runs on ten B2B SaaS landing pages (50 audits in total). Dimension scores stayed within the tolerance threshold (±3 points per dimension on the 20-point scale) for 94% of dimension-page combinations. The remaining 6% showed variance on the Attractiveness dimension, which involves the most interpretive sub-criteria.
6.2 Cost per lead result
Campaign context. The internal validation was conducted across 15 B2B SaaS campaign comparisons in the founder and early-stage product validation category. Campaigns ran on Meta Ads with fixed budgets and consistent target audiences across conditions.
Methodology. For each comparison, Condition A used the original landing page without modification. Condition B used a landing page revised following a full CLEF-conformant audit. All other variables (audience targeting, budget, ad creative format, campaign duration) were held as constant as operationally possible.
Result. Across the 15 comparisons, average CPL moved from approximately $8 in the baseline condition to approximately $4 after CLEF-conformant audit and rewrite - average reduction: 50%.
Disclaimer. N=15. Internal validation, not externally replicated. Presented as a practical signal that CLEF-conformant audits can produce meaningful outcome changes in real campaigns, not as proof that they generally produce 50% CPL reductions.
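As a concrete illustration of the primary metric, a toy computation of the per-comparison CPL reduction, with placeholder numbers rather than the actual campaign data:

```python
def cpl(spend: float, leads: int) -> float:
    """Cost per lead: total ad spend divided by leads generated."""
    return spend / leads

def reduction(baseline_cpl: float, treated_cpl: float) -> float:
    """Relative CPL reduction of the audited page vs the original."""
    return (baseline_cpl - treated_cpl) / baseline_cpl

# Placeholder (baseline, post-audit) CPL pairs, not the actual campaign data.
pairs = [(8.00, 4.00), (7.60, 3.80), (8.40, 4.20)]
avg_reduction = sum(reduction(a, b) for a, b in pairs) / len(pairs)
print(f"average CPL reduction: {avg_reduction:.0%}")  # 50%
```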
7. Model comparison: Kicker vs frontier LLMs
This comparison evaluates GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro against the five CLEF properties. All three are frontier models with documented strong performance on general writing benchmarks. The gap is not about raw capability. It is structural: none of them satisfy the CLEF properties by design, regardless of how good their outputs are.
| CLEF property | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro | Kicker |
|---|---|---|---|---|
| Prompt versioned and stable | No | No | No | Yes |
| Score variance across runs | Undocumented | Undocumented | Undocumented | 94% within tolerance |
| Rules linked to empirical sources | No | No | No | Yes |
| Output cites rule broken | No | No | No | Yes |
| Dimensions grounded in conversion research | No | Partial | Partial | Yes |
| Excludes non-conversion dimensions | No | No | No | Yes |
| Rewrites validated before output | No | No | No | Yes |
| Length guard on rewrites | None | None | None | ±30% max |
| Structured /100 score, 5 dimensions | No | No | No | Yes |
| Output actionable for non-expert founder | Partial | Partial | Partial | Yes |
The frontier LLMs lead on general writing benchmarks: Claude Opus 4.7 sits at #1 on EQ-Bench Creative Writing (Elo 2216), Gemini 3.1 Pro ranks #1 of 544 models on the Artificial Analysis Intelligence Index. CLEF does not measure literary quality. It measures conversion-specific audit validity.
8. Limitations and future work
8.1 Current limitations
- Dataset maturity. 15 internal campaign comparisons is sufficient to justify further study but insufficient to establish external validity.
- Absence of direct comparison. The preliminary validation compares against no audit (baseline), not against a generic LLM audit. The comparative advantage of CLEF over a well-prompted generic LLM remains unmeasured.
- Self-reference risk. The reference implementation and the framework were developed by the same team. Independent replication is required before the framework's properties can be considered validated.
- Model dependency. The reference implementation uses a specific commercial LLM as its base. Whether the five properties hold when the base model changes has not been tested.
- Single-language scope. All validation work was conducted on English-language landing pages.
8.2 The benchmark protocol
The following protocol is designed to produce statistically defensible results comparing CLEF-conformant audits against generic LLM audits and against baseline. Researchers wishing to contribute to the dataset are invited to contact the author.
- Dataset. Minimum 15 landing pages from B2B SaaS products at idea-validation or early-traction stage with active paid acquisition.
- Conditions. A: original page. B: page revised following CLEF-conformant audit. C: page revised following generic LLM with reference prompt (Appendix B). D: page revised following generic LLM with well-crafted conversion-focused prompt.
- Campaign controls. Identical audience targeting, budget, and non-copy creative. Minimum run of 7 days and 50 leads per condition.
- Qualitative validation. 3–5 independent CRO experts score each version blind; inter-rater agreement must reach Cohen's κ > 0.6 (see the sketch after this list).
- Primary metric. Cost per lead. Secondary: CTR, signup rate, bounce rate.
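For the qualitative validation step, agreement among 3–5 raters can be summarized as the mean pairwise Cohen's κ (κ itself is defined for two raters). A minimal sketch using scikit-learn, with placeholder ratings:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings: list[list[int]]) -> float:
    """Average Cohen's kappa over all rater pairs. Each inner list holds
    one rater's categorical scores, aligned by page version."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Placeholder ratings: 3 experts, each scoring 6 page versions on a 1-5 scale.
raters = [
    [4, 2, 5, 3, 4, 2],
    [4, 2, 5, 3, 4, 3],
    [4, 2, 5, 2, 4, 2],
]
assert mean_pairwise_kappa(raters) > 0.6
```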
9. Conclusion
The conversion copy audit is a ubiquitous practice with no evaluation standard. Founders, growth practitioners, and product teams use large language models to audit and rewrite their landing pages against a background of complete methodological opacity: no defined criteria, no reproducibility, no traceability to the empirical literature on what actually moves conversion rates.
To our knowledge, CLEF is among the first frameworks designed to satisfy all five properties - reproducibility, traceability, content validity, signal sensitivity, practical applicability - simultaneously. It is grounded in 12+ empirical sources with documented conversion effects. The framework is public at the methodological level; the detailed rule set, scoring logic, derivation method, and Kicker pipeline remain proprietary.
The preliminary validation produces a practical signal, not a definitive proof: an average 50% reduction in cost per lead across 15 internal B2B SaaS campaign comparisons. The protocol for producing statistically defensible and externally replicable results at scale is published here and open for contribution.
The framework is open as a methodological standard. Kicker is the proprietary implementation that operationalizes it through structured audit, validated rewrite, micro-budget traffic testing, and outcome interpretation. Researchers are invited to derive independent implementations and test the protocol.
Annotated bibliography
Full annotations of the 12+ sources constituting the empirical corpus. Each entry includes citation, verified URL, key finding, quantified effect where available, and primary CLEF dimension(s) supported.
Morkes, J. & Nielsen, J. (1997). Concise, Scannable and Objective: How to Write for the Web. Nielsen Norman Group.
nngroup.com/articles/concise-scannable-and-objective-how-to-write-for-the-web/
User tests comparing concise, scannable, and objective web copy against promotional equivalents. Combining all three improvements yielded a 124% usability improvement.
Effect: +124% usability · CLEF: Clarity, Specificity
Laja, P. (2022, updated 2025). How to Build a High-Converting Landing Page. CXL.
cxl.com/blog/how-to-build-a-high-converting-landing-page/
Clarity trumps persuasion. One page, one goal, one CTA. High reading-level copy converts 24% less than simpler copy.
Effect: −24% conversion · CLEF: Clarity, Conversion Cues
Shapiro, J. Startup Handbook: Landing Pages. Demand Curve.
demandcurve.com/playbooks/above-the-fold
Conversion Rate = Desire − (Confusion + Friction). Header must pass the mom test.
CLEF: Clarity, Attractiveness, Conversion Cues
Wiebe, J. Copyhackers.
Every line of copy, including testimonials, must support the conversion argument. Specific outcome testimonials convert.
CLEF: Specificity, Conversion Cues, Clarity
Mullin, S. (2023). Social Proof: Definition, Examples & How to Work With It. CXL.
cxl.com/blog/is-social-proof-really-that-important/
Specific outcome testimonials placed near CTA convert. Generic praise does not. CRAVENS framework.
CLEF: Conversion Cues, Specificity
Duistermaat, H. (2016). 17 Words to Avoid in Landing Pages. Unbounce.
unbounce.com/copywriting/17-words-to-stop-using/
"Market-leading", "world-class", "cutting-edge" add no informational value and undermine credibility.
CLEF: Specificity, Differentiation, Clarity
Estes, J. (2013). User-Centric vs. Maker-Centric Language. Nielsen Norman Group.
nngroup.com/articles/user-centric-language/
User-centric copy translates features into benefits in the user's vocabulary.
CLEF: Attractiveness, Clarity
Wang, H. (2024). Homepage Design: 5 Fundamental Principles. NN/g.
nngroup.com/articles/homepage-design/
Failing to communicate site purpose at a glance causes abandonment. Generic CTAs underperform.
CLEF: Clarity, Conversion Cues, Differentiation
Harley, A. (2016). Trustworthiness in Web Design. NN/g.
nngroup.com/articles/trustworthy-design/
Hiding pricing or key information triggers immediate distrust. Vague copy without evidence is treated as evasion.
CLEF: Specificity, Conversion Cues
Remote Marketers Newsletter (2025). Stop Marketing Fluff: Write Copy That Actually Converts.
Replacing "seamless onboarding" with "get started in 3 minutes, no finance degree needed" produced a 29% conversion increase (documented fintech case).
Effect: +29% conversion · CLEF: Specificity, Differentiation
Dominguez, R. (2025). YC Landing Page Formula. The VC Corner.
Every landing page should have one clear action (1:1 attention ratio). Pages with difficult words convert up to 24% less.
Effect: −24% conversion · CLEF: Clarity, Conversion Cues
Gascoigne, J. (2013). How to validate your idea with a Landing Page MVP. Buffer.
A landing page for idea validation must make a specific, falsifiable promise. Email signups alone are insufficient; a second step filtering intent produces more reliable validation data.
CLEF: Clarity, Specificity, Conversion Cues