1. Introduction

1.1 The practical problem

Founders building early-stage products routinely use large language models to evaluate and rewrite their landing pages. The workflow is simple: paste the page content, ask for feedback. The output varies from a bullet list of stylistic suggestions to a partial rewrite with no stated criteria. There is no standard for what a high-quality copy audit looks like, no rubric for evaluating the auditor, and no mechanism for knowing whether following the recommendations improves conversion.

This is not a niche problem. Landing page copy is one of the highest-leverage variables in early customer acquisition. A 20% improvement in conversion rate on a landing page has the same effect on pipeline as a 20% increase in ad spend, at zero marginal cost. Yet the tools used to optimize that copy operate without an evaluation framework.

1.2 The measurement problem

LLM benchmarks have evolved across four generations. From MMLU and SuperGLUE through HELM and BIG-Bench to Humanity's Last Exam, and now toward domain-specific instruments like SalesLLM, AD-Bench, and HealthBench, a consistent pattern emerges: general benchmarks saturate quickly and fail to predict performance on specialized tasks. Writing quality benchmarks measure fluency, coherence, and human preference - none of which are proven proxies for conversion impact.

Three specific gaps characterize the current state. First, reproducibility: a copy audit produced by a generic LLM varies between runs because the model changes without a public changelog and the prompt is not versioned. Second, traceability: when a generic LLM flags a headline as weak, it provides no reference to a conversion research finding. Third, outcome validity: no existing benchmark measures whether following an LLM's copy recommendations produces a measurable improvement in cost per lead, signup rate, or any other conversion metric.

1.3 The contribution of this paper

We make three contributions. First, we map the existing landscape of LLM writing and marketing benchmarks against five properties we argue are necessary for a valid copy audit framework. Second, we propose CLEF, a framework defined by those five properties and grounded in a verified corpus of empirical conversion research. Third, we describe the conditions a CLEF-conformant implementation must satisfy and the protocol required to validate the framework at scale.

1.4 From CLEF to Kicker

Kicker operationalizes CLEF as a product workflow. The framework defines what a valid conversion-grounded landing page audit must evaluate; Kicker turns that framework into an executable process: landing page extraction, rule-based diagnosis, dimension-level scoring, rewrite generation, quality-gated recommendations, micro-budget traffic testing, and post-test interpretation.

This distinction matters. CLEF is the methodological layer. Kicker is the applied layer: it uses these principles to help early-stage founders move from subjective copy feedback to measurable demand signals such as signup rate, cost per lead, and audience-level traction.
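The audit stages named above can be sketched end to end. This is a minimal, runnable illustration, not Kicker's actual implementation: every name, the single toy buzzword rule, and the deduction-based scoring are assumptions, since the real rulebook and scoring logic are proprietary.

```python
# Runnable sketch of the audit stages described above. All names, the
# buzzword list, and the toy scoring are illustrative assumptions; the
# actual Kicker rulebook and scoring logic are not published.
from dataclasses import dataclass

DIMENSIONS = ["Clarity", "Attractiveness", "Specificity",
              "Conversion Cues", "Differentiation"]
BUZZWORDS = ["seamless", "world-class", "cutting-edge"]  # hypothetical rule

@dataclass
class AuditResult:
    scores: dict      # dimension -> score on a 20-point scale
    diagnoses: list   # rule-level findings
    rewrites: list    # rewrites that passed the quality gate

def diagnose(copy: str) -> list:
    # Rule-based diagnosis (here, a single toy buzzword rule).
    return [f"buzzword: {w!r}" for w in BUZZWORDS if w in copy.lower()]

def score_dimensions(diagnoses: list) -> dict:
    # Dimension-level scoring (toy deduction per finding).
    return {d: max(0, 15 - 3 * len(diagnoses)) for d in DIMENSIONS}

def quality_gate(rewrite: str) -> bool:
    # Quality gate: reject rewrites that reintroduce a flagged buzzword.
    return not any(w in rewrite.lower() for w in BUZZWORDS)

def run_audit(copy: str, candidate_rewrites: list) -> AuditResult:
    diagnoses = diagnose(copy)
    scores = score_dimensions(diagnoses)
    gated = [r for r in candidate_rewrites if quality_gate(r)]
    return AuditResult(scores, diagnoses, gated)
```

Landing page extraction, micro-budget traffic testing, and post-test interpretation are omitted here; they sit before and after the audit itself.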

2. State of the art

2.1 Four generations of LLM benchmarks

The history of LLM evaluation reflects a recurring dynamic: each generation of benchmarks is created to address the limitations of the previous one, and saturates as models improve faster than benchmark designers anticipated.

2.2 Existing benchmarks for writing and marketing copy

EQ-Bench Creative Writing evaluates narrative quality, emotional depth, and prose style. Claude Opus 4.7 leads with an Elo score of 2216. It measures literary quality, not conversion-specific effectiveness.

WritingBench (NeurIPS 2025) covers six writing domains including Advertising and Marketing. It evaluates 15 models across 1,239 queries using five instance-specific criteria dynamically generated by the evaluation model itself - precisely what CLEF argues against.

LMArena Text measures crowd-sourced human preference. AD-Bench (2026) evaluates analytical task performance in advertising contexts, not copy quality. SalesLLM benchmarks persuasive capability in multi-turn dialogues, not static copy audit.

2.3 The reproducibility problem

Closed model benchmarks face a structural reproducibility challenge. A preregistered longitudinal study tracking three major model families over ten weekly waves found divergent stability trajectories: one model stable, one improving, one degrading mid-study. This model drift, combined with the absence of prompt versioning in generic LLM interfaces, means that a copy audit produced today is not reproducible tomorrow.

2.4 The empirical anchoring problem

The empirical literature on conversion copy has documented specific, quantified effects. Benchmarks that do not encode these findings cannot evaluate whether an LLM audit applies them.

Documented effects in the empirical corpus

Finding | Source | Effect
Objective vs promotional copy | NN/g, 1997 | +124%
Buzzword → specific outcome | Remote Marketers, 2025 | +29%
High reading-level copy | Unbounce, 57M conversions | −24%
Complex language (mom test) | YC / VC Corner, 2025 | −24%

Quantified conversion effects from the CLEF empirical corpus.

2.5 Gap analysis: existing benchmarks vs CLEF properties

The following matrix maps the necessary properties of a valid copy audit framework against leading existing benchmarks and CLEF.

Property | EQ-Bench / LMArena | WritingBench | AD-Bench | CLEF
Reproducibility across runs | No | No | Partial | Yes
Traceability to empirical sources | No | No | Partial | Yes
Conversion-specific dimensions | No | Partial | Partial | Yes
Quality gate on rewrites | No | No | No | Yes
Outcome validity (CPL / signup) | No | No | No | Partial
Practical applicability | Yes | Partial | Yes | Yes

The gap is structural, not incidental. Existing benchmarks were not designed to evaluate conversion copy audit. CLEF is designed for exactly that purpose.

3. Theoretical foundations of CLEF

3.1 The empirical corpus

CLEF is grounded in 12+ primary sources with documented empirical findings on conversion copy. Each source must present a verifiable finding with a measurable effect, must be publicly accessible for independent verification, and must be from a recognized practitioner or research institution.

Source | Key finding | Effect | CLEF dimension(s)
Morkes & Nielsen, NN/g, 1997 | Objective copy outperforms promotional copy on usability | +124% | Clarity, Specificity
Unbounce, 57M conversions | High reading-level copy converts less than simpler copy | −24% | Clarity
Remote Marketers, 2025 | Replacing buzzwords with specific outcomes (fintech case) | +29% | Specificity, Differentiation
Peep Laja, CXL, 2022–2025 | Clarity trumps persuasion. One goal, one CTA. | Framework | Clarity, Conversion Cues
Julian Shapiro, Demand Curve | CR = Desire − (Confusion + Friction). Mom test required. | Formula | Clarity, Attractiveness
Joanna Wiebe, Copyhackers | Every line must support the conversion argument. | Rule | Specificity, Conversion Cues
Shanelle Mullin, CXL, 2023 | Specific outcome testimonials near CTA lift conversion. | Placement | Conversion Cues, Specificity
Henneke Duistermaat, Unbounce | Banning buzzword categories forces specificity. | Anti-rules | Differentiation, Clarity
NN/g, Estes, 2013 | User-centric language over maker-centric. | Rule | Attractiveness, Clarity
NN/g, Wang, 2024 | Purpose must be communicated at a glance. | Rule | Clarity, Conversion Cues
NN/g, Harley, 2016 | Up-front disclosure builds credibility. | Rule | Specificity, Conversion Cues
YC / Dominguez, VC Corner, 2025 | 1:1 attention ratio. Complex copy hurts CVR. | Multiple | Clarity, Conversion Cues
Joel Gascoigne, Buffer, 2013 | Landing page must make a falsifiable promise. | Principle | Clarity, Specificity

3.2 The five CLEF dimensions

Each dimension is defined at the level of behavioral mechanism. The specific criteria, formulation, weighting, and interaction logic constitute the proprietary implementation and are not published in this paper.

Clarity

Clarity measures the speed and accuracy with which a first-time visitor grasps what the product does, who it serves, and what they will get. Users who do not understand an offer within the first few seconds abandon the page. Clarity is the precondition for all other dimensions.

Attractiveness

Attractiveness measures the degree to which the copy engages the visitor's emotional drivers - desires, fears, aspirations, frustrations - and creates a sense that the offer is relevant and valuable to them. Users who understand an offer but feel no pull toward it do not convert.

Specificity

Specificity measures the degree to which claims are concrete, verifiable, and anchored in evidence rather than assertion. Specific claims are more credible than general claims, and credibility is a precondition for trust, which is a precondition for conversion. Replacing vague claims with specific outcome statements produced a 29% conversion increase in a documented case.

Conversion Cues

Conversion Cues measures how effectively the page architecture and micro-copy guide the visitor toward the primary action: CTA visibility and wording, friction reduction, placement of social proof relative to moments of hesitation, and the absence of competing CTAs.

Differentiation

Differentiation measures the degree to which the page communicates what is distinctively valuable about this offer relative to alternatives - direct competitors, substitutes, and the status quo. A visitor who does not see why this offer rather than another defaults to inaction.

3.3 Justification of the five-dimension structure

The five dimensions were selected by a single criterion: inclusion requires documented empirical evidence of a measurable effect on conversion outcomes. Dimensions based solely on aesthetic quality, stylistic preference, or writing craft were excluded regardless of editorial merit. This constraint is what distinguishes CLEF from a style guide and grounds it in the empirical literature rather than practitioner opinion.

4. The five properties of CLEF

CLEF is defined by five verifiable properties. A framework claiming to evaluate conversion copy must satisfy all five to be valid.

4.1 Reproducibility

Definition. Two independent executions of a CLEF-conformant audit on the same page, using the same prompt version, must produce scores within a defined tolerance interval. We propose a tolerance of ±3 points per dimension on a 20-point scale.

Why existing benchmarks fail this property. Generic LLM interfaces do not version prompts. The underlying model changes without announcement. Evaluation criteria are generated dynamically (WritingBench) or through crowd votes that shift over time (LMArena). A framework built on generic LLM calls without prompt stability guarantees cannot satisfy reproducibility.
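Stated as code, the tolerance criterion in the definition above reduces to a per-dimension comparison. A minimal sketch, assuming each run's scores are keyed by dimension name:

```python
TOLERANCE = 3  # ±3 points per dimension on a 20-point scale (from the definition)

def reproducible(run_a: dict, run_b: dict, tol: int = TOLERANCE) -> bool:
    """True when two independent executions of the same audit, with the
    same prompt version, differ by at most `tol` on every dimension."""
    return run_a.keys() == run_b.keys() and all(
        abs(run_a[d] - run_b[d]) <= tol for d in run_a)
```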

4.2 Traceability

Definition. Each diagnostic signal produced by a CLEF-conformant audit must be linkable to a specific rule, and each rule must be linkable to a specific source in the empirical corpus. The chain is: output signal → rule → source → measured effect.

Why existing benchmarks fail. Generic LLM outputs produce suggestions without stating criteria. WritingBench generates criteria dynamically per query. EQ-Bench produces narrative quality scores without linking them to conversion research. None provides a traceable chain.
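The required chain can be represented as linked records, so that every user-facing signal is mechanically traceable back to a measured effect. A sketch with hypothetical names (`Source`, `Rule`, and `Signal` are illustrative, not types from the paper):

```python
# Sketch of the traceability chain: output signal -> rule -> source -> effect.
# Type and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    citation: str   # entry in the empirical corpus
    effect: str     # measured effect, e.g. "+29% conversion"

@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    source: Source  # every rule links to a corpus source

@dataclass(frozen=True)
class Signal:
    finding: str    # diagnostic output shown to the user
    rule: Rule      # every signal links to the rule it fired

def trace(signal: Signal) -> str:
    """Render the full chain for one diagnostic signal."""
    return (f"{signal.finding} -> {signal.rule.rule_id} -> "
            f"{signal.rule.source.citation} -> {signal.rule.source.effect}")
```

Because each link is a required field, a signal without a rule, or a rule without a source, cannot be constructed at all.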

4.3 Content validity

Definition. The dimensions evaluated must collectively cover the conversion-relevant dimensions documented in the empirical literature, without including dimensions not empirically grounded in conversion outcomes.

Why existing benchmarks fail. EQ-Bench and WritingBench include dimensions like emotional depth and narrative arc that have no documented relationship to conversion rate. Including them in a conversion copy audit is a validity failure: the framework measures something other than what it claims to measure.

4.4 Signal sensitivity

Definition. A CLEF-conformant framework must produce meaningfully different scores for pages that differ in conversion-relevant quality. A framework that assigns similar scores to a high-performing and a low-performing landing page provides no useful signal.

4.5 Practical applicability

Definition. A CLEF-conformant audit must be executable without requiring external human judges, must produce output actionable by a non-expert practitioner, and must run within a time and cost envelope compatible with early-stage product development. Verification: the audit completes in under 5 minutes, incurs under €1 per page in API costs, and produces output actionable for a founder with no conversion optimization background.

5. Conditions for a CLEF-conformant implementation

5.1 Required architecture

A CLEF-conformant implementation requires three sequential phases. Each is necessary; removing any one produces a framework that fails one or more of the five properties.

5.2 Required properties of the rule set

5.3 What a CLEF-conformant implementation must not do

5.4 Public framework vs proprietary implementation

CLEF separates the public methodological standard from the proprietary implementation used by Kicker. The public framework consists of the five evaluation dimensions, the five validity properties, the minimum conditions for a conformant audit, and the benchmark protocol. The proprietary layer consists of the detailed rulebook, the exact scoring logic, the weighting and interaction rules, the prompt and schema design, the rewrite-validation pipeline, and the operational data produced by Kicker campaigns.

6. Preliminary validation

6.1 Reproducibility across runs

The reference implementation was tested for reproducibility across five independent runs on ten B2B SaaS landing pages. Dimension scores were consistent within the tolerance threshold (±3 points per dimension) for 94% of dimension-page combinations. The remaining 6% of cases showed variance on the Attractiveness dimension, which involves the most interpretive sub-criteria.

Summary: 5 independent runs × 10 B2B SaaS landing pages = 50 audits; 94% of dimension-page combinations stayed within the ±3-point tolerance threshold on a 20-point scale.
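The 94% figure is an aggregate over (page, dimension) combinations. A sketch of that aggregation, assuming "within tolerance" means the max-min score spread across all runs stays within ±3 (the paper does not specify the exact aggregation, so this interpretation is an assumption):

```python
def within_tolerance_share(runs: list, tol: int = 3) -> float:
    """Share of (page, dimension) combinations whose score spread across
    all runs stays within the tolerance. `runs` is a list of
    {page: {dimension: score}} mappings, one per independent run."""
    hits = total = 0
    for page in runs[0]:
        for dim in runs[0][page]:
            scores = [run[page][dim] for run in runs]
            total += 1
            hits += (max(scores) - min(scores) <= tol)
    return hits / total
```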

6.2 Cost per lead result

Campaign context. The internal validation was conducted across 15 B2B SaaS campaign comparisons in the founder and early-stage product validation category. Campaigns ran on Meta Ads with fixed budgets and consistent target audiences across conditions.

Methodology. For each comparison, Condition A used the original landing page without modification. Condition B used a landing page revised following a full CLEF-conformant audit. All other variables (audience targeting, budget, ad creative format, campaign duration) were held as constant as operationally possible.

Result. Across the 15 comparisons, average CPL moved from approximately $8 in the baseline condition to approximately $4 after CLEF-conformant audit and rewrite - average reduction: 50%.

Summary: average CPL fell from approximately $8 (baseline, no audit) to approximately $4 after CLEF-conformant audit, a 50% average reduction (N=15 B2B SaaS campaigns, May 2026).

Disclaimer. N=15. Internal validation, not externally replicated. Presented as a practical signal that CLEF-conformant audits can produce meaningful outcome changes in real campaigns, not as proof that they generally produce 50% CPL reductions.
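For reference, the headline figure reduces to simple arithmetic over paired comparisons. Averaging per-comparison relative reductions is an assumption about how the aggregate was computed:

```python
def avg_cpl_reduction(pairs: list) -> float:
    """Average relative CPL reduction over (baseline_cpl, audited_cpl)
    pairs, e.g. 0.5 means a 50% average reduction."""
    return sum((before - after) / before for before, after in pairs) / len(pairs)
```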

7. Model comparison: Kicker vs frontier LLMs

This comparison evaluates GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro against the five CLEF properties. All three are frontier models with documented strong performance on general writing benchmarks. The gap is not about raw capability. It is structural: none of them satisfy the CLEF properties by design, regardless of how good their outputs are.

CLEF property | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro | Kicker
Prompt versioned and stable | No | No | No | Yes
Score variance across runs | Undocumented | Undocumented | Undocumented | 94% within tolerance
Rules linked to empirical sources | No | No | No | Yes
Output cites rule broken | No | No | No | Yes
Dimensions grounded in conversion research | No | Partial | Partial | Yes
Excludes non-conversion dimensions | No | No | No | Yes
Rewrites validated before output | No | No | No | Yes
Length guard on rewrites | None | None | None | ±30% max
Structured /100 score, 5 dimensions | No | No | No | Yes
Output actionable for non-expert founder | Partial | Partial | Partial | Yes

The frontier LLMs lead on general writing benchmarks: Claude Opus 4.7 sits at #1 on EQ-Bench Creative Writing (Elo 2216), Gemini 3.1 Pro ranks #1 of 544 models on the Artificial Analysis Intelligence Index. CLEF does not measure literary quality. It measures conversion-specific audit validity.
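The ±30% length guard in the table above is the one row with a directly checkable definition. As a one-line sketch (measuring length in characters is an assumption; the guard could equally count words):

```python
def passes_length_guard(original: str, rewrite: str,
                        max_dev: float = 0.30) -> bool:
    """Reject rewrites whose length deviates more than ±30% from the
    original, preventing rewrites that silently cut or pad the copy."""
    return abs(len(rewrite) - len(original)) <= max_dev * len(original)
```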

8. Limitations and future work

8.1 Current limitations

8.2 The benchmark protocol

The following protocol is designed to produce statistically defensible results comparing CLEF-conformant audits against generic LLM audits and against baseline. Researchers wishing to contribute to the dataset are invited to contact the author.

9. Conclusion

The conversion copy audit is a ubiquitous practice with no evaluation standard. Founders, growth practitioners, and product teams use large language models to audit and rewrite their landing pages against a background of complete methodological opacity: no defined criteria, no reproducibility, no traceability to the empirical literature on what actually moves conversion rates.

To our knowledge, CLEF is among the first frameworks designed to satisfy all five properties - reproducibility, traceability, content validity, signal sensitivity, practical applicability - simultaneously. It is grounded in 12+ empirical sources with documented conversion effects. The framework is public at the methodological level; the detailed rule set, scoring logic, derivation method, and Kicker pipeline remain proprietary.

The preliminary validation produces a practical signal, not a definitive proof: an average 50% reduction in cost per lead across 15 internal B2B SaaS campaign comparisons. The protocol for producing statistically defensible and externally replicable results at scale is published here and open for contribution.

The framework is open as a methodological standard. Kicker is the proprietary implementation that operationalizes it through structured audit, validated rewrite, micro-budget traffic testing, and outcome interpretation. Researchers are invited to derive independent implementations and test the protocol.


Annotated bibliography

Full annotations of the 12+ sources constituting the empirical corpus. Each entry includes citation, verified URL, key finding, quantified effect where available, and primary CLEF dimension(s) supported.

Morkes, J. & Nielsen, J. (1997). Concise, Scannable and Objective: How to Write for the Web. Nielsen Norman Group.

nngroup.com/articles/concise-scannable-and-objective-how-to-write-for-the-web/

User tests comparing concise, scannable, and objective web copy against promotional equivalents. Combining all three improvements yielded a 124% usability improvement.
Effect: +124% usability · CLEF: Clarity, Specificity

Laja, P. (2022, updated 2025). How to Build a High-Converting Landing Page. CXL.

cxl.com/blog/how-to-build-a-high-converting-landing-page/

Clarity trumps persuasion. One page, one goal, one CTA. High reading-level copy converts 24% less than simpler copy.
Effect: −24% conversion · CLEF: Clarity, Conversion Cues

Shapiro, J. Startup Handbook: Landing Pages. Demand Curve.

demandcurve.com/playbooks/above-the-fold

Conversion Rate = Desire − (Confusion + Friction). Header must pass the mom test.
CLEF: Clarity, Attractiveness, Conversion Cues

Wiebe, J. Copyhackers.

Every line of copy, including testimonials, must support the conversion argument. Specific outcome testimonials convert.
CLEF: Specificity, Conversion Cues, Clarity

Mullin, S. (2023). Social Proof: Definition, Examples & How to Work With It. CXL.

cxl.com/blog/is-social-proof-really-that-important/

Specific outcome testimonials placed near CTA convert. Generic praise does not. CRAVENS framework.
CLEF: Conversion Cues, Specificity

Duistermaat, H. (2016). 17 Words to Avoid in Landing Pages. Unbounce.

unbounce.com/copywriting/17-words-to-stop-using/

"Market-leading", "world-class", "cutting-edge" add no informational value and undermine credibility.
CLEF: Specificity, Differentiation, Clarity

Estes, J. (2013). User-Centric vs. Maker-Centric Language. Nielsen Norman Group.

nngroup.com/articles/user-centric-language/

User-centric copy translates features into benefits in the user's vocabulary.
CLEF: Attractiveness, Clarity

Wang, H. (2024). Homepage Design: 5 Fundamental Principles. NN/g.

nngroup.com/articles/homepage-design/

Failing to communicate site purpose at a glance causes abandonment. Generic CTAs underperform.
CLEF: Clarity, Conversion Cues, Differentiation

Harley, A. (2016). Trustworthiness in Web Design. NN/g.

nngroup.com/articles/trustworthy-design/

Hiding pricing or key information triggers immediate distrust. Vague copy without evidence is treated as evasion.
CLEF: Specificity, Conversion Cues

Remote Marketers Newsletter (2025). Stop Marketing Fluff: Write Copy That Actually Converts.

Replacing "seamless onboarding" with "get started in 3 minutes, no finance degree needed" produced a 29% conversion increase (documented fintech case).
Effect: +29% conversion · CLEF: Specificity, Differentiation

Dominguez, R. (2025). YC Landing Page Formula. The VC Corner.

Every landing page should have one clear action (1:1 attention ratio). Pages with difficult words convert up to 24% less.
Effect: −24% conversion · CLEF: Clarity, Conversion Cues

Gascoigne, J. (2013). How to validate your idea with a Landing Page MVP. Buffer.

A landing page for idea validation must make a specific, falsifiable promise. Email signups alone are insufficient; a second step filtering intent produces more reliable validation data.
CLEF: Clarity, Specificity, Conversion Cues