Cold email testing is the same as any direct response testing: one variable at a time, large enough sample, long enough window, measurement that reflects real outcomes. Most teams A/B test badly, declare winners early, and optimize toward noise.
50/50 split by default. For risky challengers (new copy that could tank reply rate), 70/30 in favor of the control until you see early signal.
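A minimal sketch of how a weighted split can be wired up in a plain Python script; the function name, variant labels, and seed are illustrative, not features of any sending tool:

```python
import random

def assign_variants(prospects, weights=(0.5, 0.5), seed=42):
    """Randomly assign each prospect to control or challenger."""
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    labels = ("control", "challenger")
    return {p: rng.choices(labels, weights=weights)[0] for p in prospects}

# Default is 50/50; a risky challenger gets 70/30 in favor of the control.
assignments = assign_variants([f"prospect_{i}" for i in range(1000)], weights=(0.7, 0.3))
print(sum(v == "control" for v in assignments.values()))   # roughly 700
```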
500 emails per variant × a 3% reply rate = 15 replies per variant. A 1-reply difference between variants is noise, not signal. For small-sample cold email tests, you need large differences (50%+ lift) before you can confidently say one variant won.
Rule of thumb: if your test produced under 10 replies per variant, you don't have enough data. Keep running or increase volume.
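For a rough sanity check on whether a gap is signal, a pooled two-proportion z-test is one common option. This is a stdlib-only sketch; the 1.96 cutoff is the usual 95% confidence threshold, not a cold-email-specific rule:

```python
import math

def reply_rate_z(sent_a, replies_a, sent_b, replies_b):
    """Pooled two-proportion z-score; |z| >= ~1.96 is roughly 95% confidence."""
    p_a, p_b = replies_a / sent_a, replies_b / sent_b
    p_pool = (replies_a + replies_b) / (sent_a + sent_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))
    return (p_a - p_b) / se

# 15 vs 16 replies on 500 sends each: a 1-reply gap is noise.
print(reply_rate_z(500, 15, 500, 16))   # about -0.18
# 15 vs 30 replies (a 100% lift) clears the bar.
print(reply_rate_z(500, 15, 500, 30))   # about -2.3
```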
Overall reply rate includes "unsubscribe," "wrong person," "not interested." These aren't conversions. Positive reply rate (interested, want to talk, send me more info) is the metric that correlates to pipeline.
Track both: overall reply rate tells you about deliverability and subject line effectiveness; positive reply rate tells you about offer and copy quality.
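One way to keep the two metrics separate, assuming each reply gets tagged with a label as it comes in; the tag names here are hypothetical:

```python
POSITIVE = {"interested", "wants_call", "send_info"}          # counts toward pipeline
NEGATIVE = {"unsubscribe", "wrong_person", "not_interested"}  # a reply, not a conversion

def reply_rates(sent, reply_tags):
    """reply_tags: one label per reply received."""
    overall = len(reply_tags) / sent
    positive = sum(tag in POSITIVE for tag in reply_tags) / sent
    return overall, positive

overall, positive = reply_rates(
    sent=500,
    reply_tags=["interested", "unsubscribe", "not_interested", "wants_call", "wrong_person"],
)
print(f"overall {overall:.1%}, positive {positive:.1%}")      # overall 1.0%, positive 0.4%
```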
"Let's test new subject + new first line + new CTA." Result: you don't know which change moved the needle. Can't scale what worked.
On day 2, variant A has 5% vs variant B's 3%. "A wins!" Run it to the full sample. The gap often closes or reverses (see the simulation after these mistakes).
Testing on 100 prospects and acting on the results. You're optimizing toward noise.
Tweaking copy partway through invalidates results. Wait for test completion before iterating.
Launching a new campaign without a control to compare against. Without a baseline, "it's working" is meaningless, working compared to what?
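To see how easily early peeking misleads, here is a small simulation where both variants share the same true 3% reply rate; the peek size, trial count, and seed are arbitrary choices, not campaign data:

```python
import random

rng = random.Random(7)
TRUE_RATE = 0.03          # both variants have the SAME true reply rate
PEEK, FULL = 100, 500     # "day 2" peek vs full sample, per variant

def replies(n):
    return sum(rng.random() < TRUE_RATE for _ in range(n))

misleading = 0
trials = 1000
for _ in range(trials):
    a_early, b_early = replies(PEEK), replies(PEEK)
    a_full = a_early + replies(FULL - PEEK)
    b_full = b_early + replies(FULL - PEEK)
    if a_early != b_early:
        # did the day-2 "winner" still win at the full sample?
        early_leader_won = (a_full > b_full) if a_early > b_early else (b_full > a_full)
        if not early_leader_won:
            misleading += 1

print(f"early leader failed to win the full sample in {misleading / trials:.0%} of trials")
```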
Scientific testing is covered in more depth in the direct response section; the same principles apply to cold email, VSLs, landing pages, and ads.