Test one thing at a time
The clearest experiments test a single change:

- Good: Testing a new headline against the current one
- Less clear: Testing a new headline, different button color, and new layout together
Start with a hypothesis
Before creating an experiment, write down:

- What you’re changing: “We’re testing a shorter headline”
- What you expect: “We expect higher engagement because it’s easier to read”
- How you’ll measure: “We’ll track form submissions as the primary goal”
Choose the right goal
Your primary goal should directly measure what you’re trying to improve:

| Testing | Good primary goal |
|---|---|
| Headline copy | Form submissions or link clicks |
| Page layout | Scroll depth or time on page |
| Call-to-action | Button clicks or form submissions |
| Overall conversion | Form submissions or external purchases |
Wait for enough data
Reliable results need:

- At least 100 visitors per variant: Smaller samples are too noisy
- At least 14 days: Captures weekly patterns in traffic
- 95% probability threshold: Indicates statistical significance (a way to check this yourself is sketched after this list)
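If you want to sanity-check significance outside Blox, a standard two-proportion z-test is one way to do it. This is a minimal sketch, not Blox’s internal calculation, and the visitor and conversion counts are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical final counts for a two-variant experiment
visitors_a, conversions_a = 1200, 96   # control
visitors_b, conversions_b = 1180, 130  # variant

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled rate under the null hypothesis that both variants convert equally
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
# Two-sided p-value; p < 0.05 corresponds to the 95% threshold
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"control {p_a:.1%}, variant {p_b:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```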
Avoid peeking and stopping early
Checking results frequently and stopping as soon as something “looks” significant leads to false positives; the simulation after this list shows how quickly the error rate inflates. Instead:

- Set your end condition upfront (significance or duration)
- Enable auto-complete if you trust the statistical thresholds
- Resist the urge to stop early when one variant is ahead
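To see why peeking is dangerous, here is a small simulation of an A/A test: both variants convert at the same true rate, yet stopping at the first “significant” peek declares a winner far more often than the nominal 5%. The rate, batch size, and trial counts are arbitrary choices for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
RATE = 0.05            # identical true conversion rate for both variants
BATCH, PEEKS, TRIALS = 100, 20, 1000

early_stops = 0
for _ in range(TRIALS):
    conv_a = conv_b = n = 0
    for _ in range(PEEKS):            # check results after every batch
        conv_a += sum(random.random() < RATE for _ in range(BATCH))
        conv_b += sum(random.random() < RATE for _ in range(BATCH))
        n += BATCH
        if p_value(conv_a, n, conv_b, n) < 0.05:
            early_stops += 1          # stopped on a false "winner"
            break

print(f"False positive rate with peeking: {early_stops / TRIALS:.1%}")
```

With a single pre-planned check, the false positive rate would stay near 5%; repeated peeking multiplies the chances of crossing the threshold by luck alone.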
Watch for integrity warnings
Blox alerts you to events that could affect data quality:

Variant republished during experiment
If you edit and republish a variant while the experiment is running, the data before and after the change isn’t comparable. The warning notes when this happened. Best practice: Avoid editing variants during experiments. If you must make changes, consider restarting the experiment.

Experiment paused
Pausing creates a gap in data collection and can introduce bias if the pause happened during unusual traffic patterns. Best practice: Minimize pauses. If you need to pause, note why and consider whether results are still valid.

Traffic allocation
Equal splits (50/50 for two variants) maximize statistical power; the comparison sketched after this list shows why. Unequal splits are useful when:

- You want to minimize exposure to a risky change (e.g., 90/10)
- You’re confident in a change and want most visitors to see it
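The power cost of a skewed split comes from the standard error of the difference between the two conversion rates, which is smallest when the groups are equal. A rough comparison, assuming a hypothetical 5% baseline rate and 2,000 total visitors:

```python
from math import sqrt

def se_of_difference(total_visitors, split, rate=0.05):
    """Standard error of the difference in conversion rates,
    where `split` is the fraction of traffic sent to the control."""
    n_a = total_visitors * split
    n_b = total_visitors * (1 - split)
    return sqrt(rate * (1 - rate) * (1 / n_a + 1 / n_b))

for split in (0.5, 0.7, 0.9):
    se = se_of_difference(2000, split)
    print(f"{split:.0%}/{1 - split:.0%} split: SE = {se:.4f}")
```

The 50/50 split yields the smallest standard error, so a lift of a given size reaches significance fastest; a 90/10 split needs noticeably more total traffic to detect the same difference.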
Document your experiments
Keep records of:

- What you tested and why
- The results and statistical confidence
- What action you took (implemented winner, ran follow-up test, etc.)
- Learnings for future experiments
Sequential testing
If your first experiment isn’t conclusive:

- Analyze why (not enough traffic, too small a change, wrong goal)
- Form a new hypothesis based on learnings
- Create a new experiment with refined variants
When to trust results
High confidence in results comes from:

- Large sample sizes (hundreds or thousands of visitors per variant)
- Consistent performance over time (not just a spike)
- Results that align with your hypothesis
- Meaningful effect sizes (not just statistically significant but practically important)
Be more skeptical of results with:

- Small sample sizes
- Erratic performance over time
- Results that contradict your hypothesis without explanation
- Very small effect sizes that may not matter in practice (the sketch below shows one way to gauge this)
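One way to judge practical importance is to look at a confidence interval for the lift rather than the probability alone. A minimal sketch, again with made-up counts:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical final counts
visitors_a, conversions_a = 5000, 400   # control: 8.0%
visitors_b, conversions_b = 5000, 430   # variant: 8.6%

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
diff = p_b - p_a

# 95% confidence interval for the absolute lift (unpooled standard error)
se = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
z95 = NormalDist().inv_cdf(0.975)   # about 1.96
low, high = diff - z95 * se, diff + z95 * se

print(f"lift: {diff:+.1%} (95% CI {low:+.1%} to {high:+.1%})")
```

If the entire interval sits above the smallest lift you would act on, the result is practically as well as statistically meaningful; an interval hugging zero suggests the effect may not matter even if it is “significant.”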
After the experiment
When you have a winner:

- Set it as the default variant
- Archive or remove losing variants
- Document what you learned
- Plan your next experiment