# DNS1 Full Strategy Explainer

Generated: 2026-05-17

This is the canonical explanatory document for the retained DNS1 final real
one-year backtest package. It explains:

- what the premise of the problem is
- what the solution is
- what the algorithm is
- what the latest version is
- what the next improvement should be

The current retained package is:

`artifacts/final_real_backtest_1y`

The central result table is:

`artifacts/final_real_backtest_1y/all_strategy_results.csv`

The selected strategy manifest is:

`artifacts/final_real_backtest_1y/best_strategy_manifest.json`

The runnable packaged code is:

`artifacts/final_real_backtest_1y/code/scripts/run_real_strategy_sweep.py`

The reusable strategy engine is:

`artifacts/final_real_backtest_1y/code/src/dn_research/strategy_lab.py`

## 1. The Short Version

DNS1 is a research strategy for binary prediction markets. A binary market pays
`1` if a selected outcome happens and `0` if it does not. If the market offers a
side at a price below that side's true probability, the side has positive
expected value before costs. DNS1 tries to identify those cases using historical
resolution behavior, then to size a portfolio of positions while controlling gross
exposure, category concentration, and execution-cost drag.

The latest retained version is not the older polished methodology HTML reports.
Those reports explained the strategy family well, but the current repository has
been reduced to one final package. In this package, the latest version is:

`real-sweep-v1`

The rank-one strategy is:

`broad_50_95__dte1__edge120__yes__maker_base__pos25000__dd3c`

Its current status is explicitly:

`research_selected_not_live`

That phrase matters. The sweep used real historical panel data and realized
resolutions, but it is still a strategy-ranking backtest. It is not yet a live
capital approval. The next improvement is not simply to deploy it. The next
improvement is to harden it with point-in-time market-universe reconstruction,
historical order-book depth, queue and fill modeling, authenticated execution
truth, out-of-sample category holdouts, and stricter market-type controls.

## 2. What The Premise Of The Problem Is

### 2.1 Binary markets are priced probabilities, but not perfect probabilities

In a binary prediction market, there are two complementary claims:

```text
YES pays 1 if the event resolves true, otherwise 0.
NO  pays 1 if the event resolves false, otherwise 0.
```

If the YES price is `0.62`, the market is roughly saying that YES has a 62%
implied probability. If the true probability is 72%, buying YES has a gross edge
of roughly 10 cents. If the true probability is 52%, buying YES is overpriced.

The core edge equation is:

```text
gross_edge = estimated_resolution_probability - executable_side_price
```

After trading frictions:

```text
edge_after_cost = gross_edge - fees - slippage + rebates
```
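As a sanity check, the two formulas above can be combined in a minimal Python sketch. The function name and keyword arguments are illustrative; this is not the packaged engine's API.

```python
def edge_after_cost(est_prob, side_price, fee_bps=0.0, slippage_bps=0.0, rebate_bps=0.0):
    # gross edge: estimated resolution probability minus executable side price
    gross_edge = est_prob - side_price
    # frictions expressed in basis points of notional
    cost = (fee_bps + slippage_bps - rebate_bps) / 10_000
    return gross_edge - cost

# The 62c YES example above with a 72% estimate, 25 bps fee, 30 bps slippage:
# 0.10 gross edge minus 0.0055 of cost leaves roughly 0.0945 per share.
net = edge_after_cost(0.72, 0.62, fee_bps=25, slippage_bps=30)
```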

DNS1 starts from the observation that traded prices are not always perfectly
calibrated. A 50 cent side does not always resolve 50% of the time in every
category, at every time to expiry, under every liquidity condition. A 95 cent
side does not always resolve 95% of the time either. Different market families
can have systematically different resolution behavior.

### 2.2 The real problem is not just predicting a probability

The problem is broader than "find a side with a positive expected value."

The system has to answer all of these questions:

1. Is the apparent edge real, or is it leakage, stale data, or a weak historical
   bucket?
2. Does the edge survive fees and slippage?
3. Is there enough capacity to deploy meaningful notional?
4. Are signals concentrated in one category or one hidden market type?
5. Does the strategy depend on end-of-day snapshots that would not have been
   tradable live?
6. Does the strategy overfit one historical period?
7. Does it survive toxic market types like live sports, esports, exact-score
   contracts, rapidly updating election maps, or markets where public news
   arrives faster than the panel captures?
8. Does it have a clean, deterministic implementation that can be rerun without
   manual judgment?

DNS1 is therefore a combined problem in:

- probability estimation
- data hygiene
- execution modeling
- portfolio construction
- validation
- deployment discipline

### 2.3 Why a naive version fails

A naive strategy would say:

```text
If historical YES probability is above market YES price, buy YES.
If historical NO probability is above market NO price, buy NO.
```

That is not enough.

It fails because:

- historical buckets can be sparse
- market categories can be too broad
- same-day markets can incorporate news after the quote timestamp
- high win rate can hide rare but severe losses
- apparent capacity can vanish once order-book depth is considered
- using future resolved markets in the training set creates lookahead bias
- using final market lists instead of point-in-time market lists creates
  survivorship or discovery bias
- using top-of-book prices without depth exaggerates executable edge
- selecting the best parameter set from a large grid can overfit the tested
  year

The premise of the current package is to test the strategy family more
rigorously than a one-off report by sweeping a full grid of strategy rules over
the same real one-year walk-forward period.

## 3. What The Solution Is

### 3.1 The solution in one sentence

The latest retained solution is a purged walk-forward empirical carry sweep over
real Polymarket panel rows, where each candidate side is scored against an
empirical resolution surface, filtered for edge after modeled costs, sized into
a constrained portfolio, and compared against 1,439 other parameter
combinations in one shared one-year evaluation window.

### 3.2 What "empirical carry" means here

"Carry" means the strategy buys a binary side when the expected resolution value
is above the entry price after cost. If the side is bought for `p` and resolves
to `1`, the gross payoff per share is `1 - p`. If it resolves to `0`, the gross
loss per share is `p`.

The expected value per share is:

```text
expected_value = empirical_probability * 1 + (1 - empirical_probability) * 0
               = empirical_probability
edge = empirical_probability - side_price
```

DNS1 estimates `empirical_probability` from prior resolved markets that share
the same side, side-price bucket, days-to-expiry bucket, and, when enough
evidence exists, main category.

### 3.3 What makes this solution better than a static report

This package improves the research process in five ways:

1. Every strategy is tested on the same evaluation period.
2. The evaluation period is walk-forward, not one global in-sample fit.
3. Training rows are purged and embargoed so resolved future information is not
   casually mixed into the candidate day.
4. The full parameter grid is preserved in one CSV, not only the best row.
5. The retained package includes data, code, methodology docs, and a static
   report surface in one folder.

### 3.4 What the solution is not

The solution is not a live exchange order router.

The solution is not authenticated fill replay.

The solution is not proof that the reported dollar PnL can be captured in live
capital.

The solution is a high-signal research selection step. It identifies which rule
family performed best in the real panel under explicit cost assumptions. It
also gives a reproducible way to rerun and challenge that result.

## 4. The Data Used

The retained package includes the real empirical panel at:

`data/empirical_resolution_panel.parquet`

Panel summary from `panel_manifest.json`:

- rows: 3,093,957
- quote-date start: 2024-09-19
- quote-date end: 2026-04-17
- unique markets: 478,788
- unique token IDs: 452,265

Backtest summary from `fold_manifest.json` and
`best_strategy_manifest.json`:

- walk-forward period start: 2025-06-17T00:00:00+00:00
- walk-forward period end: 2026-04-17T00:00:00+00:00
- backtest-days parameter: 365
- train-lookback-days parameter: 365
- fold size: 30 days
- purge: 30 days
- embargo: 7 days
- walk-forward folds: 11
- max signals per day in the completed run: 500
- strategies tested: 1,440

Important nuance: the nominal `backtest_days` parameter is 365. Because the
available panel ends on 2026-04-17 and the earliest usable fold in this run
starts on 2025-06-17, the realized fold period shown in the retained manifests
is 2025-06-17 through 2026-04-17.

## 5. The Latest Version

### 5.1 Latest retained version name

The latest retained strategy family is:

```text
family: real_panel_grid
version: real-sweep-v1
promotion_state: research_selected_not_live
```

This is the latest version in the current retained package. Previous DNS1
artifacts in the broader workspace explained a live paper-trading stack using
structural interpolation, measured slippage, and MILP allocation. Those older
files are no longer present after the cleanup. The current package should
therefore be treated as the canonical retained object for this run.

### 5.2 Rank-one strategy

The rank-one strategy is:

`broad_50_95__dte1__edge120__yes__maker_base__pos25000__dd3c`

Human translation:

- selected-side price band: 50c to 95c
- max days to expiry: 1 day
- side allowed: YES only
- minimum edge after modeled cost: 12c
- historical evidence threshold: at least 50 observations
- surface: category-aware when category count is sufficient, otherwise overall
- execution profile: maker-base
- max position: $25,000
- target gross: $1,000,000
- max gross: $1,250,000
- max category fraction: 35%
- dynamic sizing: enabled
- exit policy: 3c drawdown-stop proxy

The hold-to-resolution row ranked second with identical metrics. That means the
3c drawdown-stop did not actually change the realized best-strategy path in this
daily panel. It should not be treated as proven downside protection.

### 5.3 Rank-one metrics

From `best_strategy/metrics.json`:

- total PnL: $145,737,193.53
- average daily deployed: $719,528.06
- deployment p05: $437,500.00
- deployment p50: $732,110.47
- deployment p95: $1,000,000.00
- days above $500k deployed: 66.89%
- positive day rate: 99.34%
- annualized Sharpe: 56.29
- max drawdown: -$71,752.26
- max drawdown as fraction of target gross: -7.18%
- trade count: 9,197
- win rate: 86.29%
- total entry notional: $219,456,058.24

These numbers are large enough that they must be treated with skepticism until
the improvement checklist is completed. The right conclusion is not "deploy
immediately." The right conclusion is "this rule is the strongest candidate for
the next audit layer."

### 5.4 What the selected row is actually trading

The selected row is mostly a same-day or one-day YES strategy around 50 cents.
That is important because it is different from a narrow high-price carry thesis.

From the selected trades:

- trade count: 9,197
- signal rows retained before portfolio sizing: 118,885
- entry price 0th percentile: 0.500
- entry price 25th percentile: 0.500
- entry price median: 0.500
- entry price 75th percentile: 0.510
- entry price 95th percentile: 0.685
- entry price max: 0.825
- days-to-expiry 0th percentile: 0
- days-to-expiry median: 0
- days-to-expiry 75th percentile: 1
- days-to-expiry max: 1

The main contribution by category:

| category | trades | PnL | notional |
| --- | ---: | ---: | ---: |
| Crypto-Markets | 5,207 | $112.72M | $126.54M |
| Other-Miscellaneous | 2,764 | $26.75M | $65.06M |
| Economics-Markets | 418 | $1.76M | $9.40M |
| Geopolitics-Conflict | 181 | $1.22M | $4.23M |
| Climate-Environment | 159 | $1.19M | $3.34M |
| Technology-Companies | 167 | $0.80M | $3.87M |

The surface source distribution:

- category surface: 8,368 trades
- overall fallback surface: 829 trades

Exit reasons:

- settlement: 9,175 trades
- forced end of backtest: 22 trades

That profile says the strategy's apparent edge is concentrated in terminal,
near-50c, YES-side opportunities, especially crypto and miscellaneous market
families. Any future deployment review should begin there.

## 6. What The Algorithm Is

This section describes the actual retained algorithm in implementation order.

### 6.1 Load the panel

Implemented in:

`strategy_lab.py::load_panel`

The loader reads the empirical parquet panel and keeps the fields needed for
strategy evaluation:

- token ID
- quote timestamp
- quote date
- YES mark
- market ID
- title
- end date
- numeric resolution
- main category
- subcategory
- volume
- days to expiry

It normalizes quote dates and end dates to UTC dates, coerces numeric fields,
drops rows missing required fields, restricts the date range needed for training
and evaluation, and keeps valid binary prices between `0.000001` and
`0.999999`.

### 6.2 Convert markets into side rows

Implemented in:

`strategy_lab.py::build_side_panel`

Each market row becomes two possible side rows.

YES row:

```text
side = YES
side_price = yes_mark
side_resolution = resolution_value_numeric
```

NO row:

```text
side = NO
side_price = 1 - yes_mark
side_resolution = 1 - resolution_value_numeric
```

This is important because the strategy should evaluate the economic side, not
only the market's displayed YES probability. A market where YES is overpriced
could still produce a NO-side opportunity.

### 6.3 Bucket price and days to expiry

The engine assigns each side row to:

- a side-price bucket
- a days-to-expiry bucket

Price buckets:

```text
0-1%, 1-3%, 3-5%, 5-10%, 10-20%, 20-35%, 35-50%,
50-65%, 65-80%, 80-90%, 90-95%, 95-97%, 97-99%, 99-100%
```

DTE buckets:

```text
0-1d, 1-3d, 3-7d, 7-14d, 14-30d, 30-60d,
60-120d, 120-365d, 365d+
```

These buckets define the empirical surface.
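A minimal sketch of the bucket assignment, assuming right-closed boundaries; the engine's exact labels and edge handling may differ.

```python
import bisect

# Bucket edges mirror the lists above; the label strings are illustrative names.
PRICE_EDGES = [0.01, 0.03, 0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95, 0.97, 0.99]
PRICE_LABELS = ["0-1%", "1-3%", "3-5%", "5-10%", "10-20%", "20-35%", "35-50%",
                "50-65%", "65-80%", "80-90%", "90-95%", "95-97%", "97-99%", "99-100%"]
DTE_EDGES = [1, 3, 7, 14, 30, 60, 120, 365]
DTE_LABELS = ["0-1d", "1-3d", "3-7d", "7-14d", "14-30d", "30-60d",
              "60-120d", "120-365d", "365d+"]

def price_bucket(side_price: float) -> str:
    # bisect_right places a price just above an edge into the next bucket
    return PRICE_LABELS[bisect.bisect_right(PRICE_EDGES, side_price)]

def dte_bucket(days_to_expiry: float) -> str:
    return DTE_LABELS[bisect.bisect_right(DTE_EDGES, days_to_expiry)]
```

A 50.5c side lands in the `50-65%` bucket and a same-day market lands in `0-1d`, which are exactly the cells the selected strategy trades most.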

### 6.4 Build purged walk-forward folds

Implemented in:

`strategy_lab.py::purged_walk_forward_splits`

The evaluation advances in folds. For each fold:

1. The test window is the current fold.
2. The training window is before the test window.
3. A 7-day embargo keeps the training end away from the test start.
4. A 30-day purge requires training rows to have `end_date` before the test
   fold starts minus the purge period.

The effective training condition is:

```text
quote_date in [train_start, train_end)
and end_date < fold_start - purge_days
```

The purpose is to reduce lookahead. A market quoted before the test day but
resolved after or too near the test day should not be casually used as resolved
training evidence.
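The training condition can be restated as a small predicate. The signature and the way the lookback window is anchored to the fold start are assumptions, not the exact `purged_walk_forward_splits` code.

```python
from datetime import date, timedelta

def purged_train_mask(quote_date, end_date, fold_start, lookback_days=365,
                      purge_days=30, embargo_days=7):
    # Training window ends an embargo before the fold and looks back a year.
    train_end = fold_start - timedelta(days=embargo_days)
    train_start = train_end - timedelta(days=lookback_days)
    in_window = train_start <= quote_date < train_end
    # Purge: the market must have resolved well before the fold starts.
    resolved_early = end_date < fold_start - timedelta(days=purge_days)
    return in_window and resolved_early

# A market quoted in January that resolved in February is usable for a
# 2025-06-17 fold; one that resolved only a week before the fold is purged.
```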

### 6.5 Build empirical resolution surfaces

Implemented in:

`strategy_lab.py::build_surface`

For every fold, the engine groups the training side rows by:

```text
side
side_price_bucket
dte_bucket
```

It computes:

```text
empirical_count = number of observations
empirical_side_probability = mean(side_resolution)
avg_side_price = mean(side_price)
```

If category surfaces are enabled, it also computes the same statistics grouped
by:

```text
main_category
side
side_price_bucket
dte_bucket
```

The category surface is preferred only when it has enough observations. The
selected best strategy requires at least 50 observations. If the category cell
has fewer than 50, the algorithm falls back to the overall surface for that
side, price bucket, and DTE bucket.

This category fallback is one of the most important design decisions. It tries
to use category-specific calibration when it is statistically supported, while
avoiding extreme estimates from tiny category samples.
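A toy version of the surface aggregation, using plain Python instead of the engine's dataframe pipeline; the row tuple shape is an assumption.

```python
from collections import defaultdict

# Toy side rows: (side, price_bucket, dte_bucket, side_price, side_resolution)
rows = [
    ("YES", "50-65%", "0-1d", 0.50, 1),
    ("YES", "50-65%", "0-1d", 0.51, 1),
    ("YES", "50-65%", "0-1d", 0.50, 0),
    ("YES", "50-65%", "0-1d", 0.52, 1),
    ("YES", "50-65%", "0-1d", 0.50, 1),
    ("YES", "50-65%", "0-1d", 0.51, 1),
]

def build_surface(rows):
    # Group observations by the surface key, then compute the cell statistics.
    cells = defaultdict(list)
    for side, pb, db, price, res in rows:
        cells[(side, pb, db)].append((price, res))
    surface = {}
    for key, obs in cells.items():
        n = len(obs)
        surface[key] = {
            "empirical_count": n,
            "empirical_side_probability": sum(r for _, r in obs) / n,
            "avg_side_price": sum(p for p, _ in obs) / n,
        }
    return surface
```

With six observations and five resolutions to `1`, the cell's empirical probability is 5/6; with a 50-count threshold this cell would still fall back to the overall surface.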

### 6.6 Score candidate side rows

Implemented in:

`strategy_lab.py::score_candidates`

For each test candidate, the algorithm joins the applicable empirical surface
and computes:

```text
gross_edge = empirical_side_probability - side_price
cost_edge = (entry_slippage_bps + fee_bps - maker_rebate_bps) / 10000
predicted_edge_after_cost = gross_edge - cost_edge
```

For the selected best strategy:

```text
entry_slippage_bps = 30
fee_bps = 25
maker_rebate_bps = 0
cost_edge = 0.0055
```

So the minimum required gross edge under these cost assumptions is:

```text
0.1200 + 0.0055 = 0.1255
```

The candidate must clear 12c edge after cost, not merely before cost.

### 6.7 Filter by strategy parameters

Implemented in:

`run_real_strategy_sweep.py::filter_signals`

For the best strategy, the candidate must satisfy:

```text
predicted_edge_after_cost >= 0.12
0.50 <= side_price <= 0.95
0 <= days_to_expiry <= 1
empirical_count >= 50
side == YES
cost_edge / gross_edge <= 0.35
```

The last condition is the edge-drag limit. It prevents costs from consuming too
much of the gross edge. For the best strategy the cost edge cannot be more than
35% of the gross edge.
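The filter can be read as one predicate. The dictionary field names are assumptions about the candidate row shape, not `filter_signals` itself.

```python
def passes_best_strategy_filters(row):
    # Conditions from the block above, for the selected best strategy.
    gross_edge = row["empirical_side_probability"] - row["side_price"]
    cost_edge = row["cost_edge"]
    edge_after_cost = gross_edge - cost_edge
    return (
        edge_after_cost >= 0.12
        and 0.50 <= row["side_price"] <= 0.95
        and 0 <= row["days_to_expiry"] <= 1
        and row["empirical_count"] >= 50
        and row["side"] == "YES"
        and gross_edge > 0
        # edge-drag limit: costs may consume at most 35% of the gross edge
        and cost_edge / gross_edge <= 0.35
    )
```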

### 6.8 Rank signals within each day

The score used for daily ranking is:

```text
score = predicted_edge_after_cost * log1p(volume)
```

If that score is non-positive, it falls back to:

```text
score = predicted_edge_after_cost
```

The completed run capped each day to the top 500 eligible signals. That cap was
used for tractability and to avoid letting enormous same-day candidate lists
dominate runtime.
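A direct sketch of the ranking score with its fallback:

```python
import math

def signal_score(edge_after_cost, volume):
    # Edge scaled by log1p(volume); fall back to raw edge when non-positive.
    score = edge_after_cost * math.log1p(volume)
    return score if score > 0 else edge_after_cost
```

A zero-volume candidate keeps its raw edge instead of scoring zero, so thin markets are down-ranked rather than silently dropped.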

### 6.9 Size positions

Implemented in:

`RuleBasedCarryStrategy.size_positions`

The best strategy uses dynamic sizing. The raw position target is:

```text
edge_weight = clip(predicted_edge_after_cost / edge_scale, 0.2, 1.0)
target_notional = max_position_usdc * edge_weight
```

For the selected strategy:

```text
edge_scale = 0.20
max_position_usdc = 25000
min_position_usdc = 1000
```

Then the target notional is clipped between $1,000 and $25,000 and constrained
by remaining portfolio capacity.
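The sizing rule, restated as a small function with the selected strategy's parameters; the function name is illustrative.

```python
def target_notional(edge_after_cost, edge_scale=0.20,
                    max_position=25_000, min_position=1_000):
    # Edge weight clipped to [0.2, 1.0], then notional clipped to the band.
    edge_weight = min(max(edge_after_cost / edge_scale, 0.2), 1.0)
    notional = max_position * edge_weight
    return min(max(notional, min_position), max_position)
```

A 12c edge maps to a 0.6 weight and a $15,000 target; anything at or above 20c edge saturates at the $25,000 cap.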

### 6.10 Enforce portfolio constraints

Implemented in:

`strategy_lab.py::run_strategy_on_signals`

The portfolio rules include:

- target gross exposure: $1,000,000
- maximum gross exposure: $1,250,000
- maximum category exposure: 35% of max gross, or $437,500
- no duplicate open position in the same market
- no new position once target gross is reached
- no position below $1,000

The daily loop walks through ranked signals and opens positions until the target
gross or category caps bind.
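A simplified greedy allocator illustrating how those caps interact; the field names and exact tie-breaking are assumptions, not the `run_strategy_on_signals` implementation.

```python
def allocate(ranked_signals, target_gross=1_000_000, max_gross=1_250_000,
             max_category_frac=0.35, min_position=1_000):
    # Walk ranked signals, respect gross and category caps, skip duplicates.
    category_cap = max_category_frac * max_gross
    gross = 0.0
    per_category = {}
    open_markets = set()
    accepted = []
    for sig in ranked_signals:
        if gross >= target_gross:
            break  # no new position once target gross is reached
        if sig["market_id"] in open_markets:
            continue  # no duplicate open position in the same market
        cat = sig["category"]
        room = min(max_gross - gross, category_cap - per_category.get(cat, 0.0))
        notional = min(sig["target_notional"], room)
        if notional < min_position:
            continue  # no position below the minimum size
        accepted.append((sig["market_id"], notional))
        gross += notional
        per_category[cat] = per_category.get(cat, 0.0) + notional
        open_markets.add(sig["market_id"])
    return accepted
```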

### 6.11 Enter modeled positions

For each accepted signal:

```text
entry_price = side_price
shares = notional / entry_price
entry_cost = notional * (entry_slippage_bps + fee_bps - maker_rebate_bps) / 10000
```

For the selected strategy, entry cost is:

```text
entry_cost = notional * 0.0055
```

The model uses the panel price and explicit cost assumptions. It does not prove
that the exact notional would have filled live at that price. That is a required
next audit layer.

### 6.12 Exit positions

The daily engine closes a position at settlement when:

```text
end_date <= today
```

Settlement close:

```text
exit_price = side_resolution
exit_cost = 0
pnl = shares * exit_price - notional - entry_cost
```
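The entry and settlement accounting can be checked with a worked example, using the selected strategy's 55 bps entry cost; the helper name is illustrative.

```python
def settle_position(notional, entry_price, side_resolution, cost_rate=0.0055):
    # Entry: buy shares at side price and pay modeled entry cost.
    shares = notional / entry_price
    entry_cost = notional * cost_rate
    # Settlement: each share pays the side resolution (1 or 0), exit cost is 0.
    return shares * side_resolution - notional - entry_cost

# A $10,000 YES position at 0.50 that resolves true:
# 20,000 shares pay $20,000, so PnL = 20,000 - 10,000 - 55 = $9,945.
```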

The selected row has `exit_policy = drawdown_stop`, which checks the current
side mark against the entry price and exits if the mark declines by at least 3c.
In this particular winning path, that rule did not alter the outcome relative
to hold-to-resolution. The tied hold row is therefore strong evidence that the
rank-one result is not coming from active stop logic.

### 6.13 Mark daily equity

Every day, the engine computes:

```text
realized_pnl = cumulative closed-position PnL
mtm_value = sum(open shares * current mark - entry cost)
open_cost = sum(open notional)
equity = realized_pnl + mtm_value - open_cost
daily_pnl = equity.diff()
daily_return_on_target = daily_pnl / target_gross_usdc
```

This daily series feeds Sharpe, drawdown, positive-day rate, and deployment
metrics.

### 6.14 Compute metrics

Implemented in:

`strategy_lab.py::compute_metrics`

The engine computes:

- total PnL
- average daily deployed notional
- deployed-notional quantiles
- days above $500k deployed
- positive day rate
- annualized Sharpe
- maximum drawdown
- drawdown as fraction of target gross
- trade count
- win rate
- total entry notional
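Two of those metrics, annualized Sharpe and maximum drawdown, can be sketched from a daily PnL series. The sample-standard-deviation and 365-period conventions here are assumptions; `compute_metrics` may use different ones.

```python
import math

def annualized_sharpe(daily_pnl, target_gross=1_000_000, periods=365):
    # Daily return on target gross, annualized by sqrt(periods).
    rets = [p / target_gross for p in daily_pnl]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    std = math.sqrt(var)
    return mean / std * math.sqrt(periods) if std > 0 else float("nan")

def max_drawdown(daily_pnl):
    # Most negative gap between the equity curve and its running peak.
    equity = peak = dd = 0.0
    for p in daily_pnl:
        equity += p
        peak = max(peak, equity)
        dd = min(dd, equity - peak)
    return dd
```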

### 6.15 Rank strategies by objective score

Implemented in:

`strategy_lab.py::objective`

The objective is:

```text
pnl_score = total_pnl / target_daily_notional
deploy_ratio = avg_daily_deployed / target_daily_notional
deploy_penalty = abs(deploy_ratio - 1.0)
underuse_penalty = max(0, 0.75 - deploy_ratio) * 1.5
drawdown_penalty = abs(min(max_drawdown_pct_target, 0)) * 0.75

objective =
  pnl_score
  + 0.25 * annualized_sharpe
  + 0.75 * min(deploy_ratio, 1.25)
  + 0.50 * positive_day_rate
  - deploy_penalty
  - underuse_penalty
  - drawdown_penalty
```

This objective prefers high PnL, high Sharpe, high deployment, a high
positive-day rate, and lower drawdown. It penalizes under-deployment and
deployment far from the target.
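Restated as a function (the signature is an assumption; the arithmetic is the formula above):

```python
def objective(total_pnl, avg_daily_deployed, annualized_sharpe,
              positive_day_rate, max_drawdown_pct_target,
              target_daily_notional=1_000_000):
    pnl_score = total_pnl / target_daily_notional
    deploy_ratio = avg_daily_deployed / target_daily_notional
    deploy_penalty = abs(deploy_ratio - 1.0)
    underuse_penalty = max(0.0, 0.75 - deploy_ratio) * 1.5
    drawdown_penalty = abs(min(max_drawdown_pct_target, 0.0)) * 0.75
    return (
        pnl_score
        + 0.25 * annualized_sharpe
        + 0.75 * min(deploy_ratio, 1.25)
        + 0.50 * positive_day_rate
        - deploy_penalty
        - underuse_penalty
        - drawdown_penalty
    )
```

A fully deployed strategy with $1M of PnL, a 4.0 Sharpe, a 90% positive-day rate, and a 5% drawdown scores 3.1625; halving deployment would cut the score through both the deploy and underuse penalties.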

## 7. The Strategy Sweep

The sweep grid is defined in:

`run_real_strategy_sweep.py::build_sweep_specs`

The grid is:

- 6 price bands
- 5 max-DTE limits
- 4 minimum-edge thresholds
- 3 side modes
- 2 cost profiles
- 1 max-position setting
- 2 exit policies

Total:

```text
6 * 5 * 4 * 3 * 2 * 1 * 2 = 1440 strategies
```

Price bands:

```text
broad_50_95:      0.50 to 0.95
broad_65_95:      0.65 to 0.95
core_80_95:       0.80 to 0.95
favorite_85_92:   0.85 to 0.92
tight_90_95:      0.90 to 0.95
terminal_95_999:  0.95 to 0.999
```

Max-DTE values:

```text
1, 3, 7, 14, 30 days
```

Minimum edge after cost:

```text
3c, 5c, 8c, 12c
```

Side modes:

```text
both
YES only
NO only
```

Cost profiles:

```text
maker_base:
  execution_style = maker
  entry_slippage_bps = 30
  exit_slippage_bps = 50
  fee_bps = 25
  maker_rebate_bps = 0

taker_stress:
  execution_style = taker
  entry_slippage_bps = 100
  exit_slippage_bps = 100
  fee_bps = 40
  maker_rebate_bps = 0
```

Exit policies:

```text
hold:
  hold_to_resolution

dd3c:
  drawdown_stop with 3c floor
```
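The grid size can be checked mechanically with a cartesian product over the parameter lists above (the string labels are shorthand, not the sweep's internal identifiers):

```python
from itertools import product

price_bands = ["broad_50_95", "broad_65_95", "core_80_95",
               "favorite_85_92", "tight_90_95", "terminal_95_999"]
max_dte = [1, 3, 7, 14, 30]
min_edge = [0.03, 0.05, 0.08, 0.12]
side_modes = ["both", "yes_only", "no_only"]
cost_profiles = ["maker_base", "taker_stress"]
max_positions = [25_000]
exit_policies = ["hold", "dd3c"]

# 6 * 5 * 4 * 3 * 2 * 1 * 2 strategy specifications
grid = list(product(price_bands, max_dte, min_edge, side_modes,
                    cost_profiles, max_positions, exit_policies))
```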

The result table keeps one row per strategy and is therefore the main audit
artifact for comparing alternatives.

## 8. What The Best Result Means

### 8.1 The best row is not the original narrow high-price thesis

The old DNS1 intuition often focused on near-expiry, high-price carry. The best
row in this sweep is broader:

```text
50c to 95c YES
0 to 1 day to expiry
12c edge after cost
```

The selected trade distribution shows that it mostly entered around 50c to 51c,
not around 95c. This changes the research interpretation.

The best row is closer to:

> A terminal empirical-resolution YES strategy that finds near-50c markets where
> category or overall historical outcomes imply much higher than 50% resolution
> probability.

It is not mainly:

> A strategy that harvests a few cents from near-certain 95c favorites.

### 8.2 The result is strong enough to deserve more work

The selected row's combination of:

- high deployment
- high positive-day rate
- large PnL
- high Sharpe
- low reported drawdown
- thousands of trades

makes it the clear best candidate in this particular sweep.

### 8.3 The result is also strong enough to be suspicious

The same facts that make the strategy attractive also make it dangerous to
accept too quickly.

Very large backtest PnL in prediction markets can come from:

- point-in-time universe errors
- stale quote rows
- same-day resolution leakage
- using markets after decisive information was known
- duplicate or related markets
- outcome labeling artifacts
- category taxonomy artifacts
- missing historical depth constraints
- non-executable closing marks
- overfitting a parameter grid to one panel

The correct conclusion is:

> This is the selected research candidate. It is not yet a deployable production
> strategy.

## 9. The Most Important Limitations

### 9.1 Execution is modeled, not proven

The backtest uses real panel prices and explicit cost assumptions. It does not
prove that $25,000 positions could have filled at those marks live.

Missing execution realism:

- historical level-two order books
- exact resting depth at the trade timestamp
- queue position
- partial fills
- adverse selection after order placement
- authenticated order and fill attribution
- cancel/replace behavior

### 9.2 The universe must be reconstructed point in time

The backtest must prove that every candidate market was discoverable at the
time it was scored. A final historical panel can accidentally include markets
or metadata in a way that was not available live.

Required checks:

- market was open at quote timestamp
- market was discoverable before entry
- end date was known before entry
- resolution was not known before entry
- quote timestamp preceded any decisive public outcome
- no final-state metadata leaked into the signal

### 9.3 Same-day terminal markets are fragile

The best strategy is mostly DTE 0 or DTE 1. Same-day markets are exactly where
the risk of stale quotes, late information, and resolution-timing mistakes is
highest.

This does not invalidate the result, but it changes the audit priority. The
next audit should focus on same-day quote validity and event timing before
anything else.

### 9.4 Category concentration matters

Crypto-Markets and Other-Miscellaneous dominate the selected PnL. That may be a
real edge, or it may be a taxonomy and data-quality clue.

The next validation should split results by:

- main category
- subcategory
- market title pattern
- event type
- DTE bucket
- price bucket
- volume bucket
- date bucket
- source surface: category versus overall fallback

### 9.5 Grid search can overfit

Testing 1,440 strategies is useful because it shows the parameter landscape.
But selecting the best row from 1,440 rows introduces multiple-testing risk.

The next stage must include:

- locked parameters
- forward-only replay
- category holdouts
- date holdouts
- market-type holdouts
- no manual reselection after seeing results

## 10. What The Improvement Should Be

The next improvement should be a promotion-gate version, not a looser trading
version.

Proposed name:

`real-sweep-v2-execution-audited`

The purpose of v2 is:

> Prove whether the rank-one `real-sweep-v1` edge survives point-in-time
> universe reconstruction, historical book depth, execution realism, category
> holdouts, and stricter toxic-market controls.

### 10.1 Improvement 1: Point-in-time universe reconstruction

For every selected signal, store and verify:

- signal timestamp
- source quote timestamp
- market open timestamp
- market close timestamp
- known end date at signal time
- whether the market was active and tradable
- whether the outcome was unresolved
- whether the market was discoverable through the live discovery path

The required invariant is:

```text
market_discovered_at <= signal_ts
source_quote_ts <= signal_ts
signal_ts < entry_ts
entry_ts < resolution_known_ts
```

If this cannot be proven, the trade should be excluded from the promotion
backtest.

### 10.2 Improvement 2: Historical order-book replay

Replace simple bps assumptions with historical CLOB depth where available.

For each candidate:

- reconstruct bids and asks at entry time
- compute executable VWAP for target size
- enforce minimum depth
- enforce maximum spread
- model partial fill
- model no-fill if depth is insufficient
- compute slippage from actual visible levels

The promotion edge should be:

```text
edge_after_executable_vwap =
  empirical_probability - vwap_entry_price - fees - expected_exit_cost
```

The strategy should be ranked on executable edge, not panel mark edge.
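A minimal executable-VWAP walk over visible ask levels, returning no-fill when depth is insufficient. This is a sketch of the intended v2 behavior, not existing code.

```python
def executable_vwap(asks, target_notional):
    # asks: iterable of (price, size_in_shares); walk best price first.
    remaining = target_notional
    cost = shares = 0.0
    for price, size in sorted(asks):
        level_notional = price * size
        take = min(remaining, level_notional)
        cost += take
        shares += take / price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 1e-9:
        return None  # visible depth insufficient: model as no-fill
    return cost / shares
```

Filling $7,600 against $5,000 of depth at 0.50 and more at 0.52 yields a VWAP above the top-of-book price, which is exactly the slippage a bps assumption can understate.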

### 10.3 Improvement 3: Same-day leakage audit

Because the selected strategy is concentrated in DTE 0 and DTE 1, v2 must audit
same-day markets aggressively.

Required checks:

- quote timestamp is before event outcome information
- market had not effectively resolved in public information
- title does not encode already-known state
- end date is reliable
- resolution timestamp is available when possible
- stale quotes are removed
- markets with frozen books are removed

This is probably the single most important improvement for this result.

### 10.4 Improvement 4: Toxic-market exclusion and separate modeling

Past DNS1 evidence showed that live sports/esports and other real-time
momentum-style markets can dominate losses. This final sweep's best result is
not visibly dominated by sports, but it is dominated by terminal same-day
markets. That makes market-type controls essential.

v2 should explicitly tag and either exclude or separately model:

- live sports
- esports
- exact-score markets
- in-play game props
- fast election-count markets
- markets with rapidly updating external state
- markets with ambiguous or manually resolved criteria
- markets whose title contains real-time score or time-state language

The goal is not to remove everything risky. The goal is to stop one toxic market
type from hiding inside a broad category label.

### 10.5 Improvement 5: Category and title-pattern holdouts

Do not only rerun the same global period. Validate the selected rule with
holdouts:

- train without Crypto-Markets, test Crypto-Markets
- train without Other-Miscellaneous, test Other-Miscellaneous
- train on all except top title patterns, test those patterns
- train on early months, test later months
- train on later months, test earlier months as a falsification check
- hold out whole subcategories

The question is:

```text
Does the empirical edge generalize, or is it a category-specific artifact?
```

If it is category-specific but real, that is acceptable. Then the strategy
should become category-specific rather than pretending to be broad.

### 10.6 Improvement 6: Uncertainty haircuts

The current surface uses empirical means and count thresholds. v2 should add
uncertainty-aware probabilities.

Candidate methods:

- beta-binomial shrinkage
- Wilson lower bounds
- empirical Bayes category shrinkage
- bootstrap confidence intervals
- minimum effective sample size by category and time bucket

Instead of:

```text
edge = mean_resolution_probability - price - cost
```

Use:

```text
edge = conservative_probability_lower_bound - executable_price - cost
```

This will reduce false positives in sparse or unstable cells.
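As one concrete haircut, the Wilson score lower bound replaces the raw empirical mean with a conservative estimate:

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    # Lower bound of the Wilson score interval for a binomial proportion.
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom
```

With 45 resolutions out of 50 the raw mean is 0.90 but the lower bound is roughly 0.79, so a cell at the 50-count evidence threshold loses about 11 cents of claimed edge; at 500 observations the haircut shrinks toward the mean.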

### 10.7 Improvement 7: Related-market exposure control

The current category cap is useful but coarse. Prediction markets often have
multiple related contracts on the same underlying event.

v2 should add clustering by:

- normalized title
- event slug
- market group
- underlying asset or team
- candidate answer set
- shared close time
- correlated category labels

Then enforce caps at the cluster level, not only the main-category level.

### 10.8 Improvement 8: Forward replay with locked parameters

The rank-one parameters should be frozen before the next test. No retuning after
seeing the next period.

The locked rule is:

```text
price band: 0.50 to 0.95
side: YES
max DTE: 1
min edge after cost: 0.12
min count: 50
category surface: enabled with fallback
cost profile: maker_base or executable VWAP replacement
max position: 25000
target gross: 1000000
max category fraction: 0.35
```

The next forward run should answer:

```text
Does this exact selected rule still work when it is no longer being selected
from a grid?
```

### 10.9 Improvement 9: Paper execution truth

Before any real-money deployment, the system should run the locked strategy in
paper mode and collect:

- signal timestamp
- book snapshot
- intended order size
- simulated order type
- expected fill price
- actual observed post-signal price path
- whether a real order would have crossed the book
- fill/no-fill decision under strict rules
- post-trade mark
- settlement result

The objective is to measure the gap between:

```text
research backtest PnL
paper executable PnL
realistic fill-constrained PnL
```

### 10.10 Improvement 10: Promotion decision thresholds

v2 should define hard promotion criteria before it runs.

Example promotion gates:

- no lookahead violations
- no unresolved timestamp ambiguity in top PnL contributors
- positive PnL after executable VWAP
- positive PnL after uncertainty haircuts
- no single category contributes more than an approved fraction of net PnL
  unless the strategy is explicitly category-scoped
- max drawdown remains inside predeclared budget
- minimum number of trades after all filters
- minimum deployed notional after all filters
- stable performance outside the top 20 markets
- paper execution degradation less than a predeclared threshold

The strategy should not be promoted because it has an impressive backtest. It
should be promoted only if it survives these gates.

## 11. What Should Be Done Next, In Order

### Step 1: Lock the selected v1 parameters

Do not keep searching the 1,440-row grid until there is a better validation
layer. Freeze the rank-one parameters and test those parameters only.

### Step 2: Rebuild signal-level audit tables

For every selected trade, create an audit table with:

- market ID
- token ID
- title
- category
- quote timestamp
- signal date
- entry date
- end date
- resolution timestamp if available
- side
- side price
- empirical probability
- gross edge
- cost edge
- edge after cost
- surface source
- empirical count
- position size
- exit reason
- PnL

### Step 3: Audit the largest PnL contributors manually

Take the top contributors by PnL and verify:

- the quote was real
- the market was open
- the market was not already effectively resolved
- the outcome label is correct
- the category label is correct
- similar markets did not create duplicated exposure

If the top contributors fail, fix the data and rerun before building more
infrastructure.

### Step 4: Add book-depth replay

Replace the static cost profile with executable VWAP and depth filters.

### Step 5: Add uncertainty haircuts

Replace raw empirical mean with conservative probability estimates.

### Step 6: Add market-type exclusions and separate models

Split terminal crypto, miscellaneous one-off events, sports/esports, politics,
macro, and real-time event markets. Do not force one surface to govern all of
them.

### Step 7: Run locked out-of-sample replay

Run the locked v1 rule on fresh data and preserve the result whether it passes
or fails.

### Step 8: Decide promotion state

Only after v2 passes should the promotion state move from:

```text
research_selected_not_live
```

to something stronger, such as:

```text
paper_execution_candidate
```

Real-money deployment would require another gate after paper execution.

## 12. Final Interpretation

The premise is that binary market prices are imperfect probabilities and that
historical resolution behavior can identify repeatable mispricings.

The solution is a real-data, purged walk-forward empirical surface strategy that
selects sides with sufficient edge after modeled cost and sizes them under
portfolio constraints.

The algorithm converts markets into YES and NO side rows, buckets them by price
and DTE, builds historical empirical resolution surfaces, scores current
candidates, filters for edge and evidence, ranks daily opportunities, sizes
positions dynamically, applies portfolio caps, closes at settlement or modeled
stop, computes daily equity, and ranks parameter combinations by a composite
objective.

The latest retained version is `real-sweep-v1`, whose selected research row is
`broad_50_95__dte1__edge120__yes__maker_base__pos25000__dd3c`.

The improvement should be `real-sweep-v2-execution-audited`: a locked-parameter
promotion gate that proves the result survives point-in-time universe
reconstruction, same-day leakage checks, historical order-book depth,
uncertainty haircuts, category and market-type holdouts, related-market exposure
controls, and paper execution truth.

The best current answer is therefore:

> DNS1 has a promising selected rule, but the selected rule is still research.
> The next engineering task is not to make the backtest look better. The next
> engineering task is to make the backtest harder to pass.
