time-path-fragility

Data Schema Specification

Purpose

This document defines the canonical data schema for the Pathspace Lab risk-model portfolio / perturbation framework. It is intentionally project-agnostic and designed to support:

The schema is expressed as an xarray.Dataset and is meant to be stable even as modeling choices evolve.


Design Principles

  1. Daily is canonical. All raw data is stored at daily frequency. Weekly/monthly are derived views.

  2. Returns-first. Simple returns are stored as raw primitives. Log returns are derived.

  3. Separation of concerns.

    • Raw data ≠ factor estimation ≠ optimization
    • Estimation choices (Huber, OLS, etc.) are modular

  4. Bootstrap-friendly. Schema must support resampling along the time dimension without structural mutation.

  5. Lazy + parallel ready. Compatible with dask-backed xarray datasets.
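As a sketch of principles 1 and 2, weekly simple returns and daily log returns can both be derived on demand from the stored daily simple returns, leaving the raw data untouched. The toy dataset, the asset labels, and the use of pandas' "W" frequency (weeks ending on Sunday) are assumptions made for illustration; only the variable name returns comes from this spec.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy daily dataset: 10 business days, 2 assets, constant 1% daily return.
time = pd.date_range("2024-01-01", periods=10, freq="B")
ds = xr.Dataset(
    {"returns": (("time", "asset"), np.full((10, 2), 0.01))},
    coords={"time": time, "asset": ["AAA", "BBB"]},  # hypothetical labels
)

# Derived view 1: weekly simple returns, compounded from daily simple returns.
weekly_returns = (1.0 + ds["returns"]).resample(time="W").prod() - 1.0

# Derived view 2: daily log returns, log(1 + r).
log_returns = np.log1p(ds["returns"])
```

Neither derived view is written back into ds, so the stored dataset stays daily and returns-first.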


Core Dimensions

All datasets in Pathspace Lab conform to a small set of canonical dimensions. These dimensions are designed to be stable under model extension (e.g. adding new factors, perturbation paths, or simulation horizons).

time

asset

factor

factor-level metadata (coordinates on factor)

path (optional)

This structure allows new factor layers to be introduced without changing dataset shape or downstream logic.
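A minimal sketch of that claim, assuming factor-level metadata is stored as a non-index coordinate on factor. The coordinate name factor_level and the factor labels are hypothetical; this spec does not fix them.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2024-01-02", periods=5, freq="B")

factor_ds = xr.Dataset(
    {"factor_returns": (("time", "factor"), np.zeros((5, 2)))},
    coords={
        "time": time,
        "factor": ["MKT", "TECH"],  # hypothetical factor labels
        # Factor-level metadata: a coordinate attached to the factor dimension.
        "factor_level": ("factor", ["market", "sector"]),  # hypothetical name
    },
)

# A new factor layer has the same dimensions; only the factor axis grows.
new_layer = xr.Dataset(
    {"factor_returns": (("time", "factor"), np.zeros((5, 1)))},
    coords={
        "time": time,
        "factor": ["ENERGY"],
        "factor_level": ("factor", ["sector"]),
    },
)
extended = xr.concat([factor_ds, new_layer], dim="factor")

# Downstream logic selects by metadata rather than by position or shape.
sector_returns = extended["factor_returns"].where(
    extended["factor_level"] == "sector", drop=True
)
```

The extended dataset keeps the same dimension names as the original, so code written against (time, factor) continues to work.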


Required Variables

Asset-Level Returns

returns(time, asset)

Factor Returns

factor_returns(time, factor)

Canonical factors (initial):

Notes:


Residual Returns (Optional / Derived)

residual_returns(time, asset)

Market Capitalization

market_cap(time, asset)

Trading Volume / Dollar Volume

volume(time, asset)
dollar_volume(time, asset)

Volatility Estimates (Derived)

volatility(time, asset)
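One possible way to derive volatility(time, asset) from returns is a rolling standard deviation along time. This is only a sketch: the spec does not name an estimator, and the 21-day window, the random toy data, and the asset labels are assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
time = pd.date_range("2024-01-02", periods=60, freq="B")
returns = xr.DataArray(
    rng.normal(0.0, 0.01, size=(60, 2)),
    dims=("time", "asset"),
    coords={"time": time, "asset": ["AAA", "BBB"]},  # hypothetical labels
    name="returns",
)

window = 21  # hypothetical window, roughly one trading month

# Rolling standard deviation along time. The first window - 1 rows are NaN
# because a full window of observations is not yet available.
volatility = returns.rolling(time=window).std().rename("volatility")
```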

Attributes (Dataset Metadata)

Stored in ds.attrs:

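ds.attrs = {
    "data_source": "bwmacro | yfinance | mixed",  # one of the listed values
    "universe": ["AAPL", "MSFT", ...],            # asset identifiers in the dataset
    "frequency": "D",                             # daily, per the design principles
    "calendar": "NYSE",                           # trading calendar
    "created_at": "ISO-8601",                     # creation timestamp as an ISO-8601 string
}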

These attributes are critical for reproducibility and auditability.
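A small check along these lines could guard against datasets that lack this metadata. The helper name missing_attrs is hypothetical, and treating all five keys as required is an assumption rather than something this spec states.

```python
import xarray as xr

# Keys taken from the ds.attrs listing; treating them all as required is an assumption.
REQUIRED_ATTRS = ("data_source", "universe", "frequency", "calendar", "created_at")


def missing_attrs(ds: xr.Dataset) -> list[str]:
    """Return the required attribute keys that are absent from ds.attrs."""
    return [key for key in REQUIRED_ATTRS if key not in ds.attrs]


ds = xr.Dataset(attrs={"data_source": "yfinance", "frequency": "D"})
print(missing_attrs(ds))  # ['universe', 'calendar', 'created_at']
```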


Bootstrap & Monte Carlo Compatibility

Key requirement:

All stochastic perturbations operate by resampling along time only.

Implications:

Recommended approach:

  1. Sample factor return paths
  2. Sample residual paths conditional on factor realization
  3. Reconstruct asset returns

The schema supports this without structural mutation.
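The three steps can be sketched as an i.i.d. bootstrap over time. The function name bootstrap_returns and the toy inputs are hypothetical. Sharing one draw of time indices between factor returns and residuals is a simple stand-in for "conditional on factor realization"; a richer conditional model would replace that step.

```python
import numpy as np
import pandas as pd
import xarray as xr


def bootstrap_returns(factor_returns, beta, residuals, n_paths, seed=0):
    """i.i.d. bootstrap over time; returns simulated returns(path, time, asset)."""
    rng = np.random.default_rng(seed)
    time = factor_returns["time"].values
    n_time = time.size
    paths = []
    for _ in range(n_paths):
        # One draw of time indices with replacement, shared by all inputs so
        # each residual stays paired with the factor realization of its date.
        idx = rng.integers(0, n_time, size=n_time)
        # Re-label with the original time axis so the structure is unchanged.
        f = factor_returns.isel(time=idx).assign_coords(time=time)  # step 1
        e = residuals.isel(time=idx).assign_coords(time=time)       # step 2
        b = beta.isel(time=idx).assign_coords(time=time)
        paths.append((b * f).sum("factor") + e)                     # step 3
    sim = xr.concat(paths, dim="path").assign_coords(path=np.arange(n_paths))
    return sim.transpose("path", "time", "asset").rename("returns")


# Toy inputs with the shapes used in this spec: (T x F), (T x A x F), (T x A).
rng = np.random.default_rng(1)
T, A, F = 30, 3, 2
time = pd.date_range("2024-01-02", periods=T, freq="B")
assets, factors = ["AAA", "BBB", "CCC"], ["F1", "F2"]  # hypothetical labels
factor_returns = xr.DataArray(
    rng.normal(0, 0.01, (T, F)), dims=("time", "factor"),
    coords={"time": time, "factor": factors},
)
beta = xr.DataArray(
    rng.normal(1, 0.2, (T, A, F)), dims=("time", "asset", "factor"),
    coords={"time": time, "asset": assets, "factor": factors},
)
residuals = xr.DataArray(
    rng.normal(0, 0.005, (T, A)), dims=("time", "asset"),
    coords={"time": time, "asset": assets},
)

sim = bootstrap_returns(factor_returns, beta, residuals, n_paths=4)
```

The inputs are never modified: each path is a new array indexed by the same time and asset coordinates, with path added as an extra dimension.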


Parallelization Considerations

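chunks = {
    "time": 252,   # about one trading year of daily rows per chunk
    "asset": -1,   # -1 keeps the whole asset dimension in a single chunk
}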
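A sketch of how a chunk specification like this might be applied, assuming dask is installed alongside xarray. The toy dataset and its labels are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

T, A = 600, 3
ds = xr.Dataset(
    {"returns": (("time", "asset"), np.zeros((T, A)))},
    coords={
        "time": pd.date_range("2020-01-01", periods=T, freq="B"),
        "asset": ["AAA", "BBB", "CCC"],  # hypothetical labels
    },
)

chunks = {"time": 252, "asset": -1}
lazy = ds.chunk(chunks)  # converts variables to dask arrays (requires dask)

# Reductions on the chunked dataset stay lazy until .compute() is called.
mean_return = lazy["returns"].mean("time")
result = mean_return.compute()
```

Chunking along time while keeping asset whole suits per-date cross-sectional work; operations that need long contiguous time windows may prefer a different layout.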

Relationship to Other Specs

This schema underpins:

It should change rarely and only with explicit versioning.


Open Design Decisions (Explicit)

These are deferred.


Example Dataset Structures

Canonical Dataset

Dimensions:

Coordinates:

Data variables:

Shape returns: (T × A)

Factor Returns Dataset

Dimensions:

Coordinates:

Data variables:

Shape factor_returns: (T × F)

Factor Loadings Dataset

Dimensions:

Data variables:

Shape beta: (T × A × F); residuals: (T × A)
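As an illustration of how a dataset with these shapes might be produced, the sketch below fits a full-sample OLS without intercept and broadcasts the resulting loadings along time. OLS is only one of the modular estimation choices; the toy data, labels, and the decision to use a single static fit are all assumptions. A rolling or robust estimator would give loadings that vary by date.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
T, A, F = 50, 3, 2
time = pd.date_range("2024-01-02", periods=T, freq="B")
factor_returns = xr.DataArray(
    rng.normal(0, 0.01, (T, F)), dims=("time", "factor"),
    coords={"time": time, "factor": ["F1", "F2"]},  # hypothetical labels
)
true_beta = np.array([[1.0, 0.5], [0.8, -0.2], [1.2, 0.0]])  # (A, F), toy values
returns = xr.DataArray(
    factor_returns.values @ true_beta.T + rng.normal(0, 1e-4, (T, A)),
    dims=("time", "asset"),
    coords={"time": time, "asset": ["AAA", "BBB", "CCC"]},
)

# Full-sample OLS without intercept: returns ≈ factor_returns @ coef, coef is (F, A).
coef, *_ = np.linalg.lstsq(factor_returns.values, returns.values, rcond=None)
beta_static = xr.DataArray(
    coef.T, dims=("asset", "factor"),
    coords={"asset": returns["asset"], "factor": factor_returns["factor"]},
)

# Repeat the static loadings on every date to obtain the (T x A x F) shape.
beta = beta_static.expand_dims(time=returns["time"].values).transpose(
    "time", "asset", "factor"
)

# Residuals are what the factor model leaves unexplained, kept as data.
residuals = returns - (beta * factor_returns).sum("factor")

loadings = xr.Dataset({"beta": beta, "residuals": residuals})
```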

Simulated / Bootstrapped Dataset

Dimensions:

Data variables:

Shape returns: (P × T × A)
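Because simulated data differs from historical data only by the extra path dimension, an analysis that reduces over time by name can run on both unchanged. The function name cumulative_return and the toy arrays are hypothetical.

```python
import numpy as np
import xarray as xr


def cumulative_return(returns: xr.DataArray) -> xr.DataArray:
    """Compound simple returns over time; other dims (asset, path) pass through."""
    return (1.0 + returns).prod("time") - 1.0


# Toy inputs: 4 dates, 2 assets, and 3 simulated paths, all with 1% returns.
historical = xr.DataArray(np.full((4, 2), 0.01), dims=("time", "asset"))
simulated = xr.DataArray(np.full((3, 4, 2), 0.01), dims=("path", "time", "asset"))

hist_total = cumulative_return(historical)  # dims: (asset,)
sim_total = cumulative_return(simulated)    # dims: (path, asset)

# Summaries across simulated histories reduce over the path dimension.
sim_median = sim_total.median("path")
```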


Rationale

This data schema is designed to make model fragility, path dependence, and uncertainty explicit rather than incidental. All risk factors—market, sector, or otherwise—are represented along a single factor dimension, with hierarchy expressed only through metadata. This avoids structural refactors as models evolve and allows factor systems to grow organically. Time is treated as a first-class dimension across all datasets, ensuring that resampling, rolling estimation, and simulation remain consistent. Residuals are explicitly preserved as objects rather than discarded as noise, enabling perturbation and alternative path generation. Finally, simulated histories are represented by introducing a path dimension rather than altering model structure, allowing Monte Carlo analysis to emerge naturally from the same data representation used for historical analysis.


Status: Draft v0.1