This document defines the canonical data schema for the Pathspace Lab risk-model portfolio / perturbation framework. It is intentionally project-agnostic and designed to support:
The schema is expressed as an xarray.Dataset and is meant to be stable even as modeling choices evolve.
Daily is canonical: all raw data is stored at daily frequency. Weekly and monthly are derived views.
Returns-first: simple returns are stored as raw primitives. Log returns are derived.
Separation of concerns
Bootstrap-friendly: the schema must support resampling along the time dimension without structural mutation.
Lazy + parallel ready: compatible with dask-backed xarray datasets.
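The "daily is canonical" and "returns-first" principles can be illustrated with a minimal sketch, assuming a `returns(time, asset)` array of daily simple returns. The dates, tickers, and values are placeholders, and compounding within calendar weeks via `resample` is one reasonable way to build a derived view, not necessarily the one this project uses.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily simple returns: 10 business days, 2 assets, 1% per day.
time = pd.bdate_range("2024-01-01", periods=10)
returns = xr.DataArray(
    np.full((10, 2), 0.01),
    dims=("time", "asset"),
    coords={"time": time, "asset": ["AAPL", "MSFT"]},
    name="returns",
)

# Log returns are derived from the stored simple returns: log(1 + r).
log_returns = np.log1p(returns)

# Weekly view: compound the daily simple returns within each calendar week.
weekly_returns = (1.0 + returns).resample(time="W").prod() - 1.0
```

Because only the daily simple returns are stored, both derived objects can be recomputed at any time and never need to be persisted.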
All datasets in Pathspace Lab conform to a small set of canonical dimensions. These dimensions are designed to be stable under model extension (e.g. adding new factors, perturbation paths, or simulation horizons).
time: datetime64[ns]
asset: ticker identifiers, e.g. AAPL, MSFT, XOM, JNJ
Factor metadata (coordinates on factor):
factor_level: categorical, e.g. market, sector, style, theme
factor_group: identifier for the concrete proxy or construction, e.g. SPY, XLK, XLE
factor_source (optional): e.g. etf, statistical, custom
This structure allows new factor layers to be introduced without changing dataset shape or downstream logic.
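As a rough illustration of factor hierarchy carried by metadata rather than by shape, the sketch below attaches `factor_level`, `factor_group`, and `factor_source` as coordinates along `factor`. The factor labels (`MKT`, `TECH`, `ENERGY`, `MOMENTUM`), the `MTUM` proxy, and the helper `make_factor_block` are hypothetical names chosen for the example, and the return values are zero placeholders.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.bdate_range("2024-01-02", periods=5)

def make_factor_block(names, levels, groups, sources):
    """Hypothetical helper: zero-filled factor_returns with metadata on `factor`."""
    return xr.DataArray(
        np.zeros((len(time), len(names))),
        dims=("time", "factor"),
        coords={
            "time": time,
            "factor": names,
            "factor_level": ("factor", levels),
            "factor_group": ("factor", groups),
            "factor_source": ("factor", sources),
        },
        name="factor_returns",
    )

factor_returns = make_factor_block(
    ["MKT", "TECH", "ENERGY"],
    ["market", "sector", "sector"],
    ["SPY", "XLK", "XLE"],
    ["etf", "etf", "etf"],
)

# Select one layer of the hierarchy purely through metadata.
is_sector = (factor_returns["factor_level"] == "sector").values
sector_factors = factor_returns.isel(factor=np.flatnonzero(is_sector))

# Add a new layer (a style factor) by extending `factor`; the dims stay the same.
style_block = make_factor_block(["MOMENTUM"], ["style"], ["MTUM"], ["etf"])
extended = xr.concat([factor_returns, style_block], dim="factor")
```

Downstream code that iterates over `factor` keeps working after the extension, because only the length of the dimension changed.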
returns(time, asset)
factor_returns(time, factor)
Canonical factors (initial):
Notes:
residual_returns(time, asset)
market_cap(time, asset)
Enables:
volume(time, asset)
dollar_volume(time, asset)
volatility(time, asset)
Stored in ds.attrs:
ds.attrs = {
    "data_source": "bwmacro | yfinance | mixed",
    "universe": ["AAPL", "MSFT", ...],
    "frequency": "D",
    "calendar": "NYSE",
    "created_at": "ISO-8601",
}
These attributes are critical for reproducibility and auditability.
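A minimal sketch of populating these attributes on a concrete dataset follows. The two-asset universe, the random returns, and the `REQUIRED_ATTRS` check are assumptions made for the example; the schema itself only lists the attribute keys.

```python
from datetime import datetime, timezone

import numpy as np
import pandas as pd
import xarray as xr

universe = ["AAPL", "MSFT"]                      # placeholder universe
time = pd.bdate_range("2024-01-02", periods=4)   # placeholder dates, not a real NYSE calendar
rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={"returns": (("time", "asset"), rng.normal(0.0, 0.01, size=(4, 2)))},
    coords={"time": time, "asset": universe},
)

ds.attrs = {
    "data_source": "yfinance",
    "universe": universe,
    "frequency": "D",
    "calendar": "NYSE",
    "created_at": datetime.now(timezone.utc).isoformat(),  # ISO-8601 timestamp
}

# Hypothetical audit check: every attribute named in the schema must be present.
REQUIRED_ATTRS = {"data_source", "universe", "frequency", "calendar", "created_at"}
missing = REQUIRED_ATTRS - set(ds.attrs)
```

A check like `missing` can run whenever a dataset is loaded, so that files lacking provenance are rejected before any modeling starts.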
Key requirement:
All stochastic perturbations operate by resampling along time only.
Implications:
Recommended approach:
Schema fully supports this without mutation.
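To make "resampling along time only" concrete, here is a minimal sketch of a plain i.i.d. bootstrap over dates. It is an assumption chosen for brevity, not the approach this document recommends; the function name `resample_time` and the `source_time` coordinate are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.bdate_range("2024-01-02", periods=6)
ds = xr.Dataset(
    {"returns": (("time", "asset"), np.arange(12, dtype=float).reshape(6, 2) / 100)},
    coords={"time": time, "asset": ["AAPL", "MSFT"]},
)

def resample_time(ds, rng):
    """Draw dates with replacement; each draw keeps the whole cross-section."""
    n = ds.sizes["time"]
    idx = rng.integers(0, n, size=n)
    boot = ds.isel(time=idx)
    # Record which historical date each row came from, then restore an
    # ordered time axis so the result has the same structure as the input.
    return boot.assign_coords(
        source_time=("time", boot["time"].values),
        time=("time", ds["time"].values),
    )

boot = resample_time(ds, np.random.default_rng(42))
```

The resampled dataset has the same dimensions, variables, and sizes as the original, so code written against the historical data can consume it unchanged.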
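chunks = {
    "time": 252,   # roughly one trading year of daily rows per chunk
    "asset": -1,   # -1 keeps all assets together in a single chunk
}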
Enables:
This schema underpins:
canonical_perturbations.md
optimizer_spec.md
It should change rarely, and only with explicit versioning.
These are deferred.
Dimensions:
Coordinates:
Data variables:
Shape returns: (T × A)
Dimensions:
Coordinates:
Data variables:
Shape factor_returns: (T × F)
Dimensions:
Data variables:
Shape beta: (T × A × F), residuals: (T × A)
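The shapes above fit a standard linear factor decomposition, in which the residual is the asset return minus the beta-weighted factor returns. The source gives only the shapes, so that decomposition is an assumption here, and the random inputs and factor labels below are placeholders.

```python
import numpy as np
import pandas as pd
import xarray as xr

T, A, F = 5, 2, 3
rng = np.random.default_rng(0)
time = pd.bdate_range("2024-01-02", periods=T)
assets = ["AAPL", "MSFT"]
factors = ["MKT", "TECH", "ENERGY"]  # hypothetical factor labels

returns = xr.DataArray(
    rng.normal(0.0, 0.01, size=(T, A)),
    dims=("time", "asset"), coords={"time": time, "asset": assets},
)
factor_returns = xr.DataArray(
    rng.normal(0.0, 0.01, size=(T, F)),
    dims=("time", "factor"), coords={"time": time, "factor": factors},
)
beta = xr.DataArray(
    rng.normal(1.0, 0.2, size=(T, A, F)),
    dims=("time", "asset", "factor"),
    coords={"time": time, "asset": assets, "factor": factors},
)

# Broadcasting is by dimension name: (T, A, F) * (T, F) -> (T, A, F).
systematic = (beta * factor_returns).sum("factor")  # (T, A)
residuals = returns - systematic                    # (T, A)
```

Keeping `residuals` as its own variable, rather than recomputing it on demand, is what lets later perturbation steps resample it directly.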
Dimensions:
Data variables:
Shape returns: (P × T × A)
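One way to arrive at the (P × T × A) shape is to stack several time-resampled copies of the historical returns along a new `path` dimension. The sketch below assumes a simple i.i.d. bootstrap as the path generator; `simulate_paths` is a hypothetical name, and any other generator that returns a `(time, asset)` array could be substituted.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.bdate_range("2024-01-02", periods=6)
returns = xr.DataArray(
    np.arange(12, dtype=float).reshape(6, 2) / 100,
    dims=("time", "asset"),
    coords={"time": time, "asset": ["AAPL", "MSFT"]},
    name="returns",
)

def simulate_paths(returns, n_paths, seed=0):
    """Stack bootstrap draws of `returns` along a new `path` dimension."""
    rng = np.random.default_rng(seed)
    n = returns.sizes["time"]
    draws = []
    for _ in range(n_paths):
        idx = rng.integers(0, n, size=n)
        draw = returns.isel(time=idx)
        # Give every path the same ordered time axis so the draws align.
        draws.append(draw.assign_coords(time=("time", returns["time"].values)))
    paths = xr.concat(draws, dim=pd.Index(range(n_paths), name="path"))
    return paths.transpose("path", "time", "asset")

paths = simulate_paths(returns, n_paths=4)
```

Selecting a single path with `paths.sel(path=0)` yields an array with the same `(time, asset)` layout as the historical data, so the same analysis code applies to both.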
This data schema is designed to make model fragility, path dependence, and uncertainty explicit rather than incidental. All risk factors—market, sector, or otherwise—are represented along a single factor dimension, with hierarchy expressed only through metadata. This avoids structural refactors as models evolve and allows factor systems to grow organically. Time is treated as a first-class dimension across all datasets, ensuring that resampling, rolling estimation, and simulation remain consistent. Residuals are explicitly preserved as objects rather than discarded as noise, enabling perturbation and alternative path generation. Finally, simulated histories are represented by introducing a path dimension rather than altering model structure, allowing Monte Carlo analysis to emerge naturally from the same data representation used for historical analysis.
Status: Draft v0.1