Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>
5.3 KiB
5.3 KiB
Data Directory
⚠️ This directory contains METADATA and REFERENCES only ⚠️
DO NOT commit large datasets, proprietary data, or sensitive information to this repository.
Purpose
This directory documents:
- Where actual datasets are stored (external locations)
- Small sample datasets for testing and examples
- Synthetic datasets for demonstrations
- Data schemas and formats
What Goes Here
✅ Allowed:
README.mdfiles describing data sources- Small sample CSVs (< 100 KB, < 1000 rows)
- Synthetic example datasets
- Data schema specifications
- References to external storage (S3, URLs, etc.)
- Metadata files (JSON, YAML)
❌ Not Allowed:
- Large datasets (> 1 MB)
- Raw production data
- Personal or sensitive information
- API keys or credentials
- Proprietary datasets
- Binary data files (
.pkl,.h5,.parquet)
Directory Structure
data/
├── README.md # This file
├── samples/ # Small sample datasets for testing
│ └── *.csv
├── examples/ # Synthetic example datasets
│ └── *.csv
└── schemas/ # Data format specifications
└── *.json
Note: /schemas at the repository root contains JSON schemas for core concepts. This data/schemas/ is for dataset format specifications.
External Data Sources
SIG Embeddings Dataset
- Location:
s3://blackroad-research/sig-embeddings/v2/ - Size: ~10 GB (compressed)
- Format: Parquet files partitioned by date
- Schema: See
/schemas/sig.schema.json - Access: Requires AWS credentials with read access to
blackroad-researchbucket - Description: Agent embeddings in spiral coordinate system with factor tree annotations
- Last Updated: 2025-11-20
QLM Interaction Traces
- Location:
s3://blackroad-research/qlm-traces/ - Size: ~5 GB
- Format: JSONL (JSON Lines)
- Schema: See
schemas/interaction-trace.json(if created) - Access: Requires credentials
- Description: Logged agent interactions for QLM analysis
- Last Updated: 2025-11-15
Agent Worldline Dataset
- Location: External database (contact research team)
- Size: ~50 GB
- Format: PostgreSQL dump
- Access: Database credentials required
- Description: Historical agent worldlines with PS-SHA∞ hashes
- Last Updated: Ongoing
Sample Data
Small samples from larger datasets are stored in samples/ for:
- Quick testing in notebooks
- Example code demonstrations
- CI/CD pipeline testing
Example:
import pandas as pd
# Load sample data (safe to commit)
df = pd.read_csv('data/samples/sig-sample-100.csv')
print(f"Sample has {len(df)} rows")
# Reference full dataset (not in repo)
# Full dataset: s3://blackroad-research/sig-embeddings/v2/
Synthetic Data
The examples/ directory contains generated synthetic data for:
- Demonstrations
- Tutorials
- Unit tests
- Public examples
These datasets are:
- Artificially created (not real data)
- Safe to share publicly
- Reproducible from code
Example:
import numpy as np
import pandas as pd
# Generate synthetic spiral data
def generate_spiral_points(n=100, noise=0.1):
"""Generate synthetic points on a spiral."""
t = np.linspace(0, 4*np.pi, n)
r = np.exp(0.1 * t)
x = r * np.cos(t) + noise * np.random.randn(n)
y = r * np.sin(t) + noise * np.random.randn(n)
return pd.DataFrame({'x': x, 'y': y, 't': t, 'r': r})
# Save synthetic data
df = generate_spiral_points(n=200)
df.to_csv('data/examples/spiral-synthetic-200.csv', index=False)
Adding External Data References
When you need to reference a new external dataset:
- Add entry to this README under "External Data Sources"
- Include:
- Location (URL, S3 path, database connection)
- Size
- Format
- Schema reference
- Access requirements
- Description
- Last updated date
- Optionally create a small sample in
samples/ - Document the schema in
/schemasif it's a new format
Privacy & Security
🔒 Never commit:
- Personal identifiable information (PII)
- API keys, tokens, passwords
- Proprietary datasets
- Production database contents
- Real user data
If you accidentally commit sensitive data:
- Immediately delete the file
- Rotate any exposed credentials
- Notify the team
- Rewrite git history if necessary (force push)
Data Usage in Code
Good Practice
# Reference external data
DATA_CONFIG = {
'source': 's3://blackroad-research/sig-embeddings/v2/',
'format': 'parquet',
'size_gb': 10,
'description': 'Agent embeddings in spiral coordinates'
}
# Load small sample for testing
sample_df = pd.read_csv('data/samples/sig-sample-100.csv')
# Generate synthetic data for demonstrations
synthetic_df = generate_test_data(n_samples=1000)
Bad Practice
# Don't do this - loading huge dataset
# df = pd.read_parquet('data/full-embeddings-10gb.parquet') # ❌
# Don't do this - committing sensitive data
# api_key = "sk-1234567890" # ❌
# df.to_csv('data/user-private-data.csv') # ❌
Related Documentation
- Research Overview
- Experiment Template
- Notebook Style Guide
- Root
/schemasdirectory for JSON Schemas