copilot-swe-agent[bot] 4ee3da3bf7 Add GitHub Copilot research brain configuration and directory structure

Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>

2025-11-24 16:35:50 +00:00

README.md

Data Directory

⚠️ This directory contains METADATA and REFERENCES only ⚠️

DO NOT commit large datasets, proprietary data, or sensitive information to this repository.

Purpose

This directory documents:

Where actual datasets are stored (external locations)
Small sample datasets for testing and examples
Synthetic datasets for demonstrations
Data schemas and formats

What Goes Here

✅ Allowed:

README.md files describing data sources
Small sample CSVs (< 100 KB, < 1000 rows)
Synthetic example datasets
Data schema specifications
References to external storage (S3, URLs, etc.)
Metadata files (JSON, YAML)

❌ Not Allowed:

Large datasets (> 1 MB)
Raw production data
Personal or sensitive information
API keys or credentials
Proprietary datasets
Binary data files (.pkl, .h5, .parquet)
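The rules above could be checked automatically before a commit. The following is a minimal sketch, not part of this repository: the function name `find_violations` and the constants are hypothetical, and the thresholds are simply copied from the two lists above.

```python
from pathlib import Path

# Assumed limits, copied from the Allowed / Not Allowed lists; adjust as needed.
MAX_SAMPLE_BYTES = 100 * 1024   # sample CSVs should stay under 100 KB
MAX_ANY_BYTES = 1024 * 1024     # anything over 1 MB counts as a large dataset
BLOCKED_SUFFIXES = {".pkl", ".h5", ".parquet"}


def find_violations(data_dir):
    """Return (path, reason) pairs for files that break the rules above."""
    violations = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        suffix = path.suffix.lower()
        if suffix in BLOCKED_SUFFIXES:
            violations.append((path, f"binary data file ({suffix})"))
        elif size > MAX_ANY_BYTES:
            violations.append((path, f"larger than 1 MB ({size} bytes)"))
        elif suffix == ".csv" and size >= MAX_SAMPLE_BYTES:
            violations.append((path, f"CSV of 100 KB or more ({size} bytes)"))
    return violations
```

Calling `find_violations("data")` from the repository root returns an empty list when the directory is clean. The check covers only file size and extension; it cannot tell whether a small CSV holds sensitive or proprietary content.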

Directory Structure

data/
├── README.md          # This file
├── samples/           # Small sample datasets for testing
│   └── *.csv
├── examples/          # Synthetic example datasets
│   └── *.csv
└── schemas/           # Data format specifications
    └── *.json

Note: /schemas at the repository root contains JSON schemas for core concepts. The data/schemas/ directory is for dataset format specifications.

External Data Sources

SIG Embeddings Dataset

Location: s3://blackroad-research/sig-embeddings/v2/
Size: ~10 GB (compressed)
Format: Parquet files partitioned by date
Schema: See /schemas/sig.schema.json
Access: Requires AWS credentials with read access to blackroad-research bucket
Description: Agent embeddings in spiral coordinate system with factor tree annotations
Last Updated: 2025-11-20

QLM Interaction Traces

Location: s3://blackroad-research/qlm-traces/
Size: ~5 GB
Format: JSONL (JSON Lines)
Schema: See schemas/interaction-trace.json (if created)
Access: Requires credentials
Description: Logged agent interactions for QLM analysis
Last Updated: 2025-11-15
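JSON Lines stores one JSON object per line, so a trace file can be read record by record without loading it all into memory. The sketch below assumes nothing about the real trace schema: the `agent_id` and `event` fields and the `iter_jsonl` helper are hypothetical and used only for illustration.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc


# Hypothetical records; the real trace fields are not documented here.
sample = io.StringIO(
    '{"agent_id": "a1", "event": "query"}\n'
    '\n'
    '{"agent_id": "a2", "event": "reply"}\n'
)
records = list(iter_jsonl(sample))
print(len(records))
```

Any iterable of text lines works as input, including an open file handle.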

Agent Worldline Dataset

Location: External database (contact research team)
Size: ~50 GB
Format: PostgreSQL dump
Access: Database credentials required
Description: Historical agent worldlines with PS-SHA∞ hashes
Last Updated: Ongoing

Sample Data

Small samples from larger datasets are stored in samples/ for:

Quick testing in notebooks
Example code demonstrations
CI/CD pipeline testing

Example:

import pandas as pd

# Load sample data (safe to commit)
df = pd.read_csv('data/samples/sig-sample-100.csv')
print(f"Sample has {len(df)} rows")

# Reference full dataset (not in repo)
# Full dataset: s3://blackroad-research/sig-embeddings/v2/
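One way to produce a small sample file is reservoir sampling, which draws a uniform random sample in a single pass without holding the full dataset in memory. This is a sketch under assumptions: it expects a local CSV export with a header row (the full SIG dataset is Parquet on S3), and the `sample_csv_rows` name and its parameters are hypothetical.

```python
import csv
import random


def sample_csv_rows(src_path, dst_path, k=100, seed=0):
    """Write a uniform random sample of up to k data rows from src to dst."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first row is a header
        reservoir = []
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)
            else:
                # Keep the new row with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

Sampled rows are not written in source order. Recording the seed alongside the sample file lets others regenerate the same sample from the same export.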

Synthetic Data

The examples/ directory contains generated synthetic data for:

Demonstrations
Tutorials
Unit tests
Public examples

These datasets are:

Artificially created (not real data)
Safe to share publicly
Reproducible from code

Example:

import numpy as np
import pandas as pd

# Generate synthetic spiral data
def generate_spiral_points(n=100, noise=0.1, seed=0):
    """Generate synthetic points on a logarithmic spiral with Gaussian noise."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the output reproducible
    t = np.linspace(0, 4*np.pi, n)
    r = np.exp(0.1 * t)
    x = r * np.cos(t) + noise * rng.standard_normal(n)
    y = r * np.sin(t) + noise * rng.standard_normal(n)
    return pd.DataFrame({'x': x, 'y': y, 't': t, 'r': r})

# Save synthetic data
df = generate_spiral_points(n=200)
df.to_csv('data/examples/spiral-synthetic-200.csv', index=False)

Adding External Data References

When you need to reference a new external dataset:

Add an entry to this README under "External Data Sources"
Include:
- Location (URL, S3 path, database connection)
- Size
- Format
- Schema reference
- Access requirements
- Description
- Last updated date
Optionally create a small sample in samples/
Document the schema in /schemas if it's a new format
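If dataset references are also kept as metadata files (JSON or YAML), the field checklist above can be enforced with a small check. The key names below are hypothetical, since this README lists the fields only in prose; the sketch shows one possible mapping.

```python
# Hypothetical key names mirroring the checklist above.
REQUIRED_FIELDS = (
    "location",
    "size",
    "format",
    "schema",
    "access",
    "description",
    "last_updated",
)


def missing_fields(entry):
    """Return the required fields that are absent or blank in a dataset entry."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


entry = {
    "location": "s3://blackroad-research/qlm-traces/",
    "size": "~5 GB",
    "format": "JSONL (JSON Lines)",
    "access": "Requires credentials",
    "description": "Logged agent interactions for QLM analysis",
    "last_updated": "2025-11-15",
}
print(missing_fields(entry))  # ['schema']
```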

Privacy & Security

🔒 Never commit:

Personally identifiable information (PII)
API keys, tokens, passwords
Proprietary datasets
Production database contents
Real user data
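A lightweight scan can catch some obvious credentials before they are committed. This is an illustrative sketch only: the three patterns and the `scan_text` helper are assumptions, the check will miss many kinds of secrets, and it is not a substitute for a dedicated secret scanner or for code review.

```python
import re

# Illustrative patterns only; dedicated scanners use far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "sk- style API key": re.compile(r"\bsk-[A-Za-z0-9]{10,}\b"),
    "hardcoded password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def scan_text(text):
    """Return (line_number, label) pairs for lines that look like secrets."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, label))
    return hits
```

Running `scan_text` over each staged file's contents and refusing the commit on any hit is one way to wire this into a pre-commit hook.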

If you accidentally commit sensitive data:

Immediately delete the file
Rotate any exposed credentials
Notify the team
Rewrite git history and force push if the file must be purged (deleting it in a new commit leaves it in earlier commits)

Data Usage in Code

Good Practice

import pandas as pd

# Reference external data
DATA_CONFIG = {
    'source': 's3://blackroad-research/sig-embeddings/v2/',
    'format': 'parquet',
    'size_gb': 10,
    'description': 'Agent embeddings in spiral coordinates'
}

# Load small sample for testing
sample_df = pd.read_csv('data/samples/sig-sample-100.csv')

# Generate synthetic data for demonstrations
# (generate_spiral_points is defined in the "Synthetic Data" section)
synthetic_df = generate_spiral_points(n=1000)

Bad Practice

# Don't do this - loading huge dataset
# df = pd.read_parquet('data/full-embeddings-10gb.parquet')  # ❌

# Don't do this - committing sensitive data
# api_key = "sk-1234567890"  # ❌
# df.to_csv('data/user-private-data.csv')  # ❌
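Instead of hardcoding a key as in the bad example, credentials can be read from the environment at runtime. This is a minimal sketch: the variable name `RESEARCH_API_KEY` and the `get_api_key` helper are hypothetical and not defined anywhere in this repository.

```python
import os


def get_api_key(env_var="RESEARCH_API_KEY"):
    """Read an API key from the environment instead of hardcoding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

Failing with a clear error when the variable is unset avoids silently sending requests with an empty key.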

Related Documentation

Research Overview
Experiment Template
Notebook Style Guide
Root /schemas directory for JSON Schemas
