simulation-theory/scrapers/README.md
2026-02-25 18:20:10 +00:00

Scrapers

Python web scrapers for collecting data relevant to the simulation-theory research repository.

Scrapers

| Script | Source | Topics |
|---|---|---|
| arxiv_scraper.py | arXiv | Simulation hypothesis, Gödel incompleteness, Riemann zeta, qutrit/ternary quantum, halting problem, IIT consciousness |
| wikipedia_scraper.py | Wikipedia | SHA-256, Riemann hypothesis, quantum computing, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain |
| oeis_scraper.py | OEIS | Prime numbers, Fibonacci, pi digits, Euler-Mascheroni constant, Catalan numbers, partition numbers |

Setup

pip install -r requirements.txt

Usage

arXiv scraper

# Use default topic list
python arxiv_scraper.py

# Custom query, limit to 3 results per query
python arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3

# Save to file
python arxiv_scraper.py --output arxiv_results.json
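
For readers wondering how flags like these might be wired up, here is a minimal sketch using `argparse`. It is not taken from `arxiv_scraper.py`: the function name `build_parser`, the `max_results` attribute, and the default of 5 results are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Query arXiv for papers.")
    # --query: one custom search string; None means "fall back to the default topic list".
    parser.add_argument("--query", default=None, help="custom search query")
    # --max: results per query. The default of 5 is an assumption, not the script's value.
    parser.add_argument("--max", type=int, default=5, dest="max_results",
                        help="maximum results per query")
    # --output: optional file path; None means "print JSON to stdout".
    parser.add_argument("--output", default=None, help="write JSON to this file")
    return parser


# Parse the same arguments as the "custom query" example above.
args = build_parser().parse_args(["--query", "Riemann hypothesis zeros", "--max", "3"])
print(args.query, args.max_results, args.output)
```

Storing `--max` under `max_results` avoids shadowing Python's built-in `max` when the value is passed around.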

Wikipedia scraper

# Use default topic list
python wikipedia_scraper.py

# Custom topics
python wikipedia_scraper.py --topics "Riemann hypothesis" "SHA-2" "Turing machine"

# Save to file
python wikipedia_scraper.py --output wikipedia_results.json

OEIS scraper

# Use default sequence list
python oeis_scraper.py

# Custom sequence IDs
python oeis_scraper.py --ids A000040 A000045 A000796

# Save to file
python oeis_scraper.py --output oeis_results.json

Output format

All scrapers output JSON to stdout by default, or to a file with --output.
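
The stdout-or-file behaviour can be sketched as a small helper. This is an illustration of the pattern, not the scrapers' actual code; the name `emit` and the indentation setting are assumptions.

```python
import json
import sys
from typing import Any, Optional


def emit(results: Any, output_path: Optional[str] = None) -> None:
    """Write results as JSON to output_path, or to stdout when no path is given."""
    # ensure_ascii=False keeps names such as "Gödel" readable in the output.
    text = json.dumps(results, indent=2, ensure_ascii=False)
    if output_path is None:
        sys.stdout.write(text + "\n")
    else:
        with open(output_path, "w", encoding="utf-8") as fh:
            fh.write(text + "\n")


emit([{"topic": "SHA-2"}])  # no path given, so the JSON goes to stdout
```

Because the default is stdout, the output can also be piped into another tool without creating an intermediate file.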

arXiv — dict keyed by query, each value is a list of:

{
  "title": "...",
  "authors": ["..."],
  "published": "2024-01-01T00:00:00Z",
  "abstract": "...",
  "url": "https://arxiv.org/abs/..."
}
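
As a sketch of how the arXiv output might be consumed, the snippet below flattens the per-query dict into rows and parses the `published` timestamp. The helper name `flatten_arxiv` is hypothetical, and the sample reuses the placeholder values from the schema above rather than real results.

```python
from datetime import datetime, timezone


def flatten_arxiv(results):
    """Turn {query: [paper, ...]} into a flat list of (query, title, published) rows."""
    rows = []
    for query, papers in results.items():
        for paper in papers:
            # Timestamps in the documented format end in "Z", i.e. UTC.
            published = datetime.strptime(
                paper["published"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            rows.append((query, paper["title"], published))
    return rows


# Placeholder data shaped like the documented output, not real results.
sample = {
    "Riemann hypothesis zeros": [
        {
            "title": "...",
            "authors": ["..."],
            "published": "2024-01-01T00:00:00Z",
            "abstract": "...",
            "url": "https://arxiv.org/abs/...",
        }
    ]
}

for query, title, published in flatten_arxiv(sample):
    print(query, title, published.date())
```

For a saved file, the same function can be applied to the result of `json.load` on `arxiv_results.json`.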

Wikipedia — list of:

{
  "topic": "SHA-2",
  "title": "SHA-2",
  "url": "https://en.wikipedia.org/wiki/SHA-2",
  "summary": "..."
}
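
Since the Wikipedia output is a flat list, looking up one topic means scanning it. A minimal sketch of building a lookup table keyed by `topic` follows; the helper name `index_by_topic` is hypothetical, and it assumes each requested topic appears at most once.

```python
from typing import Dict, List


def index_by_topic(entries: List[dict]) -> Dict[str, dict]:
    """Map each entry's "topic" field to the full entry for direct lookup."""
    # If a topic were repeated, the later entry would overwrite the earlier one.
    return {entry["topic"]: entry for entry in entries}


# Placeholder data shaped like the documented output.
sample = [
    {
        "topic": "SHA-2",
        "title": "SHA-2",
        "url": "https://en.wikipedia.org/wiki/SHA-2",
        "summary": "...",
    }
]

by_topic = index_by_topic(sample)
print(by_topic["SHA-2"]["url"])
```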

OEIS — list of:

{
  "id": "A000040",
  "name": "The prime numbers.",
  "description": "...",
  "values": ["2", "3", "5", "7", "11", "..."],
  "url": "https://oeis.org/A000040"
}
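
Note that `values` holds strings, not numbers. A sketch of converting them to integers follows; it assumes every real term is a decimal integer string and treats anything else (such as the `"..."` elision in the example above) as something to skip. The helper name `numeric_values` is hypothetical.

```python
from typing import List


def numeric_values(entry: dict) -> List[int]:
    """Convert an OEIS entry's string terms to ints, skipping non-numeric items."""
    terms = []
    for raw in entry["values"]:
        try:
            terms.append(int(raw))
        except ValueError:
            # Not an integer string (for example the "..." placeholder); skip it.
            continue
    return terms


# Placeholder entry copied from the schema above.
sample = {
    "id": "A000040",
    "name": "The prime numbers.",
    "description": "...",
    "values": ["2", "3", "5", "7", "11", "..."],
    "url": "https://oeis.org/A000040",
}

print(numeric_values(sample))
```

Python integers have arbitrary precision, so large terms do not overflow.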