simulation-theory/scrapers/README.md
2026-02-25 18:20:10 +00:00

# Scrapers
Python web scrapers for collecting data relevant to the simulation-theory research repository.
## Scrapers
| Script | Source | Topics |
|--------|--------|--------|
| [`arxiv_scraper.py`](./arxiv_scraper.py) | [arXiv](https://arxiv.org) | Simulation hypothesis, Gödel incompleteness, Riemann zeta, qutrit/ternary quantum, halting problem, IIT consciousness |
| [`wikipedia_scraper.py`](./wikipedia_scraper.py) | [Wikipedia](https://en.wikipedia.org) | SHA-256, Riemann hypothesis, quantum computing, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain |
| [`oeis_scraper.py`](./oeis_scraper.py) | [OEIS](https://oeis.org) | Prime numbers, Fibonacci, pi digits, Euler-Mascheroni constant, Catalan numbers, partition numbers |
## Setup
```bash
pip install -r requirements.txt
```
## Usage
### arXiv scraper
```bash
# Use default topic list
python arxiv_scraper.py
# Custom query, limit to 3 results per query
python arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3
# Save to file
python arxiv_scraper.py --output arxiv_results.json
```
### Wikipedia scraper
```bash
# Use default topic list
python wikipedia_scraper.py
# Custom topics
python wikipedia_scraper.py --topics "Riemann hypothesis" "SHA-2" "Turing machine"
# Save to file
python wikipedia_scraper.py --output wikipedia_results.json
```
### OEIS scraper
```bash
# Use default sequence list
python oeis_scraper.py
# Custom sequence IDs
python oeis_scraper.py --ids A000040 A000045 A000796
# Save to file
python oeis_scraper.py --output oeis_results.json
```
## Output format
All scrapers output JSON to stdout by default, or to a file with `--output`.
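The stdout-or-file behaviour described above can be sketched as follows. This is a minimal illustration, not the scrapers' actual code: `write_results` and `build_parser` are hypothetical names, and only the `--output` flag is taken from this README.

```python
import argparse
import json
import sys


def write_results(results, output_path=None):
    """Serialize results as JSON to output_path, or to stdout when it is None."""
    text = json.dumps(results, indent=2, ensure_ascii=False)
    if output_path is None:
        sys.stdout.write(text + "\n")
    else:
        with open(output_path, "w", encoding="utf-8") as fh:
            fh.write(text + "\n")


def build_parser():
    """Hypothetical CLI parser exposing only the documented --output flag."""
    parser = argparse.ArgumentParser(description="Example scraper CLI (sketch)")
    parser.add_argument(
        "--output",
        default=None,
        help="write JSON to this file instead of stdout",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder payload; a real scraper would pass its collected records.
    write_results([], args.output)
```

Keeping the writer separate from the argument parser means the same function can serve all three scrapers, whatever shape their results take.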
**arXiv**: a dict keyed by query string, where each value is a list of objects of this form:
```json
{
  "title": "...",
  "authors": ["..."],
  "published": "2024-01-01T00:00:00Z",
  "abstract": "...",
  "url": "https://arxiv.org/abs/..."
}
```
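To consume the arXiv output, one option is to flatten the per-query lists into rows and sort them by publication date. The sketch below assumes the `published` field always uses the `YYYY-MM-DDTHH:MM:SSZ` form shown in the sample. `parse_published` and `flatten_arxiv` are hypothetical helper names, and the inline record is placeholder data, not real scraper output.

```python
import json
from datetime import datetime, timezone


def parse_published(stamp):
    """Parse a timestamp like '2024-01-01T00:00:00Z' into an aware datetime."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def flatten_arxiv(results):
    """Turn {query: [paper, ...]} into a flat list of rows, newest first."""
    rows = []
    for query, papers in results.items():
        for paper in papers:
            authors = paper["authors"]
            rows.append(
                {
                    "query": query,
                    "title": paper["title"],
                    "first_author": authors[0] if authors else None,
                    "published": parse_published(paper["published"]),
                    "url": paper["url"],
                }
            )
    return sorted(rows, key=lambda row: row["published"], reverse=True)


# Placeholder record in the documented shape. Real data would come from
# json.load() on a file written with --output.
sample = json.loads(
    """
    {
      "Riemann hypothesis zeros": [
        {
          "title": "Placeholder title",
          "authors": ["Placeholder Author"],
          "published": "2024-01-01T00:00:00Z",
          "abstract": "...",
          "url": "https://arxiv.org/abs/..."
        }
      ]
    }
    """
)

for row in flatten_arxiv(sample):
    print(row["published"].date(), row["first_author"], row["title"])
```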
**Wikipedia**: a list of objects of this form:
```json
{
  "topic": "SHA-2",
  "title": "SHA-2",
  "url": "https://en.wikipedia.org/wiki/SHA-2",
  "summary": "..."
}
```
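Because the Wikipedia output is a flat list, a lookup by requested topic can be handy. The following sketch assumes each entry carries the four keys shown in the sample. `index_by_topic` and `one_line` are hypothetical helpers, not part of `wikipedia_scraper.py`.

```python
import textwrap


def index_by_topic(entries):
    """Map each requested topic to its entry for direct lookup."""
    return {entry["topic"]: entry for entry in entries}


def one_line(entry, width=80):
    """Render 'title: summary' with the summary shortened to at most width characters."""
    summary = textwrap.shorten(entry["summary"], width=width, placeholder="...")
    return f"{entry['title']}: {summary}"


# Entry copied from the documented sample; the summary is elided there too.
entries = [
    {
        "topic": "SHA-2",
        "title": "SHA-2",
        "url": "https://en.wikipedia.org/wiki/SHA-2",
        "summary": "...",
    }
]

by_topic = index_by_topic(entries)
print(one_line(by_topic["SHA-2"]))
```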
**OEIS**: a list of objects of this form:
```json
{
  "id": "A000040",
  "name": "The prime numbers.",
  "description": "...",
  "values": ["2", "3", "5", "7", "11", "..."],
  "url": "https://oeis.org/A000040"
}
```
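Note that `values` holds strings rather than numbers, so numeric work needs a conversion step. The sketch below assumes the terms are decimal integer strings as in the sample, and skips anything that does not parse. `parse_values` is a hypothetical helper name.

```python
def parse_values(entry):
    """Convert an entry's string terms to ints, skipping non-numeric items."""
    numbers = []
    for raw in entry["values"]:
        try:
            numbers.append(int(raw))
        except ValueError:
            # Assumption: non-numeric items (such as an elision marker)
            # can be ignored rather than treated as errors.
            continue
    return numbers


# Entry copied from the documented sample, including its elided tail.
entry = {
    "id": "A000040",
    "name": "The prime numbers.",
    "description": "...",
    "values": ["2", "3", "5", "7", "11", "..."],
    "url": "https://oeis.org/A000040",
}

print(entry["id"], parse_values(entry))
```

Python integers are arbitrary precision, so fast-growing sequences such as the partition numbers convert without overflow.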