mirror of
https://github.com/blackboxprogramming/simulation-theory.git
synced 2026-03-17 05:57:19 -05:00
Add scrapers for arXiv, Wikipedia, and OEIS
Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>
This commit is contained in:
94
scrapers/README.md
Normal file
94
scrapers/README.md
Normal file
@@ -0,0 +1,94 @@
|
||||
# Scrapers
|
||||
|
||||
Python web scrapers for collecting data relevant to the simulation-theory research repository.
|
||||
|
||||
## Scrapers
|
||||
|
||||
| Script | Source | Topics |
|
||||
|--------|--------|--------|
|
||||
| [`arxiv_scraper.py`](./arxiv_scraper.py) | [arXiv](https://arxiv.org) | Simulation hypothesis, Gödel incompleteness, Riemann zeta, qutrit/ternary quantum, halting problem, IIT consciousness |
|
||||
| [`wikipedia_scraper.py`](./wikipedia_scraper.py) | [Wikipedia](https://en.wikipedia.org) | SHA-256, Riemann hypothesis, quantum computing, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain |
|
||||
| [`oeis_scraper.py`](./oeis_scraper.py) | [OEIS](https://oeis.org) | Prime numbers, Fibonacci, pi digits, Euler–Mascheroni constant, Catalan numbers, partition numbers |
|
||||
|
||||
## Setup
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### arXiv scraper
|
||||
|
||||
```bash
|
||||
# Use default topic list
|
||||
python arxiv_scraper.py
|
||||
|
||||
# Custom query, limit to 3 results per query
|
||||
python arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3
|
||||
|
||||
# Save to file
|
||||
python arxiv_scraper.py --output arxiv_results.json
|
||||
```
|
||||
|
||||
### Wikipedia scraper
|
||||
|
||||
```bash
|
||||
# Use default topic list
|
||||
python wikipedia_scraper.py
|
||||
|
||||
# Custom topics
|
||||
python wikipedia_scraper.py --topics "Riemann hypothesis" "SHA-2" "Turing machine"
|
||||
|
||||
# Save to file
|
||||
python wikipedia_scraper.py --output wikipedia_results.json
|
||||
```
|
||||
|
||||
### OEIS scraper
|
||||
|
||||
```bash
|
||||
# Use default sequence list
|
||||
python oeis_scraper.py
|
||||
|
||||
# Custom sequence IDs
|
||||
python oeis_scraper.py --ids A000040 A000045 A000796
|
||||
|
||||
# Save to file
|
||||
python oeis_scraper.py --output oeis_results.json
|
||||
```
|
||||
|
||||
## Output format
|
||||
|
||||
All scrapers output JSON to stdout by default, or to a file with `--output`.
|
||||
|
||||
**arXiv** — dict keyed by query, each value is a list of:
|
||||
```json
|
||||
{
|
||||
"title": "...",
|
||||
"authors": ["..."],
|
||||
"published": "2024-01-01T00:00:00Z",
|
||||
"abstract": "...",
|
||||
"url": "https://arxiv.org/abs/..."
|
||||
}
|
||||
```
|
||||
|
||||
**Wikipedia** — list of:
|
||||
```json
|
||||
{
|
||||
"topic": "SHA-2",
|
||||
"title": "SHA-2",
|
||||
"url": "https://en.wikipedia.org/wiki/SHA-2",
|
||||
"summary": "..."
|
||||
}
|
||||
```
|
||||
|
||||
**OEIS** — list of:
|
||||
```json
|
||||
{
|
||||
"id": "A000040",
|
||||
"name": "The prime numbers.",
|
||||
"description": "...",
|
||||
"values": ["2", "3", "5", "7", "11", "..."],
|
||||
"url": "https://oeis.org/A000040"
|
||||
}
|
||||
```
|
||||
Reference in New Issue
Block a user