
Web Scraper Platform


Production-ready web scraping platform with PostgreSQL, WebUI, REST API, and price monitoring.


Features

| Feature | Description |
| --- | --- |
| Multi-site scraping | Scrape any e-commerce site with CSS selectors (example below) |
| Auto-detect | Detect selectors automatically from a URL |
| Stealth mode | Bypass Akamai, Cloudflare, and PerimeterX bot protection |
| WebUI | Configure and monitor via web interface |
| REST API | Programmatic data access with API key authentication |
| Price alerts | Discord notifications for price drops |
| PostgreSQL | Production-grade database with connection pooling |
| Docker | Run everything with a single docker compose up |
| Proxy support | Optional SOCKS5/HTTP proxy |
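
A per-site configuration pairs a product-container selector with field selectors (stored in the scraper_config table, see Database Schema below). The values here, and the container/field split, are illustrative assumptions for a shop that renders products as cards:

product_selector: div.product-card     # hypothetical: one match per product listing
title_selector:   h2.product-title     # hypothetical: product name inside the container
price_selector:   span.price           # hypothetical: price text inside the container
link_selector:    a.product-link       # hypothetical: link to the product detail page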

Quick Start

# 1. Create a minimal .env file
cat > .env <<'EOF'
DOCKER=/path/to/docker/data
DOMAIN=example.com
TZ=Europe/Stockholm
EOF

# 2. Create required directories
mkdir -p /path/to/docker/data/scraper/{postgres,logs,playwright-cache,credentials}

# 3. (Optional) Add Discord webhook for price alerts
echo "https://discord.com/api/webhooks/..." \
  > /path/to/docker/data/scraper/credentials/discord_webhook

# 4. Start
docker compose up -d

# 5. Check logs for generated credentials (first start only)
docker compose logs postgres   # → database password
docker compose logs scraper    # → API key

# 6. Open WebUI
# http://localhost:3000
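
To verify the webhook before the first alert fires, you can post a test message to it yourself; Discord webhooks accept a simple JSON payload with a content field:

# Optional: send a test message to the Discord webhook
curl -H "Content-Type: application/json" \
  -d '{"content": "scraper: webhook test"}' \
  "$(cat /path/to/docker/data/scraper/credentials/discord_webhook)"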

Services (Docker)

| Container | Port | Description |
| --- | --- | --- |
| postgres | 5432 (internal) | PostgreSQL database |
| scraper | 3000 (WebUI), 8000 (API) | Web UI, API, scraper engine, alerts |

Credentials

The database password and API key are auto-generated on first startup; all credential files are stored in ${DOCKER}/scraper/credentials/:

| File | Generated by | Description |
| --- | --- | --- |
| db_password | postgres container | Database password (logged once on first start) |
| api_key | scraper container | API key for REST access (logged once on first start) |
| discord_webhook | you | Webhook URL from Discord; create manually if you want alerts |

Credentials can be changed at any time under Configuration → Advanced settings → Database credentials in the WebUI.

# Retrieve the API key after first start
cat /path/to/docker/data/scraper/credentials/api_key
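
The API examples below assume this key is exported into the shell:

# Export the key for the curl examples below
export API_KEY=$(cat /path/to/docker/data/scraper/credentials/api_key)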

API Examples

All endpoints except /health and /docs require an X-API-Key header.

# Get all products
curl -H "X-API-Key: ${API_KEY}" http://localhost:8000/products

# Search products
curl -H "X-API-Key: ${API_KEY}" "http://localhost:8000/products?search=RTX"

# Get price drops
curl -H "X-API-Key: ${API_KEY}" "http://localhost:8000/deals?min_drop_percent=10"

# Export to CSV
curl -H "X-API-Key: ${API_KEY}" http://localhost:8000/export/csv > products.csv

API Documentation: http://localhost:8000/docs
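
For quick inspection on the command line, responses can be piped through jq. The field names below are assumptions based on the products table (see Database Schema); the actual response shape may differ:

# List titles and prices, cheapest first (field names assumed from the schema)
curl -s -H "X-API-Key: ${API_KEY}" http://localhost:8000/products \
  | jq -r '.[] | "\(.current_price)\t\(.title)"' | sort -n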


Configuration (.env)

Only three variables are required:

DOCKER=/path/to/docker/data   # where volumes are stored
DOMAIN=example.com             # used for reverse proxy labels
TZ=Europe/Stockholm            # timezone

All other settings (scrape interval, alert thresholds, proxy, stealth, etc.) are configured in the WebUI under Advanced settings and stored in the database.


Optional: Scheduled Database Backups

Add this service to your docker-compose.yml for automatic daily pg_dump backups (kept 7 days):

  pgdump:
    image: postgres:16-alpine
    container_name: scraper_pgdump
    restart: unless-stopped
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        while true; do
          # PGPASSWORD must be exported, or pg_dump will not see it
          export PGPASSWORD=$(cat /run/secrets/scraper_password)
          if pg_dump -h postgres -U scraper scraper -Fc \
            -f "/backup/scraper_$(date +%Y%m%d_%H%M).dump"; then
            find /backup -name '*.dump' -mtime +7 -delete
            echo "[$(date '+%T')] pg_dump ok"
            sleep 86400
          else
            echo "[$(date '+%T')] pg_dump failed"
            sleep 3600   # retry sooner after a failure
          fi
        done
    secrets:
      - scraper_password
    volumes:
      - ${DOCKER}/scraper/backup:/backup
    depends_on:
      postgres:
        condition: service_healthy
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "2"
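
To restore a dump, the same image, secret, and network can be reused through docker compose run (a sketch; substitute a real file name from ${DOCKER}/scraper/backup):

# Restore a backup into the running postgres service
docker compose run --rm pgdump '
  export PGPASSWORD=$(cat /run/secrets/scraper_password)
  pg_restore -h postgres -U scraper -d scraper --clean /backup/scraper_YYYYMMDD_HHMM.dump
'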

Troubleshooting

Postgres won't start

The data directory must be owned by the Postgres container user (UID 999 in the official image):

sudo chown -R 999:999 ${DOCKER}/scraper/postgres

API returns 401 Unauthorized

Check that the X-API-Key header matches the key stored in ${DOCKER}/scraper/credentials/api_key:

curl -H "X-API-Key: ${API_KEY}" http://localhost:8000/products

No products are scraped

Test the selectors with the Detect button in the WebUI, then check the scraper logs:

docker compose logs scraper --tail 50

Database Schema

products (
  id SERIAL PRIMARY KEY,
  url TEXT UNIQUE,
  title TEXT,
  current_price INTEGER,
  first_seen TIMESTAMP,
  last_updated TIMESTAMP,
  site_config_id INTEGER
)

price_history (
  id SERIAL PRIMARY KEY,
  product_id INTEGER REFERENCES products(id),
  price INTEGER,
  timestamp TIMESTAMP
)

scraper_config (
  id SERIAL PRIMARY KEY,
  name TEXT UNIQUE,
  base_url TEXT,
  product_selector TEXT,
  title_selector TEXT,
  price_selector TEXT,
  link_selector TEXT,
  enabled INTEGER DEFAULT 1,
  use_stealth INTEGER DEFAULT 0,
  max_pages INTEGER DEFAULT 10,
  min_price INTEGER DEFAULT 0,
  max_price INTEGER DEFAULT 999999
)
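
Because price_history keeps one row per observation, price drops can be computed directly in SQL; a sketch, run via psql inside the postgres container:

# Top 10 products by drop from their historical peak
docker compose exec postgres psql -U scraper -d scraper -c "
  SELECT p.title, p.current_price,
         MIN(h.price) AS lowest, MAX(h.price) AS highest
  FROM products p
  JOIN price_history h ON h.product_id = p.id
  GROUP BY p.id, p.title, p.current_price
  ORDER BY MAX(h.price) - p.current_price DESC
  LIMIT 10;"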

License

MIT - see LICENSE
