Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,14 @@
### v1.0.0
```
added flag "-text-match" to filter page text matches
memory and performance optimizations for -file and -url modes
-file mode streams wordlists from disk instead of loading entire files into RAM
reduced RAM usage for large -sort wordlists
default -timeout increased from 1 to 10 seconds
progress bars, stats, and errors now write to stderr
sanitize url fragments for dedup and extension checks
updated default User-Agent
```
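The "sanitize url fragments for dedup and extension checks" entry can be illustrated with a minimal sketch. It assumes the idea is to drop the `#fragment` before a link is compared against already-seen URLs or tested for its file extension. The helper names `stripFragment` and `extensionOf` are hypothetical and are not taken from Spider's source.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// stripFragment returns rawURL without its #fragment (hypothetical helper).
// If the URL cannot be parsed, it falls back to cutting at the first '#'.
func stripFragment(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		if i := strings.IndexByte(rawURL, '#'); i >= 0 {
			return rawURL[:i]
		}
		return rawURL
	}
	u.Fragment = ""
	u.RawFragment = ""
	return u.String()
}

// extensionOf returns the lowercased extension of the URL path,
// ignoring any query string or fragment (hypothetical helper).
func extensionOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return strings.ToLower(path.Ext(u.Path))
}

func main() {
	seen := make(map[string]bool)
	links := []string{
		"https://example.com/docs/page.html#intro",
		"https://example.com/docs/page.html#usage",
		"https://example.com/logo.PNG#top",
	}
	for _, link := range links {
		clean := stripFragment(link)
		if seen[clean] {
			continue // same page, different fragment
		}
		seen[clean] = true
		fmt.Println(clean, extensionOf(clean))
	}
}
```

In this sketch the two `page.html` links collapse into one entry, and the extension check sees `.png` rather than `.PNG#top`.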
### v0.9.1
```
added flag "-agent" to allow user to specify custom user-agent; https://github.com/cyclone-github/spider/issues/8
117 changes: 77 additions & 40 deletions README.md
@@ -9,78 +9,122 @@
Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output files, frequency sort, and ngram options, making it ideal for web scraping to create targeted wordlists for tools like hashcat or John the Ripper. Since Spider is written in Go, it requires no additional libraries to download or install.
*Spider just works*.

### Install latest release:
```
go install github.com/cyclone-github/spider@latest
```
### Install from latest source code (bleeding edge):
```
go install github.com/cyclone-github/spider@main
```

### Modes
- **URL mode** (`-url`) — crawl a website and create wordlist/ngrams (frequency sorted optional)
- **File mode** (`-file`) — process a local text file to create wordlist/ngrams (frequency sorted optional)

# Spider: URL Mode
```
- spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt -agent 'foobar agent'
+ spider -url 'https://github.com/hashpwn' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 10 -url-match 'hashpwn' -text-match 'hashpwn' -o hashpwn_spider.txt -agent 'foobar agent'
```
```
- ----------------------
- | Cyclone's URL Spider |
- ----------------------
+ ------------------
+ | Cyclone's Spider |
+ ------------------

- Crawling URL: https://forum.hashpwn.net
- Base domain: forum.hashpwn.net
+ Crawling URL: https://github.com/hashpwn
+ Base domain: github.com
Crawl depth: 2
ngram len: 1-3
- Crawl delay: 20ms (increase this to avoid rate limiting)
- Timeout: 1 sec
- URLs crawled: 2
- Processing... [====================] 100.00%
- Unique words: 475
- Unique ngrams: 1977
- Sorting n-grams by frequency...
+ Crawl delay: 20ms (increase to avoid rate limiting)
+ Timeout: 10 sec
+ URL match: hashpwn
+ Text match: hashpwn
+ Scan/match: 22/21
+ Unique words: 847
+ Unique ngrams: 3816
+ Sorting wordlist by frequency...
Writing... [====================] 100.00%
- Output file: forum.hashpwn.net_spider.txt
- RAM used: 0.02 GB
- Runtime: 2.283s
+ Output file: hashpwn_spider.txt
+ RAM used: 0.003 GB
+ Runtime: 11.279s
```

When `-text-match` is used, all pages are still crawled for URL discovery but only pages with matching text are added to the wordlist. Crawl progress shows scanned vs matched:
```
spider -url 'https://en.wikipedia.org/wiki/PBKDF2' -crawl 2 -sort -text-match 'pbkdf2' -delay 10 -o pbkdf2_spider.txt
```
```
------------------
| Cyclone's Spider |
------------------

Crawling URL: https://en.wikipedia.org/wiki/PBKDF2
Base domain: en.wikipedia.org
Crawl depth: 2
ngram len: 1
Crawl delay: 10ms (increase to avoid rate limiting)
Timeout: 10 sec
Text match: pbkdf2
Scan/match: 213/114
Unique words: 34539
Unique ngrams: 34539
Sorting wordlist by frequency...
Writing... [====================] 100.00%
Output file: pbkdf2_spider.txt
RAM used: 0.012 GB
Runtime: 13.715s
```
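The `-text-match` behaviour described above (every page is scanned for links, but only matching pages feed the wordlist) can be sketched as follows. This is an illustration under stated assumptions, not Spider's actual code: the `page` type, the in-memory `site` map, and the `crawl` and `textMatches` functions are hypothetical, and depth is assumed to count the start page as level 1.

```go
package main

import (
	"fmt"
	"strings"
)

// page is a hypothetical stand-in for a fetched page:
// its visible text plus the links found on it.
type page struct {
	text  string
	links []string
}

// textMatches reports whether text contains keyword, ignoring case.
// An empty keyword matches every page.
func textMatches(text, keyword string) bool {
	if keyword == "" {
		return true
	}
	return strings.Contains(strings.ToLower(text), strings.ToLower(keyword))
}

// crawl walks the site breadth-first for `depth` levels starting at start.
// Every reachable page is scanned for links, but only pages whose text
// matches keyword contribute words to the frequency map.
func crawl(site map[string]page, start string, depth int, keyword string) (scanned, matched int, freq map[string]int) {
	freq = make(map[string]int)
	visited := map[string]bool{start: true}
	frontier := []string{start}
	for level := 0; level < depth && len(frontier) > 0; level++ {
		var next []string
		for _, u := range frontier {
			p, ok := site[u]
			if !ok {
				continue // link target not available in this toy site
			}
			scanned++
			if textMatches(p.text, keyword) {
				matched++
				for _, w := range strings.Fields(p.text) {
					freq[w]++
				}
			}
			// Links are followed regardless of the text match.
			for _, l := range p.links {
				if !visited[l] {
					visited[l] = true
					next = append(next, l)
				}
			}
		}
		frontier = next
	}
	return scanned, matched, freq
}

func main() {
	site := map[string]page{
		"/a": {text: "PBKDF2 key derivation", links: []string{"/b", "/c"}},
		"/b": {text: "unrelated page", links: []string{"/c"}},
		"/c": {text: "pbkdf2 salt rounds"},
	}
	scanned, matched, freq := crawl(site, "/a", 2, "pbkdf2")
	fmt.Printf("Scan/match: %d/%d\n", scanned, matched)
	fmt.Println("Unique words:", len(freq))
}
```

Here `/b` is still scanned so that its links are discovered, but its words never reach `freq`, which mirrors the scanned versus matched counters in the output above.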

# Spider: File Mode
```
spider -file kjv_bible.txt -sort -ngram 1-3
```
```
- ----------------------
- | Cyclone's URL Spider |
- ----------------------
+ ------------------
+ | Cyclone's Spider |
+ ------------------

Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
- Sorting n-grams by frequency...
+ Sorting wordlist by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
- RAM used: 0.13 GB
- Runtime: 1.359s
+ RAM used: 0.073 GB
+ Runtime: 1.137s
```

Wordlist & ngram creation tool to crawl a given url or process a local file to create wordlists and/or ngrams (depending on flags given).
### Usage Instructions:
- - To create a simple wordlist from a specified url (will save deduplicated wordlist to url_spider.txt):
+ - To create a simple wordlist from a specified url (will save deduplicated wordlist to url_spider.txt if `-o` is not set):
  - `spider -url 'https://github.com/cyclone-github'`
- - To set url crawl url depth of 2 and create ngrams len 1-5, use flag "-crawl 2" and "-ngram 1-5"
+ - To set url crawl depth of 2 and create ngrams len 1-5, use flag "-crawl 2" and "-ngram 1-5"
  - `spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5`
  - To set a custom output file, use flag "-o filename"
  - `spider -url 'https://github.com/cyclone-github' -o wordlist.txt`
  - To set a delay to keep from being rate-limited, use flag "-delay nth" where nth is time in milliseconds
  - `spider -url 'https://github.com/cyclone-github' -delay 100`
- - To set a URL timeout, use flag "-timeout nth" where nth is time in seconds
- - `spider -url 'https://github.com/cyclone-github' -timeout 2`
+ - To set a URL timeout, use flag "-timeout nth" where nth is time in seconds (default 10)
+ - `spider -url 'https://github.com/cyclone-github' -timeout 10`
  - To create ngrams len 1-3 and sort output by frequency, use "-ngram 1-3" "-sort"
  - `spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort`
- - To filter crawled URLs by keyword "foobar"
- - `spider -url 'https://github.com/cyclone-github' -url-match foobar`
+ - To filter crawled URLs by keyword "spider" (only follow/crawl matching URLs)
+ - `spider -url 'https://github.com/cyclone-github' -url-match 'spider'`
+ - Only match pages containing text keyword (all URLs are still crawled, but only pages containing keyword are added to wordlist)
+ - `spider -url 'https://en.wikipedia.org/wiki/PBKDF2' -text-match 'pbkdf2'`
  - To specify a custom user-agent
- - `spider -url 'https://github.com/cyclone-github' -agent 'foobar'`
+ - `spider -url 'https://github.com/cyclone-github' -agent 'foobar user agent'`
  - To process a local text file, create ngrams len 1-3 and sort output by frequency
  - `spider -file foobar.txt -ngram 1-3 -sort`
  - Run `spider -help` to see a list of all options

### spider -help
```
Usage of spider:
-agent string
- Custom user-agent (default "Spider/0.9.1 (+https://github.com/cyclone-github/spider)")
+ Custom user-agent (default "Mozilla/5.0 (X11) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 Spider/1.0.0")
-crawl int
Depth of links to crawl (default 1)
-cyclone
@@ -95,8 +139,10 @@ Wordlist & ngram creation tool to crawl a given url or process a local file to c
Output file for the n-grams
-sort
Sort output by frequency
-text-match string
Only process pages with text containing this keyword (case-insensitive); all URLs are still crawled
-timeout int
- Timeout for URL crawling in seconds (default 1)
+ Timeout for URL crawling in seconds (default 10)
-url string
URL of the website to scrape
-url-match string
@@ -105,15 +151,6 @@ Wordlist & ngram creation tool to crawl a given url or process a local file to c
Display version
```

- ### Install latest release:
- ```
- go install github.com/cyclone-github/spider@latest
- ```
- ### Install from latest source code (bleeding edge):
- ```
- go install github.com/cyclone-github/spider@main
- ```
-
### Compile from source:
- If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
- This assumes you have Go and Git installed
4 changes: 2 additions & 2 deletions go.mod
@@ -1,10 +1,10 @@
module github.com/cyclone-github/spider

- go 1.25.5
+ go 1.26.2

require github.com/PuerkitoBio/goquery v1.12.0

require (
github.com/andybalholm/cascadia v1.3.3 // indirect
- golang.org/x/net v0.52.0 // indirect
+ golang.org/x/net v0.54.0 // indirect
)
4 changes: 2 additions & 2 deletions go.sum
@@ -24,8 +24,8 @@ golang.org/x/net v0.15.0/go.mod h1:idbUs1IY1+zTqbi8yxTbhexhEEk5ur9LInksu6HrEpk=
golang.org/x/net v0.21.0/go.mod h1:bIjVDfnllIU7BJ2DNgfnXvpSvtn8VRwhlsaeUTyUS44=
golang.org/x/net v0.25.0/go.mod h1:JkAGAh7GEvH74S6FOH42FLoXpXbE/aqXSrIQjXgsiwM=
golang.org/x/net v0.33.0/go.mod h1:HXLR5J+9DxmrqMwG9qjGCxZ+zKXxBru04zlTvWlWuN4=
- golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
- golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
+ golang.org/x/net v0.54.0 h1:2zJIZAxAHV/OHCDTCOHAYehQzLfSXuf/5SoL/Dv6w/w=
+ golang.org/x/net v0.54.0/go.mod h1:Sj4oj8jK6XmHpBZU/zWHw3BV3abl4Kvi+Ut7cQcY+cQ=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=