tor-worm
Python 3 Tor crawler
features
- crawls Tor sites
- stores links between sites and their pages
- detects content type / language
- provides language-specific content search
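
As an illustration of how links between sites could be derived from a fetched page, here is a minimal sketch using only the standard library. The names `LinkCollector` and `onion_edges` are hypothetical and are not taken from this repository; the actual worker code may parse pages differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute href targets from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def onion_edges(page_url, html):
    """Return sorted (source_host, target_host) pairs for links to other .onion hosts."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    source = urlsplit(page_url).hostname
    edges = set()
    for link in collector.links:
        target = urlsplit(link).hostname
        # Keep only links that leave the current site and stay inside Tor.
        if target and target.endswith(".onion") and target != source:
            edges.add((source, target))
    return sorted(edges)
```

Each returned pair is a candidate edge for a site graph; how such edges are actually stored is up to the data backend.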
structure
- data/ - various data, e.g. dumps
- docker/ - dockerfiles for services
- src/ - backend sources
  - app.py - API server
  - config.py - static configuration for now (TBD: replace with a Redis-based one)
  - const.py - constants reused in different parts of the code
  - scheduler.py - singleton scheduler worker code
  - schemas.py - API data models
  - worker.py - main code, works with the API server
  - data/ - currently contains pluggable backends for the main crawler data
  - meta/ - currently contains pluggable backends for page data
- ui/telegram - TG frontend sources
- .dockerignore, .gitignore, .env - service files
- dataload.py - loads links from data/data.csv
- docker-compose.yaml - dev (!) deployment configuration for docker swarm
- Makefile - run make all to build the images
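
The column layout of data/data.csv is not documented here, so the following is only a sketch of what loading seed links from a CSV file might look like. It assumes one link per row in the first column; `load_links` is a hypothetical name, and the real dataload.py may read other columns or submit the links to the API server directly.

```python
import csv
import io


def load_links(csv_file, column=0):
    """Yield non-empty, de-duplicated values from one column of a CSV stream."""
    seen = set()
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue  # skip blank or short rows
        link = row[column].strip()
        if link and link not in seen:
            seen.add(link)
            yield link


# In-memory sample standing in for data/data.csv (assumed layout).
sample = io.StringIO("http://aaa.onion/\nhttp://bbb.onion/,note\n\nhttp://aaa.onion/\n")
links = list(load_links(sample))
```

With a real file, the same function can be fed `open("data/data.csv", newline="")` instead of the in-memory sample.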
components
- API server - provides REST interface for all the data
- Workers
  - push - only submits data to API server
  - requests - performs queries to Tor network
  - cpu - performs cpu-heavy tasks (mostly parsing)
  - scheduler - periodically requests new tasks based on
- Data storages
  - Arango - stores main information about services and pages, including relations between them
  - ES - stores content of pages, used for searching
  - Redis - used as task broker for workers and as cache
- Apache Tika - extracts metadata from page content
- TG UI - provides simple search interface for Telegram (through API server)
- Tor proxy - essentially the Tor daemon, acts as an HTTP proxy gateway to the Tor network
- haproxy - load balancer (for torproxy and server instances)
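
Since the Tor proxy component is described as an HTTP proxy gateway, a request worker could in principle route traffic through it as in the sketch below. This is an assumption-laden illustration: the address `http://torproxy:8118` and the helper name `make_tor_opener` are hypothetical, and the actual workers may use a different HTTP client.

```python
import urllib.request


def make_tor_opener(proxy_url):
    """Build an opener that sends http/https requests through the given HTTP proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; the real host and port depend on the deployment.
opener = make_tor_opener("http://torproxy:8118")
# opener.open("http://example.onion/", timeout=30) would perform the request
# through the proxy; it is left commented out because it needs a running gateway.
```

Because the full URL is handed to the proxy, .onion names are resolved on the gateway side rather than by the worker.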
deploy
make all && docker stack deploy -c docker-compose.yaml tor-worm
stop && remove
docker stack rm tor-worm