dobbry vechur bfd8ca1e41 update readme 1 month ago

tor-worm

Python 3 Tor crawler

features

  • crawls Tor sites
  • stores links between sites and their pages
  • detects content type / language
  • provides language-specific content search
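As an illustration of the "links between sites and their pages" feature, here is a minimal sketch of how outgoing .onion links could be pulled from a fetched page using only the standard library. The names `LinkCollector` and `extract_onion_links` are hypothetical and do not come from this repository; the real crawler may parse pages differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_onion_links(page_url, html):
    """Return sorted (source_host, target_host, target_url) edges for .onion targets."""
    parser = LinkCollector()
    parser.feed(html)
    source_host = urlsplit(page_url).hostname
    edges = set()
    for href in parser.hrefs:
        # Resolve relative links against the page they were found on.
        target = urljoin(page_url, href)
        host = urlsplit(target).hostname
        if host and host.endswith(".onion"):
            edges.add((source_host, host, target))
    return sorted(edges)
```

Each returned tuple can be read as one edge from the source service to a target page, which is the kind of relation a graph store is suited for.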

structure

  • data/ - various data, e.g. dumps
  • docker/ - dockerfiles for services
  • src/ - backend sources
    • app.py - API server
    • config.py - static configuration for now (TBD: replace with Redis-based configuration)
    • const.py - constants that are reused in different parts of code
    • scheduler.py - singleton scheduler worker code
    • schemas.py - API data models
    • worker.py - main code, works with API server
    • data/ - currently contains pluggable backends for main crawler data
    • meta/ - currently contains pluggable backends for pages data
  • ui/telegram - TG frontend sources
  • .dockerignore, .gitignore, .env - service stuff
  • dataload.py - loads links from data/data.csv
  • docker-compose.yaml - dev (!) deployment configuration for docker swarm
  • Makefile - run make all to build the images
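The seed-loading step done by dataload.py could look roughly like the sketch below. The layout of data/data.csv is not documented here, so this assumes one link per row in the first column; `read_links` is a hypothetical name, not the script's actual function.

```python
import csv
import io
from typing import Iterator, TextIO


def read_links(source: TextIO) -> Iterator[str]:
    """Yield unique, non-empty links from the first CSV column, in file order."""
    seen = set()
    for row in csv.reader(source):
        if not row:
            continue  # blank line
        link = row[0].strip()
        if not link or link in seen:
            continue  # empty cell or duplicate
        seen.add(link)
        yield link


if __name__ == "__main__":
    # In-memory sample standing in for data/data.csv (assumed format).
    sample = io.StringIO("http://a.onion/\n\nhttp://b.onion/,note\nhttp://a.onion/\n")
    for link in read_links(sample):
        print(link)
```

For a real file, open it with `open("data/data.csv", newline="")` and pass the handle to `read_links`; how the links are then submitted to the API server is left out because that interface is not described here.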

components

  • API server - provides REST interface for all the data
  • Workers
    • push - only submits data to API server
    • requests - performs queries to Tor network
    • cpu - performs cpu-heavy tasks (mostly parsing)
    • scheduler - periodically requests new tasks based on
  • Data storages
    • Arango - stores main information about services and pages, including relations between them
    • ES - stores content of pages, used for searching
    • Redis - used as task broker for workers and as cache
  • Apache Tika - extracts metadata from page content
  • TG UI - provides simple search interface for Telegram (through API server)
  • Tor proxy - essentially the Tor daemon; acts as an HTTP proxy gateway to the Tor network
  • haproxy - load balancer (for torproxy and server instances)
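Since the Tor proxy is exposed as an HTTP proxy gateway, a requests worker presumably only needs to route its HTTP traffic through it. Below is a minimal standard-library sketch of that idea. The proxy address `http://127.0.0.1:8118` and the names `make_tor_opener` and `fetch` are assumptions for illustration; the actual host and port depend on the haproxy/torproxy setup in docker-compose.yaml.

```python
import urllib.request


def make_tor_opener(proxy_url: str = "http://127.0.0.1:8118") -> urllib.request.OpenerDirector:
    """Build an opener that sends http/https requests via the given HTTP proxy.

    The default proxy_url is an assumed placeholder, not a value from this repo.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener: urllib.request.OpenerDirector, url: str, timeout: float = 60.0) -> bytes:
    """Fetch a URL through the opener and return the raw response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Building the opener makes no network request, so it can be created once per worker and reused. Tor circuits are slow, hence the generous default timeout in `fetch`.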

deploy

make all && docker stack deploy -c docker-compose.yaml tor-worm

stop && remove

docker stack rm tor-worm