A simple web scraper that crawls https://basement.woodbine.nyc/ and dumps all linked internal pages to markdown files on disk.

The scraper is written in Python using scrapy and writes all the markdown from https://basement.woodbine.nyc/ to disk.

Appending /download to the end of any HedgeDoc page URL returns a plain-text file containing that page's markdown. The scraper starts at the markdown version of the homepage and follows [text](hyperlink) style markdown links. If a wiki page is not linked to from anywhere else, this script will not find it.
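
For the curious, here is a minimal sketch of how such a spider could work. The spider name pages matches the crawl command below; everything else (the class name, the link regex, the assumption that the homepage markdown lives at /download) is illustrative, and the real spider in hedgedoc_exporter may differ:

import re
from pathlib import Path
from urllib.parse import urljoin, urlparse

import scrapy

# Matches [text](hyperlink) style markdown links; captures the hyperlink.
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

class PagesSpider(scrapy.Spider):  # hypothetical class name
    name = "pages"
    # Assumption: the markdown version of the homepage is at /download.
    start_urls = ["https://basement.woodbine.nyc/download"]

    def parse(self, response):
        # Save the raw markdown under a path mirroring the page URL,
        # inside the output directory the README mentions below.
        page = urlparse(response.url).path.removesuffix("/download").strip("/") or "index"
        out = Path("None/basement.woodbine.nyc") / f"{page}.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(response.body)

        # Queue the /download form of every internal markdown link;
        # scrapy's built-in dupefilter avoids revisiting pages.
        for link in MD_LINK.findall(response.text):
            url = urljoin(response.url, link)
            if urlparse(url).netloc == "basement.woodbine.nyc":
                yield scrapy.Request(url.rstrip("/") + "/download")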

Run like this:

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ scrapy crawl pages

The markdown output will appear in the None/basement.woodbine.nyc directory.