add readme and requirements file

main
Paul Feitzinger 3 months ago
parent 5f2ddd562a
commit 181f45d92f

@@ -0,0 +1,12 @@
This is a simple web scraper in Python using [scrapy](https://docs.scrapy.org/) that writes all the markdown from https://basement.woodbine.nyc/ to disk.
Appending `/download` to the end of any HedgeDoc page URL returns a text file containing that page's markdown (for example, the homepage's markdown is at https://basement.woodbine.nyc/download). The scraper starts at the markdown version of the homepage and follows `[text](hyperlink)`-style markdown links. If a wiki page is not linked to from anywhere else, this script will not find it.
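For reference, the crawl logic might look roughly like the sketch below. The actual `pages` spider is not included in this diff, so the class body, the link-matching regex, and the output directory here are assumptions rather than the repo's code (in particular, the real scraper writes under `None/basement.woodbine.nyc`, as noted below):

```python
# Sketch only: approximates the behavior the README describes.
import pathlib
import re

import scrapy


class PagesSpider(scrapy.Spider):
    """Mirrors the wiki's markdown by crawling /download URLs."""

    name = "pages"
    start_urls = ["https://basement.woodbine.nyc/download"]

    # Captures the hyperlink part of [text](hyperlink) markdown links.
    LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

    def parse(self, response):
        # Each /download URL returns the page's raw markdown as plain text.
        page_url = response.url.removesuffix("/download")
        slug = page_url.replace("https://", "").strip("/") or "index"
        # Assumed output layout; the repo's scraper uses a different prefix.
        out = pathlib.Path("basement.woodbine.nyc") / f"{slug}.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(response.text)

        # Follow on-site markdown links, requesting their /download form.
        # Scrapy's built-in duplicate filter prevents revisiting pages.
        for href in self.LINK_RE.findall(response.text):
            url = response.urljoin(href)
            if "basement.woodbine.nyc" in url and not url.endswith("/download"):
                yield scrapy.Request(url.rstrip("/") + "/download")
```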
Run like this:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements
$ scrapy crawl pages
The markdown output will appear in the `None/basement.woodbine.nyc` directory.

@@ -0,0 +1,37 @@
attrs==24.3.0
Automat==24.8.1
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
constantly==23.10.4
cryptography==44.0.0
cssselect==1.2.0
defusedxml==0.7.1
filelock==3.17.0
hyperlink==21.0.0
idna==3.10
incremental==24.7.2
itemadapter==0.10.0
itemloaders==1.3.2
jmespath==1.0.1
lxml==5.3.0
packaging==24.2
parsel==1.10.0
Protego==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
PyDispatcher==2.0.7
pyOpenSSL==25.0.0
queuelib==1.7.0
requests==2.32.3
requests-file==2.1.0
Scrapy==2.12.0
service-identity==24.2.0
setuptools==75.8.0
tldextract==5.1.3
Twisted==24.11.0
typing_extensions==4.12.2
urllib3==2.3.0
w3lib==2.2.1
zope.interface==7.2