parent 5f2ddd562a
commit 181f45d92f
@@ -0,0 +1,12 @@
This is a simple web scraper in Python using [scrapy](https://docs.scrapy.org/) that writes all the markdown from https://basement.woodbine.nyc/ to disk.
Appending `/download` to the end of any hedgedoc page URL returns a text file containing that page's markdown. The scraper starts at the markdown version of the homepage and follows `[text](hyperlink)` style markdown links. If there are wiki pages that are not linked to from anywhere else, this script will not find them.
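The link-extraction step described above can be sketched with a small stdlib-only helper. This is a hypothetical illustration: `extract_markdown_links` and the sample text are not part of this repo, whose actual spider (run via `scrapy crawl pages`) is not shown in this diff.

```python
import re

# Matches [text](hyperlink) style markdown links, capturing the URL part.
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_markdown_links(markdown: str) -> list[str]:
    """Return the target URLs of all [text](url) links in a markdown string."""
    return MD_LINK_RE.findall(markdown)


# Example: markdown from one wiki page linking to two others.
sample = (
    "See [events](https://basement.woodbine.nyc/events) and "
    "[about](https://basement.woodbine.nyc/about)."
)
print(extract_markdown_links(sample))
```

A spider built this way would append `/download` to each extracted on-site URL before requesting it, so every response is markdown rather than rendered HTML.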
Run like this:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ scrapy crawl pages
The markdown output will appear in the `None/basement.woodbine.nyc` directory.
@@ -0,0 +1,37 @@
attrs==24.3.0
Automat==24.8.1
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
constantly==23.10.4
cryptography==44.0.0
cssselect==1.2.0
defusedxml==0.7.1
filelock==3.17.0
hyperlink==21.0.0
idna==3.10
incremental==24.7.2
itemadapter==0.10.0
itemloaders==1.3.2
jmespath==1.0.1
lxml==5.3.0
packaging==24.2
parsel==1.10.0
Protego==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
PyDispatcher==2.0.7
pyOpenSSL==25.0.0
queuelib==1.7.0
requests==2.32.3
requests-file==2.1.0
Scrapy==2.12.0
service-identity==24.2.0
setuptools==75.8.0
tldextract==5.1.3
Twisted==24.11.0
typing_extensions==4.12.2
urllib3==2.3.0
w3lib==2.2.1
zope.interface==7.2