commit 181f45d92f (parent 5f2ddd562a)
@@ -0,0 +1,12 @@
This is a simple web scraper in Python using [scrapy](https://docs.scrapy.org/) that writes all the markdown from https://basement.woodbine.nyc/ to disk.
Appending `/download` to the end of any HedgeDoc page URL returns a plain-text file containing that page's markdown. The scraper starts at the markdown version of the homepage and follows `[text](hyperlink)` style markdown links. If there are wiki pages that are not linked from anywhere else, this script will not find them.
Run like this:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements
$ scrapy crawl pages
The markdown output will appear in the `None/basement.woodbine.nyc` directory.
@@ -0,0 +1,37 @@
attrs==24.3.0
Automat==24.8.1
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
constantly==23.10.4
cryptography==44.0.0
cssselect==1.2.0
defusedxml==0.7.1
filelock==3.17.0
hyperlink==21.0.0
idna==3.10
incremental==24.7.2
itemadapter==0.10.0
itemloaders==1.3.2
jmespath==1.0.1
lxml==5.3.0
packaging==24.2
parsel==1.10.0
Protego==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
PyDispatcher==2.0.7
pyOpenSSL==25.0.0
queuelib==1.7.0
requests==2.32.3
requests-file==2.1.0
Scrapy==2.12.0
service-identity==24.2.0
setuptools==75.8.0
tldextract==5.1.3
Twisted==24.11.0
typing_extensions==4.12.2
urllib3==2.3.0
w3lib==2.2.1
zope.interface==7.2