parent 5f2ddd562a
commit 181f45d92f
@@ -0,0 +1,12 @@
This is a simple web scraper in Python using [scrapy](https://docs.scrapy.org/) that writes all the markdown from https://basement.woodbine.nyc/ to disk.
Appending `/download` to the end of any hedgedoc page URL returns a text file containing that page's markdown. The scraper starts at the markdown version of the homepage and follows `[text](hyperlink)` style markdown links. If there are wiki pages that are not linked to from anywhere else, this script will not find them.
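The link-extraction step described above can be sketched with a small stdlib-only helper. This is a hypothetical illustration: `extract_markdown_links` and the sample text are not part of this repo, whose actual spider (run via `scrapy crawl pages`) is not shown in this diff.

```python
import re

# Matches [text](hyperlink) style markdown links, capturing the URL part.
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_markdown_links(markdown: str) -> list[str]:
    """Return the target URLs of all [text](url) links in a markdown string."""
    return MD_LINK_RE.findall(markdown)


# Example: markdown from one wiki page linking to two others.
sample = (
    "See [events](https://basement.woodbine.nyc/events) and "
    "[about](https://basement.woodbine.nyc/about)."
)
print(extract_markdown_links(sample))
```

A spider built this way would append `/download` to each extracted on-site URL before requesting it, so every response is markdown rather than rendered HTML.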
Run like this:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ scrapy crawl pages
The markdown output will appear in the `None/basement.woodbine.nyc` directory.
@@ -0,0 +1,37 @@
attrs==24.3.0
Automat==24.8.1
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
constantly==23.10.4
cryptography==44.0.0
cssselect==1.2.0
defusedxml==0.7.1
filelock==3.17.0
hyperlink==21.0.0
idna==3.10
incremental==24.7.2
itemadapter==0.10.0
itemloaders==1.3.2
jmespath==1.0.1
lxml==5.3.0
packaging==24.2
parsel==1.10.0
Protego==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
PyDispatcher==2.0.7
pyOpenSSL==25.0.0
queuelib==1.7.0
requests==2.32.3
requests-file==2.1.0
Scrapy==2.12.0
service-identity==24.2.0
setuptools==75.8.0
tldextract==5.1.3
Twisted==24.11.0
typing_extensions==4.12.2
urllib3==2.3.0
w3lib==2.2.1
zope.interface==7.2