Add background info to the README

Anthony Wang 2022-03-01 08:23:09 -06:00
parent 3bc9e434ba
commit 81a13d37fb
Signed by: a
GPG Key ID: BC96B00AEC5F2D76

@@ -3,11 +3,36 @@
Fediverse ebooks bot using neural networks
## Background
[Text generation programs](https://en.wikipedia.org/wiki/ELIZA) have existed for decades. However, most [ebooks bots](https://github.com/Lynnesbian/mstdn-ebooks) today rely on Markov chains, which are fast but also produce text that sounds like you pulled some random words out of a hat (which isn't entirely inaccurate in this case).
But we have another solution. A very overhyped solution, yes: neural networks! Here are some samples produced by [this bot](https://social.exozy.me/@ebooks/):
> I toot. I don't want to. I'm happy with it. I like to make people laugh. The only thing I want is to use it to do a wonderful job.
> This is total BS, and there really is no difference between this and the Matrix Matrix Matrix Matrix Matrix.
> Follow me for the next few days. Please remember, I'm sorry, and I'm sorry for the inconvenience.
> @Gargron @gargron @mattkat @craj_chris I am not a lawyer. I am just the voice of God. I'm a non-profit organization and I can be seen in other ways.
As you can see, neural networks generate much more coherent text and even learn how to use mentions and hashtags. The caveat? It takes *only* a few hours to train the network, and generating text takes a few seconds, whereas Markov chains do both almost instantaneously.
This bot consists of three components:
- The `data.py` script accesses the fediverse server's database and retrieves all messages to form the training data for the neural network.
- The `train.py` script downloads a pre-trained [DistilGPT2](https://huggingface.co/distilgpt2) model (you can use a larger model like GPT-J if your hardware is powerful enough) and [fine-tunes](https://huggingface.co/docs/transformers/training) it on the training data from the database (see the sketch after this list). By using a pre-trained model, the bot already has a wide variety of knowledge about topics, including ones not even mentioned in the training data. This step takes a long time, but you only have to do it once.
- The `bot.py` script uses the fine-tuned model to generate text, and posts it to your fediverse server.
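For the curious, here's a minimal sketch of what the fine-tuning step boils down to. The `data.txt` filename and the training settings are illustrative assumptions; the real `train.py` is the authoritative version.

```python
# Minimal fine-tuning sketch (illustrative; see train.py for the real thing).
# Assumes the retrieval step wrote one post per line to data.txt.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenize the plain-text training data
dataset = load_dataset("text", data_files="data.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model"),
    train_dataset=dataset,
    # mlm=False selects standard causal language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("model")
```

Because the model starts from pre-trained weights, a short fine-tuning run is enough to pick up the writing style of the training data.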
## Usage
First, install the Python dependencies using your distro's package manager or `pip`: [psycopg2](https://www.psycopg.org), [torch](https://pytorch.org/), [transformers](https://huggingface.co/docs/transformers/index), and [datasets](https://huggingface.co/docs/datasets/). Additionally, for Mastodon and Pleroma, install [Mastodon.py](https://mastodonpy.readthedocs.io/en/stable/); for Misskey, install [Misskey.py](https://misskeypy.readthedocs.io/ja/latest/); and for Matrix, install [simplematrixbotlib](https://simple-matrix-bot-lib.readthedocs.io/en/latest/index.html). If your database or platform isn't supported, don't worry! It's easy to add support for other platforms and databases, and contributions are welcome!
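For example, for a Mastodon bot backed by PostgreSQL, `pip install psycopg2 torch transformers datasets Mastodon.py` pulls in everything at once (those are the packages' names on PyPI).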
Now retrieve the training data from your fediverse server's database using `python data.py -d 'dbname=test user=postgres password=secret host=localhost port=5432'`. Retrieving the training data from the database is not yet supported for Matrix. You can skip this step if you have collected training data from another source.
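Roughly speaking, this step amounts to something like the following sketch. The `statuses` table, `text` column, and `local` flag are assumptions based on Mastodon's schema; other platforms differ, and `data.py` handles more cases.

```python
# Rough sketch of the retrieval step (illustrative; data.py is authoritative).
# Assumes a Mastodon-style schema: a statuses table with a text column
# and a local flag marking posts that originated on this server.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres password=secret "
                        "host=localhost port=5432")
with conn, conn.cursor() as cur:
    cur.execute("SELECT text FROM statuses WHERE local")
    with open("data.txt", "w") as f:
        for (text,) in cur:
            f.write(text + "\n")
conn.close()
```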
Next, train the network with `python train.py`, which may take several hours; it's much faster on a GPU. If you need more advanced training options, you can also train using [run_clm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py).
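If you're not sure whether PyTorch can actually see your GPU, a quick check before kicking off a multi-hour run can save some pain:

```python
import torch

# Training silently falls back to the CPU if no GPU is visible, so check first
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU found; training will run (slowly) on the CPU")
```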