# ebooks
Fediverse ebooks bot using neural networks
## Background
[Text generation programs](https://en.wikipedia.org/wiki/ELIZA) have existed for decades. However, most [ebooks bots](https://github.com/Lynnesbian/mstdn-ebooks) today rely on Markov chains, which are fast but also produce text that sounds like you pulled some random words out of a hat (which isn't entirely inaccurate in this case).
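To see why Markov output sounds like words pulled from a hat, here is a minimal word-level Markov chain sketch (not this bot's actual code; the function names and toy corpus are purely illustrative):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=10, seed=None):
    """Walk the chain, picking a random successor of the previous word at each step."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "i toot on the fediverse and i toot about bots on the fediverse"
chain = build_chain(corpus)
print(generate(chain, "i", length=8, seed=1))
```

Each word is chosen only from the successors of the single previous word, so the text is locally plausible but has no long-range coherence.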

There is, however, another solution (a very overhyped one, yes): neural networks! Here are some samples produced by [this bot](https://social.exozy.me/@ebooks/):
> I toot. I don't want to. I'm happy with it. I like to make people laugh. The only thing I want is to use it to do a wonderful job.

> This is total BS, and there really is no difference between this and the Matrix Matrix Matrix Matrix Matrix.

> Follow me for the next few days. Please remember, I'm sorry, and I'm sorry for the inconvenience.

> @Gargron @gargron @mattkat @craj_chris I am not a lawyer. I am just the voice of God. I'm a non-profit organization and I can be seen in other ways.

As you can see, neural networks generate much more coherent text and even learn how to use mentions and hashtags. The caveat? Training the network takes *only* a few hours, and generating each post takes a few seconds, whereas Markov chains do both almost instantaneously.

This bot consists of three components:
- The `data.py` script accesses the fediverse server's database and retrieves all messages to form the training data for the neural network.
- The `train.py` script downloads a pre-trained [DistilGPT2](https://huggingface.co/distilgpt2) (you can use a larger model like GPT-J if your hardware is powerful enough) and [fine-tunes](https://huggingface.co/docs/transformers/training) it on the training data from the database. By using a pre-trained model, the bot will already have a wide variety of knowledge about topics, including ones not even mentioned in the training data. This step takes a long time but you only have to do it once.
- The `bot.py` script uses the fine-tuned model to generate text, and posts it to your fediverse server.
## Usage
First, install Python dependencies using your distro's package manager or `pip`: [psycopg2](https://www.psycopg.org), [torch](https://pytorch.org/), [transformers](https://huggingface.co/docs/transformers/index), and [datasets](https://huggingface.co/docs/datasets/). Additionally, for Mastodon and Pleroma, install [Mastodon.py](https://mastodonpy.readthedocs.io/en/stable/), for Misskey, install [Misskey.py](https://misskeypy.readthedocs.io/ja/latest/), and for Matrix, install [simplematrixbotlib](https://simple-matrix-bot-lib.readthedocs.io/en/latest/index.html). If your database or platform isn't supported, don't worry! It's easy to add support for other platforms and databases, and contributions are welcome!
Now retrieve the training data from your fediverse server's database using `python data.py -d 'dbname=test user=postgres password=secret host=localhost port=5432'`. Retrieving the training data from the database is not yet supported for Matrix. You can skip this step if you have collected training data from another source.
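The real `data.py` connects to the server's PostgreSQL database with psycopg2, but the idea can be illustrated with a self-contained sketch. Here sqlite3 stands in for Postgres so the example runs anywhere, and the `statuses` table and `text` column are assumptions, not the actual schema:

```python
import sqlite3

# Stand-in for the fediverse server's database; the real script connects
# with psycopg2 using a DSN like 'dbname=test user=postgres ...'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statuses (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO statuses (text) VALUES (?)",
    [("first toot",), ("second toot",)],
)

# Dump every message, one per line, as plain-text training data.
with open("data.txt", "w") as f:
    for (text,) in conn.execute("SELECT text FROM statuses ORDER BY id"):
        f.write(text + "\n")
```

The result is a flat text file that the training step can consume directly.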
Next, train the network with `python train.py`, which may take several hours; it is much faster on a GPU. If you need more advanced training options, you can also use Hugging Face's [run_clm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py).
Finally, create an application for your bot account and generate an access token. Run the bot with `python bot.py -b server_type -i fediverse.instance -t access_token`. You can omit `-b server_type` for Mastodon and Pleroma. To run the bot periodically, create a cron job. Enjoy!
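For example, a crontab entry that posts once an hour might look like this (the path and the `-i`/`-t` arguments are placeholders for your own setup):

```
0 * * * * cd /path/to/ebooks && python bot.py -b server_type -i fediverse.instance -t access_token
```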
## Resources
- https://closeheat.com/blog/pytorch-lstm-text-generation-tutorial
- https://trungtran.io/2019/02/08/text-generation-with-pytorch/
- https://huggingface.co/docs/transformers/training
- https://huggingface.co/blog/how-to-generate
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling