initial commit

commit c54df77319
Author: Alain Zscheile
Date:   2022-11-30 12:55:40 +01:00

3 changed files with 171 additions and 0 deletions

WIP/ARCHITECTURE.txt (new file)

@@ -0,0 +1,44 @@
mutex for distribution + packet spooler
in the Elixir version, this uses one process for each of them,
but that led to race conditions and deadlocks between these parts...
ok, what do I want:
* spooler should use flat files, split into multiple files when they get too big
* no TLS, it just makes things more complicated, and we also can't trust any nodes in the network,
besides those originating/producing messages, which have pubkeys + signatures that we can manage through an allowlist
* it might be the case that there exist clusters of servers which trust each other,
but they probably will be connected by a trusted network, so no need for TLS there either.
* might make sense to split sockets into read and write parts
* use 2 threads per connection
* worker threadpool for filtering
* 16-bit message length fields make abuse much harder (the problematic stuff is mostly images and large binaries,
think mostly copyright BS, porn, etc., maybe malware sharing. don't want any of that,
even if parts would be nice, because moderating that is too much of a hassle, and really exhausting)
max 64 KiB messages have another advantage: they make it much harder to completely DDoS the network
and exhaust the RAM and disk space of participating servers.
* need multiple channels, due to the feedback loop
* use non-async Rust, so we can use `sendfile(2)` via std::io::copy! (see the framing sketch after this list)
* shard messages according to the first bytes of the signature
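
A rough sketch of what the per-connection framing could look like in non-async Rust
(names like read_frame/send_spool_file are made up, not part of the design; the 64 KiB
cap simply falls out of the u16 length prefix, and on Linux std::io::copy may get
specialized to sendfile(2)/splice(2) for File -> socket copies):

use std::fs::File;
use std::io::{self, Read};
use std::net::TcpStream;

// Read one length-prefixed frame; the u16 big-endian length field
// caps a frame at 64 KiB by construction. (sketch, not normative)
fn read_frame(conn: &mut TcpStream) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 2];
    conn.read_exact(&mut len)?;
    let len = u16::from_be_bytes(len) as usize;
    let mut buf = vec![0u8; len];
    conn.read_exact(&mut buf)?;
    Ok(buf)
}

// Push a spooled blob to a peer. With blocking std I/O, io::copy can be
// specialized by the standard library to sendfile(2)/splice(2) on Linux,
// so the data does not have to pass through userspace buffers.
fn send_spool_file(conn: &mut TcpStream, path: &str) -> io::Result<u64> {
    let mut f = File::open(path)?;
    io::copy(&mut f, conn)
}
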
Channels between threads (a Rust enum sketch follows this list):
* initially, from spool -> conn/write
* "send :ihave for all entries in the spooler"
* conn/read -> filterpool
* "received message"
* conn/read -> conn/write
* "got :ihave, send :pull"
* filterpool -> conn/write
* "send :ihave for latest slice for recently changed pubkey (multiple such messages should be coalesced)"
* "distribute message" via sendfile
no coordination should be necessary for closed read/write channels
(the appropriate threads just terminate), but it is necessary to have
a way to schedule a re-connection after both terminate. So we need a
central list of configured peers, where the associated threads are linked
via AtomicUsize + event-listener::Event
the filterpool acquires a mutex for writing to a shard
(allocate an array with 2^16 slots for that)
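
A sketch of what that central peer list and the shard locks could look like
(assuming the event-listener crate; field and function names are made up):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

use event_listener::Event;

// one entry per configured peer
struct PeerSlot {
    addr: String,
    // number of live threads for this peer (0, 1 or 2)
    live_threads: AtomicUsize,
    // fired when a thread exits, so a supervisor can schedule a reconnect
    changed: Event,
}

impl PeerSlot {
    fn thread_exited(&self) {
        if self.live_threads.fetch_sub(1, Ordering::AcqRel) == 1 {
            // both the read and the write thread are gone -> wake the supervisor
            self.changed.notify(1);
        }
    }
}

// per-shard write locks for the filterpool: 2^16 slots,
// indexed by the first two bytes of the signature
fn shard_locks() -> Vec<Mutex<()>> {
    (0..1usize << 16).map(|_| Mutex::new(())).collect()
}
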

WIP/PROTO.txt (new file)

@@ -0,0 +1,72 @@
the Elixir version uses an ASN.1-based format, but support for ASN.1 in
languages other than C is surprisingly bad.
So I'll resort to a hard-coded format. Again.
----
InitMessage ::=
capabilities:ShortList<u32(big-endian)>
(* the InitMessage is the first thing sent on a newly established connection.
* both sides are expected to send the capabilities they support
* then wait for the other side to send theirs (meaning they don't
* block each other during this), and then use just the capabilities
* which both peers support *)
ProtoMessage ::=
algorithm:SigAlgo
pubkeylen:u16(big-endian)
length:u16(big-endian)
data:[length]u8=ProtoMsgKind<pubkeylen>
ProtoMsgKind<pubkeylen> ::=
(* xfer *) [0x00] XferSlice<pubkeylen>
| (* summary-pull *) [0x01] SummaryList<pubkeylen>
| (* summary-ihave *) [0x02] SummaryList<pubkeylen>
(* after the initial capabilities are sent, both sides are expected to send
* summaries for all public keys they know + all available slices for these
* pubkeys *)
ShortList<T> ::= length:u16(big-endian) data:[length]T
XSList<T> ::= length:u8 data:[length]T
(* "raw" summary header overhead = 41 bytes, per 65535 slices *)
SummaryList<pubkeylen> ::=
data:[*]Slice<pubkeylen> (* uses all remaining space in the packet *)
Slice<pubkeylen> ::=
pubkey:[pubkeylen]u8
start:u64
length:u16
XferSlice<pubkeylen> ::=
pubkey:[pubkeylen]u8
siglen:u16(big-endian)
start:u64
data:[*]XferBlob<siglen> (* uses all remaining space in the packet *)
XferBlob<siglen> ::=
signature:[siglen]u8 (* signature gets calculated over `blid ++ data` *)
ttl:u8
data:XferInner
XferInner ::=
(* normal *) [0x00] XferNormal
| (* prune *) [0x01] (* empty; NOTE: this packet indicates that
* all packets before it should be deleted on all servers;
* used for data protection/privacy compliance;
* although it is possible to delete such packets after a grace period
* (e.g. 1 year) during which every peer should've had a chance to retrieve
* it, doing that is discouraged *)
XferNormal ::=
attrs:ShortList<KeyValuePair>
data:ShortList<u8>
KeyValuePair ::=
key:XSList<u8>
value:ShortList<u8>
SigAlgo ::=
(* ed25519 *) [0x65, 0x64, 0xff, 0x13]
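
A rough Rust reading of the outer framing (only the header layout and the ed25519
SigAlgo constant come from the grammar above; the function names and error handling
are made up):

use std::io::{self, Read};

// SigAlgo identifier for ed25519, as defined above
const SIGALGO_ED25519: [u8; 4] = [0x65, 0x64, 0xff, 0x13];

// outer ProtoMessage header; `data` still starts with the ProtoMsgKind tag
struct ProtoMessage {
    algorithm: [u8; 4],
    pubkeylen: u16,
    data: Vec<u8>,
}

fn read_proto_message(r: &mut impl Read) -> io::Result<ProtoMessage> {
    let mut algorithm = [0u8; 4];
    r.read_exact(&mut algorithm)?;
    let mut tmp = [0u8; 2];
    r.read_exact(&mut tmp)?;
    let pubkeylen = u16::from_be_bytes(tmp);
    r.read_exact(&mut tmp)?;
    let length = u16::from_be_bytes(tmp) as usize;
    let mut data = vec![0u8; length];
    r.read_exact(&mut data)?;
    Ok(ProtoMessage { algorithm, pubkeylen, data })
}

// capability negotiation per InitMessage: send ours, read theirs,
// then keep only the intersection
fn common_capabilities(ours: &[u32], theirs: &[u32]) -> Vec<u32> {
    ours.iter().copied().filter(|c| theirs.contains(c)).collect()
}
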

WIP/SPOOLER.txt (new file)

@@ -0,0 +1,55 @@
spool file organization:
the directory tree looks like this:
{SigAlgo:04x}/{:02x}/{:02x}/...
the first level is the SigAlgo identifier of the signature algorithm in use.
the second and third levels are the first 2 bytes of the public key of a party,
encoded as hexadecimal.
the public keys all have the same length, and get encoded as urlsafe base64 in the file names.
## per public key ...
... there can be a bunch of associated files.
- {pubkey}.{chunkid}.slcs contains slice listings (* =SLCS *)
- {pubkey}.{chunkid}.data contains the actual data (* =DataS *)
- {pubkey}.lock is a lock file for the public key, which gets used to prevent overlapping writes and GCs...
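
A sketch of how those paths could be put together (the urlsafe_base64 helper is just
a placeholder for whatever base64 implementation ends up being used):

use std::path::PathBuf;

// build the .slcs path for one pubkey chunk, following the layout above
fn slcs_path(sigalgo: u32, pubkey: &[u8], chunkid: u64) -> PathBuf {
    let mut p = PathBuf::new();
    p.push(format!("{:04x}", sigalgo));   // SigAlgo identifier
    p.push(format!("{:02x}", pubkey[0])); // first pubkey byte, hex
    p.push(format!("{:02x}", pubkey[1])); // second pubkey byte, hex
    p.push(format!("{}.{}.slcs", urlsafe_base64(pubkey), chunkid));
    p
}

// placeholder; a real implementation would use a urlsafe base64 encoder
fn urlsafe_base64(_pubkey: &[u8]) -> String {
    unimplemented!()
}
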
<proto>
SLCS ::= [*]SlicePtr
SlicePtr ::= (* 16 bytes per entry *)
(* slice boundaries *)
slice:Slice
(* data boundaries *)
dptr:Pointer
Slice ::=
start:u64(big-endian)
length:u16(big-endian)
Pointer ::=
start:u32(big-endian)
length:u16(big-endian)
DataS ::=
(* @siglen is inferred from the used SigAlgo *)
[*]XferBlob<@siglen>
</proto>
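
Decoding one 16-byte SlicePtr entry from a .slcs file is straightforward (sketch):

// one 16-byte entry of a .slcs file, as laid out above
struct SlicePtr {
    slice_start: u64,  // slice boundaries (big-endian on disk)
    slice_length: u16,
    data_start: u32,   // data boundaries inside the .data file
    data_length: u16,
}

fn parse_slice_ptr(b: &[u8; 16]) -> SlicePtr {
    SlicePtr {
        slice_start: u64::from_be_bytes(b[0..8].try_into().unwrap()),
        slice_length: u16::from_be_bytes(b[8..10].try_into().unwrap()),
        data_start: u32::from_be_bytes(b[10..14].try_into().unwrap()),
        data_length: u16::from_be_bytes(b[14..16].try_into().unwrap()),
    }
}
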
... for compaction, the corresponding pubkey gets locked,
- new temporary files are created in the corresponding directory,
- the slices get sorted, and for each blob the length gets calculated,
- summed up starting from the newest blob, going in reverse,
- until we hit the maximum size per pubkey
(usually available storage space * 0.8 divided by the number of known pubkeys;
see the sketch at the end of this file),
- then we cut off the remaining, not-yet-processed blobs,
- now start again from the first kept blob, going forward,
- write all of them to a new data file, and create a corresponding slice listing file,
- note that adjacent slices in the slice listing file also get merged
the compaction algorithm should run roughly once every 15 minutes, but should only visit
pubkeys to which data has been appended in that timespan.
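
The budget rule and the reverse scan, as a sketch (the 0.8 factor and the even split
come from the description above; everything else is assumed):

// maximum bytes of spooled data per pubkey: roughly 80% of the
// available storage, split evenly across all known pubkeys
fn max_size_per_pubkey(available_storage: u64, known_pubkeys: u64) -> u64 {
    ((available_storage as f64 * 0.8) / known_pubkeys as f64) as u64
}

// given blob lengths ordered oldest -> newest, return the index of the
// first blob to keep: walk backwards from the newest, summing lengths,
// and stop before the budget would be exceeded
fn first_kept_blob(blob_lens: &[u64], budget: u64) -> usize {
    let mut total = 0u64;
    for (i, len) in blob_lens.iter().enumerate().rev() {
        if total + len > budget {
            return i + 1; // everything before this index gets cut off
        }
        total += len;
    }
    0 // everything fits
}
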