Distributed systems

There are quite a few definitions of "distributed" and "decentralized" in use; in this note I'm using the following ones:

Centralized
Clients interacting with a single server (a single physical machine, or several controlled by the same entity).
Decentralized
Clients interacting with multiple servers (controlled by different entities), which often build a federated network.
Distributed
Clients interacting with other clients directly, acting as servers themselves.

A "system" may also mean different things; here I focus on network protocols, on systems of network-connected independent actors.

Distributed systems are useful for various purposes, but the commonly considered and achievable niceties are:

  • No single point of failure.
  • No need for a central authority.
  • Potentially better software: there is more motivation to work on it when the effort neither assists unethical activities nor is wasted once a service is discontinued.

These are mostly shared with federated systems; distributed ones just take them further.

The common advantages of centralized systems over these seem to be search/discovery, often sort-of-free hosting for end users, greater UI uniformity in some cases, and easier/faster introduction of new features.

Usable systems

Actually usable (reliably working, specified, having users and decent software) systems so far are usually federated/decentralized; those can, in principle, be quite close to distributed systems (simply by running their servers on user machines). So it generally seems more useful to focus on those if the intention is to get things done: SMTP (email), NNTP (Usenet), XMPP (Jabber), and HTTP (World Wide Web) are relatively well-supported, standardized, and usable for various kinds of communication.

Sometimes even centralized but non-commercial projects and services are okay: OpenStreetMap, The Internet Archive, Wikimedia Foundation projects (Wikipedia, Wiktionary, Wikidata, Wikibooks, etc), arXiv, FLOSS projects, possibly LibGen and Sci-Hub (though they infringe copyright), possibly Libera.Chat (but they had issues arising out of centralization, which is why it is not Freenode anymore). As long as they are easy (and legal, and free) to fork and aren't in a position to extort users, centralization can be fine. Conversely, there can be technically distributed systems effectively controlled by a single entity (e.g., a distributed PKI with a single root, or anything legally restricted). While this note is mostly about distributed network protocols, they are neither necessary nor sufficient for community control over a system; rather, they may just be a useful tool to achieve it.

Existing systems

There are quite a few of them; I am going to write mostly about those that work over the Internet. There's also the "Distributed computing architecture" Wikipedia category, including things like cluster computing, grid computing, etc.

Generic networks

Tor and I2P: both support "hidden services", on top of which many regular protocols can be used, but they are more about privacy (and a bit about routing) than about decentralization: they provide NAT traversal, encryption, and static addresses. Tor documentation is relatively nice, and there are I2P docs. Tor provides a nice C client; I2P uses Java.
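
For illustration, a minimal sketch of exposing a local web server as a Tor hidden service via torrc (the path and ports are arbitrary examples):

    # torrc: map port 80 of the generated .onion address
    # to a server listening on the local machine
    HiddenServiceDir /var/lib/tor/example_service/
    HiddenServicePort 80 127.0.0.1:8080

Tor then writes the .onion hostname into the HiddenServiceDir, and any TCP-based protocol gets a static, NAT-traversing address this way.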

Mesh networks

Some mesh networks, like Telehash, provide routing as well, though their advantages for decentralization seem similar to those of Tor and I2P; they are just better in extending beyond the existing networks, aiming to build more. Telehash documentation is also pretty nice and full of references.

Cjdns (or its name, at least) seems to be relatively well-known, but it relies on Node.js. Netsukuku and B.A.T.M.A.N. are two more protocols whose names come up.

One of the large Wi-Fi mesh networking projects is Freifunk, but apparently it's only widespread in the DACH countries.

Those would be nice to have someday, but they would require quite a lot of users to function, and various government restrictions seem to complicate their usage (this varies from jurisdiction to jurisdiction and from year to year, but seemed pretty bad in Russia in 2018, and even worse by 2023).

And then there are the ones working over the Internet, building overlay networks, usually with technologies similar to those used for VPNs (though yet again, in Russia by 2023 they seem to be about to start blocking protocols used for VPNs, with occasional outages/likely testing reported). Yggdrasil is like that. There is an overview of similar mesh networks: "Easily Accessing All Your Stuff with a Zero-Trust Mesh VPN".

IM and other social services

  • Tox implements its own network (DHT, onion routing, NAT traversal, etc), and has some documentation. Works, though not particularly easy to build, and toxic (apparently the primary implementation) ceases to work after a few days here, requiring a restart.
  • Rival Messenger and Bleep are based on Telehash and BitTorrent, respectively. Have not tried those.
  • RetroShare provides a bunch of features, but with a web-based UI, and I gave up on building it.
  • Matrix seems to be getting relatively popular, but it uses HTTP APIs, the specification is not available without JS, and there are SDKs (I wonder whether it's ever useful to provide an SDK instead of a single documented library; usually it's just additional pain to work with), web-based clients, etc – it seems pretty unpleasant overall, following poor practices. Though it's federated, not distributed; functionally it's similar to XMPP with a few XEPs included in the core.
  • Ricochet reuses the Tor network; its protocol is documented and doesn't seem to be bloated. Unfortunately, it's bundled with a GUI, apparently there is no separate library, and it's in C++ anyway, which would make bindings harder if there were one. Probably it wouldn't be that hard to reimplement, or to extract the non-GUI bits and make C bindings, to get a reusable library.
  • XMPP is nice and is supported relatively widely (with a choice of servers, clients, and libraries), but federated, rather than distributed, though the former may be converted into the latter.
  • Email: likewise, but using it in a distributed fashion wouldn't be interoperable with common deployments in most cases, and some software may assume a federated setting.
  • ActivityPub: federated, replaces OStatus, partially supported by Mastodon (which seems to be getting popular); used for both microblogging and private messaging. RDF-compatible (though awkward JSON-LD is used in Activity Streams), a W3C recommendation. Hence a good specification, and generally it doesn't look too bad, but the specification doesn't include authentication and authorization as of now (January 2018), and the existing implementations seem to be all awkward: rather poor web UIs, languages such as JS. I finally gave Mastodon a try in 2023, as a user; not bad and generally works, but that "RDF compatibility" (as opposed to actually using RDF) shows: for instance, to add metadata even within a single instance, the Mastodon Glitch edition appends emojis to textual messages. I hear it is done that way to keep it compatible with the vanilla version. The primary web-based UI is pretty awkward and buggy. Another somewhat popular ActivityPub-based project is Lemmy, a federated link aggregator and forum.
  • Secure Scuttlebutt is akin to RSS or Atom feeds with signed posts, which include hashes of previous ones, but uses a gossip protocol, rather than a fixed address per feed. Perhaps it is more like a VCS repository with signed commits, where posts are only added. But apparently the primary client is in JS and buggy, and it does not seem to be actively developed (as of 2024).
  • Other IMs: there is a nice comparison of privacy-oriented IMs, file sharing services, and social networks on the secushare website.
  • Other social networking tools: there is a wiki comparison of those.

See also: Distributed state and network topologies in chat systems.

File sharing and websites

  • BitTorrent, of course, with Mainline DHT.
  • IPFS seems to be getting, well, maybe not popular, but mentioned here and there. There are papers and it is documented, but the implementations are currently in Go (reference), JS (incomplete), and Python (started). So trying it would involve setting up the whole Go thing, but the IPFS whitepaper looks nice. There is documentation, and a few separate parts (which can be and are isolated into libraries, though it would be more helpful if they were actually reusable C libraries), but they still are parts of a single project, which is not small or simple. There's a growing number of projects using it, such as OrbitDB, and then distributed IMs like Berty (though these projects tend to continue the awkward theme of semi-broken websites, Go + JS, poor interoperability and documentation). Though later it was merged with a cryptocurrency.
  • Freenet is a distributed data store, apparently not very interactive. Or maybe it is; it's in Java, and I didn't try it myself.
  • ZeroNet: haven't tried it, and it's in Python, but apparently it's popular enough to at least mention. Apparently it doesn't care much about security (see an HN thread). There are other similar projects (e.g., Beaker Browser), which seem to market slightly disguised WWW as a new invention.
  • HTTP/rsync/Gopher/whatever, possibly over Tor to get fixed addresses (see the sketch after this list).
  • Gnutella: see below.
  • GNUNet: see below.
  • Dat protocol uses small public keys for addressing, and various discovery methods, somewhat similar to using regular file transfer protocols over Tor. The primary implementation is in JS, and the documentation suggests installing it with curl ... | bash. Apparently it gets praised for its documentation, most of which is just awkward raster images.
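
As for plain file transfer tools over Tor (mentioned above), a sketch assuming torsocks is installed and a peer serves rsync over SSH on a hidden service (the onion address is a placeholder):

    # route rsync-over-SSH through the local Tor SOCKS proxy
    torsocks rsync -av user@exampleonionaddress.onion:files/ ./files/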

Search

Web crawling

YaCy and a few more (some of which are dead by now) distributed search engines exist. I have only tried YaCy, and it works, though I haven't managed to find its technical documentation – so it's not clear how it works.

Other information

These networks include search for files, but by their names; they are not content-addressed (so results can't be easily verified, which brings additional challenges).

  • Gnutella again: used for file sharing, with query-based search (an unstructured system, as opposed to DHT-based and content-addressable structured ones). Somewhat limited and hardly secure/reliable for search, but seemed to work in practice. The first version used query flooding, while gnutella2 uses a random walk.

Cryptocurrencies

Plenty of those have popped up recently. Bitcoin-like ones (usually with proof of work and block chaining) look like quite a waste of resources (and perhaps a pyramid scheme) to me, though the idea itself is interesting. I was rather interested in "digital cash" payment systems before, but those haven't quite taken off so far.

As of 2021, Bitcoin-like cryptocurrencies seem to be eating other distributed projects: many of those are merged with their custom cryptocurrencies, or occasionally piggyback on existing ones, but either way they become more complicated and commercialized. As of 2022, the "crypto" clipping seems to be associated more widely with cryptocurrencies and related technologies than with cryptography in general. But as of 2024, it seems that the hype wave is mostly over, with "AI" (generative stuff) filling up all the hype slots.

General P2P networking tools

GNUnet

Not sure how to classify it, but here are some links: gnunet.org, the GNUnet article in Wikipedia, "A Secure and Resilient Communication Infrastructure for Decentralized Networking Applications". Seems promising, but it's tricky to build, to figure out how it all works, and to do anything with it now (a lack of documentation seems to be the primary issue, though probably there are others). Apparently it is also being blocked in Russia by 2024, at least the gnunet.org website is (via TSPU, it seems), which makes it yet harder to debug. Apparently it is easier to set up in a single-user mode, but none of the retrieved bootstrap peer addresses seem to be available. An up-to-date hostlist can be found (having to use some proxying to access lists.gnu.org from Russia, where it is blocked as well), and then bootstrapping works.
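
For reference, hostlist servers can be set in the peer configuration; a sketch with a placeholder URL (the section and option names are per GNUnet's hostlist daemon documentation, so worth double-checking against the installed version):

    # gnunet.conf
    [hostlist]
    # -b: bootstrap by downloading hostlists from the servers below
    OPTIONS = -b
    SERVERS = https://example.org/hostlist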

Taler and secushare (using PSYC) are getting built on top of it, but it's not clear how it's going, or how abandoned or alive it is. Their documentation also seems to be obsolete/outdated/abandoned/incomplete. Update (January 2018): apparently the secushare prototype won't be released this year.

libp2p

libp2p apparently provides common primitives needed for peer-to-peer networking in the presence of NATs and other obstructions. At the time of writing there's no C API (so it's only usable from a few languages) and its website is quite broken. Meanwhile worldwide IPv6 adoption exceeds 32%, so possibly NATs will disappear before the workarounds become usable.

General tools useful for P2P networking

Many networking-related tools can be used for peer-to-peer networking. socat(1) is among the particularly flexible tools for relaying, and can be combined with many other Unix tools for ad hoc networking: openssl, gnutls-cli, and netcat for data encryption and transmission; sox (rec, play), opusenc, pw-record, pw-play, ffplay for audio capture, encoding, decoding, and playback.
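
As a sketch of such ad hoc networking: a one-way voice link between two machines, assuming sox and socat are installed (the host name and port are arbitrary):

    # receiver: accept a TCP connection and play the incoming Ogg stream
    socat TCP-LISTEN:5555,reuseaddr - | play -t ogg -

    # sender: record from the default microphone, encode, and send
    rec -t ogg - | socat - TCP:receiver.example.org:5555

Piping the stream through encryption tools, or running it over a hidden service, would not change the rest of the pipeline.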

Generic protocols

There are more or less generic network protocols that may be used, possibly together with Tor, to get working and secure peer-to-peer services.

SSH is quite nice and layered. But its authentication is apparently not designed for distributed systems (such as distributed IMs or file sharing), its connection layer looks rather bloated, and generally it's not particularly simple. Those are small bits of a large protocol, but they seem to make it not quite usable for peer-to-peer communication.

TLS may provide mutual authentication, and there are readily available tools to work with it.
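For instance, with two self-signed certificates exchanged in advance, openssl's bundled tools can already provide a mutually authenticated channel (file and host names here are placeholders):

    # peer A: listen, present its certificate, and require the client
    # to present one verifiable against peers.pem
    openssl s_server -accept 5555 -cert a.pem -key a.key \
            -CAfile peers.pem -Verify 1

    # peer B: connect, presenting its own certificate and verifying A's
    openssl s_client -connect a.example.org:5555 -cert b.pem -key b.key \
            -CAfile peers.pem -verify_return_error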

IPsec uses an approach similar to TLS, but is a generally better way to solve the same problems. Widespread P2P use of it would require individually reachable addresses (which IPv6 should bring), though. IPv6 is getting adopted, but slowly. Once computers become individually addressable (again) and transport-layer encryption is there by default, that may render plenty of the contemporary higher-level network protocols obsolete.

Pretty much every distributed IM tries to reinvent everything, and virtually none are satisfactory, but at least some of the problems are already solved separately: one can use dynamic DNS, Tor, or a VPN to obtain reachable addresses (even if the involved IP addresses change, and/or are behind NAT), and then use any basic/common communication protocol on top. Or even set up a VM and rely on SSH access, communicating inside that system then.
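
A sketch of the reachable-address part, using a reverse SSH tunnel through any mutually reachable machine (host names and ports are placeholders):

    # on the peer behind NAT: expose local port 9000 on the relay
    # (the relay needs GatewayPorts enabled for external connections)
    ssh -N -R 9000:localhost:9000 relay.example.org

    # the other peer then talks to relay.example.org:9000
    # using whatever protocol both sides agreed on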

Search, FOAF, and the rest of RDF

Some kind of a distributed search/directory may connect small peer-to-peer islands into a usable network. While it is hard to decide on an algorithm, lists of known and/or somewhat trusted nodes are common for both structured and unstructured networks, as well as for uses of social graphs: if those lists were provided by peers, a client could decide by itself which algorithm to apply. This reduces the task to just including known nodes into local directory entries, which can be shipped over any other protocols (e.g., HTTP, possibly over Tor).

Knowledge representation, which is needed for a generic directory structure, is tricky, but there is RDF (Resource Description Framework) already. There is FOAF (the friend-of-a-friend ontology), specifically for describing persons, their relationships (including linking the persons they know), and other social things. A basic FOAF search engine should be fairly straightforward to set up: basically a triple store filled with FOAF data. See also: Semantic Web.
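
A minimal FOAF description in Turtle, for illustration (the names and URLs are made up):

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <https://alice.example.org/#me>
        a foaf:Person ;
        foaf:name "Alice" ;
        foaf:knows <https://bob.example.org/#me> .

A crawler following foaf:knows links and loading the fetched documents into a triple store would already make for a primitive directory.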

Hubs and addressing

As mentioned in the "usable systems" section above, the systems relying on peering seem to fare better in practice: they are still distributed, on the level of servers (or hubs generally), which then take care of tricky parts on behalf of the users. This is also how postal systems, telephone ones, and the Internet itself are organized. And some of those federated systems can be quite close to distributed ones: for instance, it is easy and viable to set up an XMPP or a WWW server on one's personal machine, although normally addressing is centralized in those cases.

The Magnet URI scheme combines content addressing, which is not centralized, with a list of addresses to bootstrap from. Perhaps one could similarly use public keys, with claims signed by those, which would be very similar to certificates and key servers. There are no nice, human-readable addresses that way, as is usually the case with distributed addressing, but it creates a decentralized identity, decoupled from any particular nodes.
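
For instance, in a typical BitTorrent magnet link the xt ("exact topic") parameter carries the content hash, while dn (display name) and tr (tracker) merely help with naming and bootstrapping (the hash and tracker below are placeholders):

    magnet:?xt=urn:btih:0123456789abcdef0123456789abcdef01234567&dn=example&tr=udp://tracker.example.org:6969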

There is the similar concept of self-sovereign identity, with decentralized identifiers (DIDs) as a fairly generic framework. Similarly to Activity Streams, they are based on the awkward (but RDF-compatible) JSON-LD. See DID Methods for more specific specifications, though many of those are blockchain-based (probably because DIDs appeared when those were particularly hyped/popular).

GNUnet's GNS (RFC 9498) has a DID method defined. It combines local "pet names" (aliases) and memorable labels (subdomains) with public keys as unique zones (identifiers). For DID identifiers, it simply uses GNS zone keys, and stores DID documents as records of type DID_DOCUMENT under the "apex label". Zone delegation is similar to that of regular DNS. Both GNS and R5N (GNUnet's DHT) look fine. But TLSA records don't seem to work with its dns2gns, and even if they did, they would not be trusted without DNSSEC, while CAs do not support GNS. So software would have to support GNS explicitly, at which point it could just as well use GNUnet's CADET instead of TLS. But the main GNUnet implementation is under the AGPL, which is not likely to help wide adoption via embedding into existing software.

Another effort to organize name lookups independent of ICANN is OpenNIC, but it uses an alternative DNSSEC hierarchy, including its own root keys, which breaks the usual validation for ICANN domains. And it is still a centralized system. Maybe memorable and human-readable addresses are not that important anyway: it seems that people rarely remember them, do not handle them directly (using non-unique nicknames instead), and happily use phone numbers, sometimes even preferring those over memorable addresses.

But back to more practical (readily usable) systems: OpenPGP certificates actually are quite similar to Magnet links, in that they ship a public key along with one or more identities, which usually are email addresses, and they can be retrieved by various means (DANE, WKD, various key servers, manual exchange, etc). I think it keeps being "pretty good" for many use cases.
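
For example, GnuPG can locate a certificate by the email address it carries, trying methods such as WKD and key servers depending on its configuration (the address is a placeholder):

    # look up and import a certificate for an email identity
    gpg --locate-keys alice@example.org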

Weather data

Apart from common messaging and file sharing, one of the distributed (or at least federated) system applications I keep considering is weather data sharing: it'd be useful, and it's quite different from those other applications.

Weather data is commonly of interest to people, and it's right out there, not encumbered by patents or copyright laws; it just has to be measured and distributed. But commercial organizations working on that try to extract some profit, so they don't simply share the data with anyone for free. There are state agencies too, paid out of taxes, but at least in Russia apparently you can't easily get weather data out of them either – only a lot of bureaucracy, and even if it were possible, there are many awkward custom formats and ways to access the data, which won't make for a reliable system. People sharing this data with each other would solve that problem.

Though there is at least one nice exception: the Norwegian Meteorological Institute shares weather data freely and for the whole globe. Germany has Deutscher Wetterdienst (with an API), the US has weather.gov, and open-meteo.com appeared recently.

The challenges/requirements also differ from those of messaging or file sharing, since there's a lot of data regularly updated by many people, and potentially requested many times, but confidentiality isn't needed. There already are protocols somewhat suitable for that: NNTP (which is occasionally used for weather broadcasts, just in a free form), DNS, and IRC explicitly aim at relaying; SMTP (with mailing lists) and XMPP (with pubsub) may be suitable too, possibly with ad hoc relaying.
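
To make the NNTP option concrete, a measurement could be posted as a regular plain-text news article; a sketch with made-up group, station, and field names:

    From: station-42@weather.example.org
    Newsgroups: weather.de.berlin
    Subject: observations 2024-05-01T12:00Z
    Message-ID: <20240501120000.station-42@weather.example.org>

    lat=52.52 lon=13.41
    temperature_c=18.3
    pressure_hpa=1013.2
    humidity_pct=54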

For reference, as of 2022 there are about 1200 cities with a population of more than 500 thousand people; individual hourly measurements from each of those would amount to one message per 3 seconds. It wouldn't hurt to have more than one weather station per city, to cover smaller cities, and so on, but the order of magnitude seems manageable even with modest resources and without much caching or relaying, assuming that there are not too many clients receiving all the data just as it arrives.

The links/peering can be set manually, and/or data can be signed (DNSSEC, OpenPGP, etc) and verified by end users with a PKI/WOT; the former may just be simpler, and appears to work in practice.
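
With OpenPGP, signing can be as simple as clearsigning each report before distribution (the file name is a placeholder):

    # sign a report, producing report.txt.asc
    gpg --clearsign report.txt

    # recipients verify it against keys they already trust
    gpg --verify report.txt.asc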

Collaboration/coordination/organization is likely to be tricky, though possible: plenty of people contribute their computing resources to BOINC projects, OONI, file sharing networks, and so on. But weather collection is different in requiring special equipment (at least a temperature sensor) set up outside, which complicates contribution.

Post-quantum cryptography

Many of the protocols mentioned here rely on asymmetric cryptography, which is particularly vulnerable to attacks by a quantum computer, and it seems that at this rate we may have usable quantum computers before widely used distributed systems. Use of symmetric cryptography, or at least cryptographic agility in the protocols, is needed to mitigate that.

Beyond technologies

Primarily technologies are covered here, but non-technical means may be quite helpful as well. Social skills and connections may be more useful for staying connected, and for actually engaging in social activities. A decent government is supposed to help people, rather than be a threat actor, both online and offline. Throw in good ISPs, and a few centralized systems maintained by well-meaning and competent people, and one wouldn't even need any channel encryption for most tasks.

People don't quite work that way though, with governments apparently trying to turn into autocracies, any non-awful ISPs being acquired by awful ones, people in general being prone to mischief, and some of them engaging in crime; so some technical measures are needed, but social and organizational help is important as well.

Additionally, the combination of social connections and relatively basic technologies makes it possible to build friend-to-friend networks, reducing network abuse.

Users

Distributed systems, particularly when used for social activities, require users – so that there would be somebody to send messages to, in the case of an IM. That's quite a problem, since even when sticking to federated protocols it is easy to lose or reduce contact with people.

People in general are capable of dealing with even more complicated and less sensible systems, as digital bureaucracies demonstrate, but apparently they are not motivated enough. I am somewhat interested and motivated myself, yet occasionally, after looking at software with many dependencies, reinventing many parts, and generally going against what I view as good practices, I do not feel motivated enough to try it.

Search in particular is tricky in such systems, though usually some form of communication with strangers and self-organization (e.g., via multi-user chats, web pages) is possible, so that people can find groups with shared interests. Perhaps being sociable is easier and more useful than technical solutions there, too.

See also

Not quite about collaborative protocols like those listed above, but about distributed computing in general (including software design aiming at multiple servers controlled by a single entity), there's a nice paper: "A Note on Distributed Computing".