Networks are a bunch of computers linked up by cables, or some other medium, that lets them send bytes to each other. The connection is often by radio waves – wi-fi, mobile networks, satellites – but for the longest time it was just cables, and cables are still the fastest and most stable way to send data. That's why most terminology is still tied to them – time spent transferring data between machines is called "wire" time, you connect via "ports", and so on.
Communication between machines is very standardised: there is a stack of abstraction layers and "protocols", with a lot of documentation and standards behind them. Most of them are documented in "RFCs" – Requests for Comments. They are really standards, but they were originally written and discussed by committees before there was a real internet, and to be nice to each other they came up with a documentation format that fosters open-ended discussion; it's not telling you what to do, it's proposing something – feel free to comment and we'll come up with something good.
They are traditionally written in plain text format. They are, in effect, community-owned technical documentation about how the internet works. Here's one (defining HTTP/2): https://datatracker.ietf.org/doc/html/rfc7540. Full list here: https://www.rfc-editor.org/ , interesting ones here: https://en.wikipedia.org/wiki/List_of_RFCs. There is a body in charge of it all: https://www.ietf.org/.
The actual technology used really only matters while you are setting it up and thinking about speed, stability, cost and all that. There is a very low-level abstraction layer that all this hardware implements, and in practice the actual medium quickly becomes irrelevant. We still talk about LAN (local area network), WLAN (same but over wifi), WAN (wide area network, spanning larger distances) and a few others … but really, it doesn't matter, certainly not when talking about what you are actually doing with it.
At the hardware layer, every connection to a network is identified by a MAC address (https://en.wikipedia.org/wiki/MAC_address). A weird string of numbers with an actual standard around it that ensures that every individual physical network connection point ever made has a unique value. They can be faked – modern operating systems let you do it yourself as an option, for privacy – but they are pretty robust; the chances of having duplicates on a single network, even if you fake it, are minuscule. They are there for networking hardware to do its stuff, and for MI5 & co to keep track of you. You never really deal with them directly.
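If you're curious, Python will show you one of yours – a quick, hedged sketch (uuid.getnode() returns the machine's MAC as a 48-bit integer, or a random stand-in if it can't find one):

```python
import uuid

# Purely out of curiosity (as said above, you rarely need this directly):
# uuid.getnode() returns a MAC address as a 48-bit integer; format it as the
# familiar colon-separated hex pairs.
mac = uuid.getnode()
print(":".join(f"{(mac >> shift) & 0xff:02x}" for shift in range(40, -1, -8)))
```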
This layer is called the "Link layer" in geekspeak (also in documentation and tutorials). The layer below it – the actual wires, cables in oceans, radio networks, network cards … actual tangible stuff – is called the "Physical layer". You never really think about the physical layer, but that is where the money and the control both come into it. Somebody owns and somebody controls every part of it; governments, corporations and such. It ain't free in any meaning of the word.
Software communication layers sit immediately above the link layer, implemented by operating systems (often with lots of hardware support, but that's just for speed). They are far more interesting. On top of them sit, finally, distributed computing, the internet, clouds, and all the other stuff that the world runs on.
There is so much detail to it, you can spend a lifetime just reading about it; people do and then they find joy in throwing obscure terminology at others and feeling important.
There are loads and loads of tutorials, explanations, videos, articles … search for "tcp ip". I can't recommend any single one; but it is definitely worth digging through a few after you read this, to get your ears used to the lingo at least. (seriously, it will help)
We won’t spend time on technical detail, just general principles, terminology, and an understanding of how it all gels together. Mostly because it is really boring otherwise. Also because we do not find pride in memorising massive amounts of detail, but are instead seeking the satisfaction of enlightened thinking and understanding.
Before we dive into it, here’s a good picture of the layers that I stole. It’s missing the physical layer – most pictures do, coz it ain’t pure technology, and it’s depressing – but that’s ok. Come back to it later, it will make sense then.
TCP/IP
All we really want to do is transfer some bits from one machine to another. During the transfer, we don’t care what these bits are – the sender will encode their data into them, the receiver will decode it and do its thing.
This in itself is not actually too hard once the hardware is in place; what is really hard is making it practical. Firstly, there are a lot of computers, and they are all over the world; they need to find each other, and on each computer there may be many connections to other ones, or many logically separate connections to the same one. Then there is the dirt of the real world – interference, lines breaking, eddy currents and electrical resistance in wires, bugs, storms, power cuts, and humans – all making the connections unreliable. Finally there is speed; this is the slowest part of the hardware, and there is little control over the speed of individual machines in the vast network of nodes (you can make your computer as fast as you like – if it doesn’t get the data to work with, it’s just a heater that needs to be cooled down).
We analysed this in great detail, broke it down into layers and use cases and came up with “protocols” for each. There are lots of these protocols, the whole suite is called TCP/IP, after two of the most used protocols in the mix. A protocol is a specification of messages that need to be exchanged to establish the connection between machines, negotiate what is to be sent, how much of it, in what way, and so on; together with the layout of a “data packet” for each.
A data packet is a data structure containing the data to be sent and metadata about it. Each layer only cares about its own metadata and the size of the contained data; the contained data in turn carries the metadata and data load for the next layer, which often contains metadata and data for the layer after that. We will look into it more, but the crux of the design is that the layers are software; each deals with its own set of problems, solves them and delivers a data load to the next one, without any knowledge of what the next one is or how it interprets the bits in that load. At the very top are the applications that interpret their data as music, HTML, porn, code, anything you like. At the very bottom are pieces of hardware connected through some physical medium, sending signals that encode 1s and 0s.
Along the way, a packet gets unpacked, interpreted and resent a number of times, but only to the relevant layer, as we route it across the network.
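A toy sketch of that nesting, with completely made-up header layouts (real TCP/IP headers carry far more), just to show the mechanics of each layer wrapping the one above:

```python
import struct

# Toy illustration of encapsulation: each layer prepends its own metadata
# (a header) to the payload handed down by the layer above. The field
# layouts below are invented for clarity, not the real header formats.

def application_layer(message: str) -> bytes:
    return message.encode("utf-8")                       # just bytes to everyone below

def transport_layer(payload: bytes, src_port: int, dst_port: int) -> bytes:
    header = struct.pack("!HHI", src_port, dst_port, len(payload))  # ports + length
    return header + payload

def network_layer(segment: bytes, src_ip: bytes, dst_ip: bytes) -> bytes:
    return src_ip + dst_ip + segment                     # 4 bytes each, IPv4-style

def link_layer(packet: bytes, src_mac: bytes, dst_mac: bytes) -> bytes:
    return dst_mac + src_mac + packet                    # 6-byte MACs, Ethernet-style

frame = link_layer(
    network_layer(
        transport_layer(application_layer("hello"), src_port=49152, dst_port=80),
        src_ip=bytes([192, 168, 0, 12]), dst_ip=bytes([93, 184, 216, 34])),
    src_mac=bytes(6), dst_mac=bytes([0xff] * 6))
print(frame.hex())   # what actually goes down the wire, layer by layer
```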
Network Topology/Routing
Network topology is a smartass way of saying “the layout of cable connections”.
Originally, the wiring was direct, each computer to each other one. That didn’t last long, of course. We quickly realised that it’s a lot better to route data packets from one computer to the next (in hops) between the source and the destination. There are various documented ways to do this now – rings, stars, whatnot … this is network topology.
In reality it’s all laid out the same way these days. We use routers. These are specialised servers, often built specifically for the purpose (like the one in your house), in almost all cases running some version of Linux, with lots of physical ports to plug cables into, or a radio to do the equivalent for WiFi.
When you connect your computer to a network, it sets up a connection with the router. Either it sends a signal down the medium it’s using (wire or radio) asking “yo, where art thou, router?”, or it picks up a message that the router broadcasts continuously about itself. They then talk a little, exchange their MAC addresses to introduce themselves, the human-given names for each, details of the network speed, supported protocols, stuff like that (not sure about the details). Finally, if they agree, the router assigns an IP address to your computer – four numbers like 192.168.0.12 (a protocol called DHCP handles this part). These are called IPv4 addresses; there is a newer IPv6, which looks weirder but can support a lot more connected machines.
The router itself is connected to another router, and that one to another one, and so on, up to the big backbone routers run by the large carriers, all connected to each other with plenty of redundancy. The addresses given out are broken down along these layers – your router is given an address by the network above it, and that address is unique on that part of the network, but another router can give out the same one on its own sub-network. Similarly, your computer’s IP is only unique on the network you are directly attached to (in your home, or in the office). There are conventions about how IP addresses are assigned, and internet-wide there are far fewer public IPv4 addresses than there are machines. If you go online and search for something like “what’s my IP address”, what you will see is the address of one of the routers in the chain, something pretty near you – the same address will show up for anything in your home, and with some providers possibly even your neighbour’s home. And this is not the same address you would see if you ask your computer “what’s your IP address”.
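You can see both from Python. The first trick asks the OS which local address it would use to reach the outside world (no traffic is actually sent); the second asks a public "what is my IP" service – api.ipify.org is one of several, swap in any other you like:

```python
import socket
import urllib.request

# The address your machine reports (only unique on your local network):
# opening a UDP "connection" to a public address makes the OS pick the
# right local interface; no packets are actually sent by connect() here.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))
print("local IP: ", s.getsockname()[0])
s.close()

# The address the outside world sees (really one of the routers between
# you and the internet).
print("public IP:", urllib.request.urlopen("https://api.ipify.org").read().decode())
```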
This is all part of the IP protocol in the network layer. It is active throughout your connection to the network. When you send or receive data, the messages – and eventually the data payload – travel from one router to another, with just enough metadata to reach the destination; every router along the way will unpack that part, look up the tables it maintains to figure out the best route to the destination, package it up again, and send it to the next router.
There are other protocols in the network layer, all to do with establishing and maintaining connections and sending data down the line. ICMP is what happens when you ping a computer: you type the command “ping 124.87.12.1” and it sends a small packet to the computer with that address, which in return sends one back. The round-trip time is measured and a few of these cycles happen. Super useful for checking if a computer is there, if the network works and so on. There are others; no idea what most of them do, but it’s always easy to find out if you need to.
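Crafting raw ICMP packets yourself needs admin rights, so the usual shortcut from code is to just call the system's ping command – a hedged sketch (the flag for "number of packets" differs between Windows and everything else):

```python
import platform
import subprocess

# Shell out to the system ping rather than building raw ICMP packets,
# which would require elevated privileges. 8.8.8.8 is just a reachable
# public address used as an example target.
host = "8.8.8.8"
count_flag = "-n" if platform.system() == "Windows" else "-c"
subprocess.run(["ping", count_flag, "3", host], check=False)
```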
Transport Layer
The network layer deals with establishing connections – machines finding each other and being able to send something to each other. The transport layer is all about the data itself, still without caring what the data actually means.
The way it works is that whatever data you are sending, it is split into fairly small chunks – 64KB or less. These are then indexed and sent separately. On the receiving end, they are most often reassembled and passed on to the next layer, or straight to your code. More recently, there are a lot of use cases where they are processed immediately and things are done with them while other chunks are still arriving – this is streaming.
Either way, what matters is that each packet is sent and received, hopefully in order, though that is not guaranteed. The most commonly used protocol for this is TCP. It includes lots of back-and-forth communication to ensure consistency, because packets do get lost for various reasons; within TCP, receivers confirm receipt and can ask for a packet to be resent. Another popular one is UDP; the major difference is that it doesn’t include confirmations. Sometimes they are not necessary, either because there is a lot of redundancy or confirmations are built in another way, or because it simply doesn’t matter. UDP is used very often to broadcast information – when you send a message to anyone on the network who wants to listen.
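Here's a minimal, hedged sketch of UDP's fire-and-forget nature in Python – no connection setup, no acknowledgement (ports and sockets are explained just below; 50007 is an arbitrary choice):

```python
import socket

# One UDP datagram sent and received on the same machine: no handshake,
# no confirmation, no guarantee of delivery or ordering.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 50007))

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"chunk 1 of some larger payload", ("127.0.0.1", 50007))

data, addr = receiver.recvfrom(65535)   # max bytes we are willing to read
print(f"got {len(data)} bytes from {addr}")
sender.close()
receiver.close()
```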
This layer also introduces the concepts of “port” and “socket”.
A socket is a software abstraction: you write code that sits and waits to receive queries via TCP or UDP, and the endpoint where the data flows in is called a socket. Different operating systems and different languages implement these very differently; it is a very pure abstraction.
There could be many connections open on a single computer doing different things – one for the web, one for a webex call, another one for a particular web based interface to some special functionality on the server and so on. This is handled by adding a field to the IP address of the machine – a port.
Once a TCP packet arrives at a machine, it is then also directed to a particular port – the operating system maintains a list of “open ports”, ports where some application is expecting to receive messages. Some are standard. For example, for HTTP it’s 80 (with 8080 as a common unofficial alternative); you don’t see it because it is assumed, but the actual address of an internet page looks like 84.192.0.12:80. For encrypted HTTP it’s 443, so whenever you have https:// the address gets translated to something like 84.192.0.12:443.
Most often in code, opening a socket means telling the OS that you await messages on a certain port (there’s other things to do, but that is the key point).
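In Python, the bare-bones version looks something like this (port 9000 is an arbitrary, unprivileged choice):

```python
import socket

# A bare-bones TCP server: tell the OS we want to listen on port 9000,
# accept one connection, echo back whatever arrives. Frameworks do exactly
# this under the hood, with far more care and far more options.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 9000))    # all interfaces, port 9000
    srv.listen()
    conn, addr = srv.accept()      # blocks until a client connects
    with conn:
        data = conn.recv(4096)
        conn.sendall(b"echo: " + data)
```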
People still do this manually sometimes, but it is messy, boring, and repetitive. There are frameworks and libraries that deal with all this for you. It is rarely worth diving deeper, generally only when you need to start optimising – and even that is rare. Over the years the technology got quite good at optimising these parts for you; you will generally do much better optimising your data loads, making them smaller. I never had to look into sockets – only implemented this at low level once, for fun. Here’s a blog with more details; the pictures are really the interesting part, to give you some feeling for what these packets look like: https://www.techrepublic.com/article/exploring-the-anatomy-of-a-data-packet/.
What you do have to do at times is look into configurations; both software and hardware did get good at optimising, but it is not automatic – there is a plethora of options and some hardcore engineering involved in keeping it all going, fast, and reliable. It’s expensive to do.
One way around it is to trade quality for scale – it’s ok if a larger percentage of communications fail, or even nodes on the network, or if they are individually slower, as long as you can build in enough redundancy and scale: instead of five super stable and super fast computers, use twenty dirt cheap ones; distribute the work more, duplicate some percentage of it. Scale it up enough and it starts being a lot cheaper and easier to run; you can automate much of it, you can provide better services for when and where they are actually needed, rent it all out by the microsecond. This is what clouds are. We’ll talk more about it.
Application Layer
Here we finally arrive at humanity. This layer standardises commonly used applications of the layers below it. Here are the protocols and formats for email (POP3 and SMTP), the web (HTTP, HTTPS), file transfers (FTP), encrypted communication (SSL/TLS), human-readable web addresses (DNS), and a whole lot more.
These all talk either about the content of the data part of TCP and UDP packets, or about the business side of the internet, for example:
- DNS is all about how to translate autotelica-fineng.com into something like 89.123.0.101 (the port is not part of DNS; that comes from the protocol or the URL). There are DNS servers all along the routes – a hierarchy from the root servers down to your ISP’s resolvers (or public ones like Google’s 8.8.8.8) – holding tables of these mappings that get updated and cached as domains change. (There’s a small sketch of a DNS lookup and an HTTP request after this list.)
- HTTP describes a list of types of requests that a web server can handle and how to deal with them. Things like:
- GET, to get data, download a webpage, or a file, or something
- POST, to send data to the server (mostly form submissions, but anything with a sizeable payload)
- PUT, DELETE and so on
- Nice documentation here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods
This is actually useful to read more about.
- HTTPS adds a layer of encryption (TLS) on top of HTTP
- FTP, not used so often but still there, is all about moving files from one machine to another.
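To make this feel less abstract, here's a tiny sketch of both a DNS lookup and an HTTP GET in Python, using only the standard library (example.com is just a placeholder domain):

```python
import socket
import urllib.request

# DNS in one line: ask the resolver chain to map a name to an IP address.
print(socket.gethostbyname("example.com"))

# An HTTP GET: urllib builds the request, opens the TCP connection on
# port 80/443 and hands back the response.
with urllib.request.urlopen("https://example.com") as resp:
    print(resp.status, resp.headers["Content-Type"])
    body = resp.read()
print(len(body), "bytes of HTML")
```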
This is what most coding happens against. Generally there are libraries that wrap the data packaging and un-packaging, implement some form of messaging to deal with requests and responses, and provide some kind of abstraction layer to deal with it all. They are super useful – you can have a webserver up and running in Python in an hour or so. But what makes it all tick is standardisation: these layers are implemented many times over, for different OSs, different requirements of speed and stability, different languages.
REST & Microservices
An important methodology on top of this layer (specifically on top of HTTP methods) is REST (Representational State Transfer). It’s not really a standard; it’s more of a self-help guide on how to be better.
HTTP specifies a number of methods and their operations, with the idea that the method names will suggest their best use, and with hints on how best to use them; but there are no restrictions insisting that you follow this. Which is fine; it does sometimes make sense to update server databases using a GET method instead of a PUT, it just makes interactions nicer. REST is an attempt to restrict that – it basically sets out very clear rules about what each HTTP method is to be used for, and that’s it; if you don’t follow them, you’re bad.
It’s based on one guy’s PhD thesis (Roy Fielding’s), and it is good work. When it is applicable, you flag your server as being RESTful and it helps clients write code to use your server.
However … 1. I am yet to see a real-life service that is fully RESTful; 2. there is so much noise about it that pretty much everything that uses HTTP methods to communicate with servers calls itself RESTful. That’s ok, no need to be an asshole about it, but you do need to be aware that the term is used more loosely than intended, often by people who don’t realise they are misusing it.
Another term that keeps flying about is “microservices”. This is really very poorly defined, and it gets merged with REST because the two do work nicely together.
The idea is that you write small components, little services that only do one thing, that are stateless, and you put many instances of them on the network (cloud). They are small and dumb, no GUI or anything, they just receive commands in the form of HTTP methods (GET, PUT, POST, DELETE etc). It is meant to be the Unix philosophy for distributed computing.
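As a concrete (and hedged) illustration, here's roughly what such a service might look like in Python using Flask, one of many web frameworks; the resource name and the in-memory "database" are made up:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
trades = {}   # stand-in for real storage; a truly stateless service would not keep this locally

@app.route("/trades/<trade_id>", methods=["GET"])
def get_trade(trade_id):
    # GET only reads; it never changes anything on the server
    if trade_id not in trades:
        return jsonify({"error": "not found"}), 404
    return jsonify(trades[trade_id]), 200

@app.route("/trades", methods=["POST"])
def create_trade():
    # POST creates a new resource from the request body
    payload = request.get_json()
    trades[payload["id"]] = payload
    return jsonify(payload), 201

@app.route("/trades/<trade_id>", methods=["DELETE"])
def delete_trade(trade_id):
    # DELETE removes a resource
    trades.pop(trade_id, None)
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```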
Truth be told, when you can fit your design into it, it is great. It makes building complex systems easier, there’s a natural split of responsibilities, no code dependencies, APIs are language agnostic (they all use just HTTP); all cloud technologies work great with them too.
It isn’t always easy though; or worth it.
Let’s talk about clouds a little first, we’ll come back to this.
Clouds
Traditionally, distributed computing is expensive. You need machines that sit somewhere cool, with a lot of power and good wiring; clever software to receive tasks and monitor them; careful setup so the machines can actually perform the tasks you ask of them; APIs and GUIs to monitor it all; servers to coordinate it all; and a team of people to keep it going.
On top of that, repurposing a server is hard; it has to have all the software it needs to run your task installed on it. It has to have the right versions of it and all the libraries it depends upon. Often there are builds of software specifically for servers like this, and entire test cycles specifically for them. Often they start failing or producing spurious results because someone installed something that overwrote one of the libraries you needed, or the update of your own ones failed on one machine when you were releasing a new version. All sorts of problems.
Then there is integration of your software and processes – all this is controlled by some third-party software that looks after the working servers’ liveness and load, queues up tasks and prioritises them, re-runs them according to whatever policy you configure, and so on. You have to work with this software. A well-known solution in finance is DataSynapse, but there are others.
Enter containers and clouds.
They work together, to make the world a better place and to make google, microsoft and amazon richer. They do a pretty good job of both.
Firstly, containers. The most dominant technology here is Docker. There are quite a few others, and quite a few are actually used, but Docker is the daddy.
What happens there is that you are given tools to fairly simply create something like a lightweight virtual machine, tailored exactly to your needs. (Strictly speaking a container is not a virtual machine – it shares the host’s kernel rather than emulating hardware – but it behaves like one for our purposes.) You can’t control the hardware part, but everything from the operating system upwards you can. Then you package the whole thing into a single file, a binary image of that environment. This file is called … an image.
Now you can take that image and “run” it – basically like booting up the virtual machine, then telling it to launch a particular application; and that application is the one you want to be executing. Once it’s running, the running instance is called a container.
If you have eight CPU cores on your computer, you can comfortably run eight instances of this side by side.
You can upload your image to a repository, and you can run 100 instances of it on a farm of computers.
The container is very well isolated, and so is the computer it runs on; security is built into every step of it.
Most often, image sizes are less than 100MB; a GB is considered rude. This is easy with Linux because it is so modular – you only include those parts of the OS that you actually need.
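If you'd rather drive it from code than from the command line, the Docker SDK for Python (the "docker" package) can do the same thing – a hedged sketch, with a made-up image tag:

```python
import docker

# Talk to the local Docker daemon, build an image from the Dockerfile in the
# current directory, and run one container from it. Most people do the same
# with `docker build` and `docker run` on the command line.
client = docker.from_env()
image, build_logs = client.images.build(path=".", tag="my-little-service:latest")
container = client.containers.run("my-little-service:latest", detach=True)
print(container.id)
container.stop()
```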
At this point you have:
- A complete installation of all the software needed to run your service
- It is replicable – you just copy images around, if it runs on one machine, it will run the same on all others (ok, not 100% true, but pretty close)
- You can run it locally and test all you like; you can even run it and do your development within that environment, as if you had logged into a remote machine.
- You can handcraft it and optimise it to your heart’s content.
- It’s a binary image, but it is built from a kind of a script (a Dockerfile), with simple and clear syntax and support for making things modular too – it is like source code, completely trackable.
- All you really need to run the binary image is an installation of Docker, and there is one for every OS.
It’s really very good.
Now, you just need somewhere to run this on – a cloud, a bunch of computers with Docker installed. Either you build your own – just buy a bunch of computers and hire a team of people to maintain them (people even build their own at home from a bunch of Raspberry Pis) – or you rent compute time, by some fraction of a second, from someone who has one (basically Google, Amazon, or Microsoft).
Either way, there will be another layer of software to manage it all. Kubernetes is a free one, and the most popular, but there are others, and in particular the big vendors have their own. Regardless, they all do more or less the same thing:
- You tell them to run your image, and they do
- You tell them how many instances you want, what minimal spec machine you need (how much RAM, how fast, including powerful GPUs etc); and they go and find the machines that suit and launch your image.
- They also monitor your containers and if they crash, they spawn new instances.
- If they get busy, say using more than a certain threshold of memory or CPU, they spawn some more instances.
- If they are lazing about, they could kill them and spawn new ones when needed; or they can just move them to another machine if they think it best.
- All manner of other things to keep it all going.
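For a feel of what talking to such a system from code looks like, here's a hedged sketch using the official Kubernetes Python client (the "kubernetes" package); the namespace and deployment name are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()     # reads ~/.kube/config, the same credentials kubectl uses
core = client.CoreV1Api()
apps = client.AppsV1Api()

# What is currently running in the "default" namespace?
for pod in core.list_namespaced_pod("default").items:
    print(pod.metadata.name, pod.status.phase)

# Ask for 5 instances; Kubernetes will spawn or kill containers to match.
apps.patch_namespaced_deployment_scale(
    name="my-little-service", namespace="default",
    body={"spec": {"replicas": 5}})
```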
There are standardised tools to keep track of logs across all these machines (the ELK stack is a common one; lots of tutorials, e.g. https://logz.io/learn/complete-guide-elk-stack/). They are made super simple to use – your code writes logs to standard output, the cluster collects that and ships it to a central logs database for you, and you then get mighty powerful ways to search, analyse, visualise and do statistics on them.
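From your code's point of view, following that pattern takes almost nothing – for example, in Python, something along these lines (one JSON-ish object per line is a common convention, not a requirement):

```python
import logging
import sys

# Log to standard output; the cluster's log shipper does the rest.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    '{"time": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s"}'))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("priced trade %s in %.1f ms", "T-123", 4.2)
```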
There are standardised ways to set up services that distribute REST/HTTP requests to any number of instances that sit behind them (ingress), so that you can write a simple service that deals with requests, and between the ingress and Kubernetes the scaling is handled for you – in quiet times you run 1, in busy times 100; users only ever see one IP address and a port (that of the ingress service), and the ingress sends each request on to one of the instances that Kubernetes spawned for you.
Data is a problem – both security and transport times can get messy. There are standardised ways to deal with that too.
The pattern where you write simple services that come up and down and deal with requests quickly is very popular (microservices), obviously, but to make it work properly it is best if these little services are stateless – meaning nothing changes internally between two requests: if you make the same request twice, you are guaranteed to get the same response. Even caching locally is frowned upon, though you can get away with it. This helps guarantee simplicity and scalability – if it is all stateless, it doesn’t matter at all whether your request arrives at a new instance or an old one.
The trouble is, it can mean that you need to go elsewhere for your data all the time – to price a trade you need lots of data from a bunch of databases, and in a traditional setup you would just pre-load this into your service and happy days; with stateless microservices you can’t do that. It sounds like a limitation, but it isn’t really, just good design; you can ignore it if you like, but life is easier if you don’t. So what do you do? You build distributed network caches for your data. Not as good as having it on your machine, but if you are just a little bit careful about your data sizes, and make these caches real fast, it’s pretty good.
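A hedged sketch of that pattern, using Redis as the example network cache (the "redis" package); the hostname, key naming, and the slow "load from database" path are made-up stand-ins:

```python
import json
import redis

cache = redis.Redis(host="market-data-cache", port=6379)   # a shared, network-hosted cache

def load_curve_from_database(curve_id: str) -> dict:
    # stand-in for the real, slower data source
    return {"id": curve_id, "points": []}

def get_curve(curve_id: str) -> dict:
    # Try the fast shared cache first; fall back to the slow path and
    # populate the cache for the next instance that asks.
    cached = cache.get(f"curve:{curve_id}")
    if cached is not None:
        return json.loads(cached)
    curve = load_curve_from_database(curve_id)
    cache.set(f"curve:{curve_id}", json.dumps(curve), ex=300)   # expire after 5 minutes
    return curve
```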
There is an entire world of solutions trying to address some of the common issues of running things this way, or to automate workflows between microservices, some better than others, but plenty to choose from.
It is a major change of paradigm though. Writing new stuff for it isn’t hard at all – a bit of training, some boring documentation, some tutorials, and support from a good DevOps team (haha, I know) – but in principle it’s ok.
What is hard is migration of existing systems. You are completely shifting focus from something that benefited from being monolithic (because if you dump a monolith on a server, you have a better chance of everything being in sync), to something that is as granular as you can make it.
Another hard thing is the skillset – with all the beauty of cheap computing and promises of automation, DevOps people are still incredibly employable. It ain’t all that automatic after all. There is always an embarrassment of problems – secure and fast access to data, generic equivalents of RAID servers, needs for special features like GPUs or large RAM – and there are a number of solutions to each that purport to standardise all of this. By and large, these are packaged into competing and often incompatible, mostly half-baked, solutions that require specialist training to even understand, not to mention configure and maintain.
In the fullness of time that will work itself out – it’s just a new division of labour – but it takes time; boundaries are not clear yet, and needs are not quite understood across team boundaries. It can be done though; I know of at least one project that looked into this for quant needs and it worked quite well, and some are being built in various start-ups.