Over time we developed layers upon layers of abstractions to be able to make programming practical.
At the lowest level is the physical hardware, the electronics. This differs from one computer model to the next, from one type of CPU to the next. It’s a nightmare. These days you very rarely need to even think about it – only when working on specialised hardware, like the lamp with pretty changing lights, or when you are talking about insane-level optimisations.
The next level is often ignored and conflated with either the one above it or the one below it, but it is the most commonly used one in conversation. It’s about hardware architecture in abstract terms. E.g. every computer will have RAM, a CPU, a bus – in each actual computer these are different chips or wires, working at different speeds and at different temperatures. This is where we will spend most of the time in this write-up.
The operating system sits on top of this. Windows, Linux etc. You interact with the computer through it, both in daily use and while writing code. It talks about things like memory, disks, threads, files, apps and so on. It is a piece of software that translates these concepts to actual hardware operations, generally via “drivers” – a driver is a plugin for each specific type of hardware; while all printers print, each has its own interfaces and technical capabilities; same for keyboards, graphics cards … Drivers are written and provided by manufacturers; these days they just register them with whoever owns the operating system and they get installed for you. The really big deal with USB was the standardisation of this communication – the shape of the plug for sure, but the actual communication protocols matter far more than that. Operating systems hide the machinery from you and replace it with abstract concepts that we can talk about.
Then come programming languages. They are ranked according to how high-level they are; this is really about how much of all the stuff above they abstract away and translate into concepts recognisable in human language. C is very low level, C++ a bit higher, Java a bit higher still, Python quite high, functional programming languages like Haskell very high; languages developed to deal with specific domains (DSLs) higher still.
Physical layer
This is the most boring part; the reason is that marketing has been telling you for ages that you don’t need to know about it, but that’s bullshit. We’re not going to go too deep, no point, but we will build a story about what is happening. There is too much detail in this section, but it will be a good reference later.
Why are digital computers so popular?
Because of flip-flop gates and logic gates. These are the electronic components at the core of it all. It turns out it is (relatively) easy and cheap to make a component that acts like a switch: it gets turned on or off automatically by electricity, stays in that state, and you can read that state automatically. This is a flip-flop gate. In reality, they need to be refreshed – given a bit more electricity every nanosecond or so – so that they don’t forget where they were. Now, take the on state and call it 1, take the off state and call it 0 – each flip-flop gate represents a bit. Eight make a byte and so on. With this, we can store data and read it whenever we like, just by sending some electrons around. All we need to figure out is how to encode all our data into this format. (later)
Logic gates perform boolean logic functions. E.g. a NOT gate has one input and one output: when there is electricity arriving at the input, there is none at the output, and vice versa. An AND gate takes two inputs and has one output: the output is only on when both inputs are on, etc. There is a whole lot of detail and we can talk about it, but the core idea is that over the course of the 19th and 20th centuries we figured out you can do all logic and arithmetic using just such building blocks.
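To make the “logic gives you arithmetic” point concrete, here’s a toy sketch in C (my own illustration, not how any real chip is wired): a half adder that adds two single bits using nothing but gate-style operations. Chain enough of these together and you get full multi-bit addition, which is roughly what the circuitry in a CPU does.

    #include <stdio.h>

    /* Half adder: add two single bits using only logic operations.
       sum = a XOR b, carry = a AND b. */
    static void half_adder(int a, int b, int *sum, int *carry) {
        *sum   = a ^ b;  /* XOR gate */
        *carry = a & b;  /* AND gate */
    }

    int main(void) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                int sum, carry;
                half_adder(a, b, &sum, &carry);
                printf("%d + %d = carry %d, sum %d\n", a, b, carry, sum);
            }
        }
        return 0;
    }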
There are other technologies that we came up with to store data long term. The first was using paper cards (a card is split into lines, each line into cells, a hole in a cell means 1, no hole 0), then magnetic tapes (like cassettes for music, you record a little thing for 1, another for 0, then read them serially), hard disks (like tapes only round, hard, much faster, and more durable), CDs (shiny surface is split into cells, a dent in the surface disperses the light, no dent reflects it back – 1s and 0s), ROM and EEPROM (like flip flops, but don’t have to refresh them, they remember), solid state drives (like EEPROM only faster – USB sticks and SSD drives; very fast, but fail more often than hard disks).
The key point is – we can send electrons around, little bursts of electricity, change the state stored in silicon, and take actions based on combinations of outputs of logic gates. This works both internally and externally – within the computer, over wires to keyboards, screens etc, and over longer wires for networks. We just do this all the time – send little pulses of energy around.
Hardware architecture layer
The details of the actual hardware vary a lot, but in general they all operate like this.
Internally in the computer, we bunch all the flip flop gates together on a few chips – this is RAM. It is very fast, but needs power all the time to remember what it has.
We keep long term storage separately; stuff that doesn’t disappear when you switch off power – these are drives, hard disks or SSD (SSD is much faster than hard disk, but less reliable, both are waaaay slower than RAM). Painting things needs lots of maths, so we build separate little computers that do just that real fast – graphics cards. Most logic execution happens on the CPU, a separate chip that runs the whole show – these days they are built with multiple “cores”, basically 4, 8, 16 etc separate little CPUs all running in parallel, like having multiple computers all using the same memory, disks etc.
We then send little pulses of electricity, often single electrons, along lines on a printed circuit board, a bunch at a time. We call those lines the “bus”. We parallelise this for speed: 8 lines = 8-bit computer, 16 lines = 16-bit computer etc. (In fact, you often have separate busses for separate things, but that’s an implementation detail.)
RAM is viewed by the rest of the computer as a massive array; each cell contains a single word – 8 bits for an 8-bit computer, 16 for 16-bit etc – the width of the bus determines this; you can read one word at a time because you can send it in parallel down the bus. To access a single cell in RAM, you use its index in the array – this is a memory address, or “a pointer” in most programming languages.
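In C this maps onto the language almost directly: a pointer holds an address and you read or write the cell it points at. (Strictly speaking the address you see is a virtual one handed out by the operating system, not the raw RAM index – more on that when we get to paging – but the idea is the same.) A quick sketch:

    #include <stdio.h>

    int main(void) {
        int x = 42;
        int *p = &x;   /* p now holds the address of x - its index into memory */

        printf("value at that address: %d\n", *p);   /* follow the address */
        printf("the address itself: %p\n", (void *)p);

        *p = 7;        /* write to the cell via its address */
        printf("x is now %d\n", x);
        return 0;
    }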
All components have to be able to read and send data in the same parallel width for this to work. Generally they settle on the narrowest width available, but translation layers exist where widths don’t match.
Also, these pulses can’t just go around willy-nilly; components read them at set time intervals. For example, RAM chips sold as 3200MHz read all the lines on the bus 3,200,000,000 times a second (not quite – more like half of that, it’s marketing). There are a bunch of “clocks” in each machine to initiate these actions, little crystal oscillator components that pulse at very precise intervals – each pulse triggers a read, or a write, or something like that, just kicking off whatever the component’s next action is. Many components have their own clocks built in, many piggyback on the CPU clock.
The CPU itself is generally the fastest component by far (no point in other things being faster if the CPU can’t keep up). Internally, it has its own layout – a tiny version of its own bus, a few subcomponents, the clock, and, super importantly, registers.
Registers are incredibly fast, tiny memory cells. One word each (8 bits for 8-bit computers, then 16, 32, 64). There are anywhere between 10 and 30, depending on the chip. Think of them as variables. Some have a special purpose: keeping track of which instruction in the code you are at (that one is called the Instruction Pointer), or where the stack is (the Stack Pointer, later). Some are general purpose – you can stick any number you like in them, increment it, decrement it, stuff like that. Most computation ultimately happens using these. In C and C++ you can tell your compiler to keep a variable in a register if you know it’s going to be used a lot (modern compilers figure this out for you really well).
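For the curious, the C hint looks like this – treat it as a curiosity rather than an optimisation, since modern compilers mostly ignore it and make better decisions on their own:

    /* Ask the compiler to keep the loop counter in a CPU register.
       You can't take the address of a register variable - it may never
       live in RAM at all. */
    long sum_first_n(long n) {
        register long i;
        long total = 0;
        for (i = 0; i < n; i++)
            total += i;
        return total;
    }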
CPUs themselves are built with circuitry that does basic commands – add one to a number, check if a number is zero, compare two numbers, if a number is zero skip a bunch of commands and continue from another place, etc. These commands are encoded in pure binary, but are translated into short mnemonic words for human consumption – this is assembly language. It differs between processor families (all Intel and AMD x86 processors use the same assembly, for example). The binary (numerical) encoding for each command is called an op-code.
CPU also has a whole lot of pins – this is how it sends signals to every other component in the machine.
To execute a single command a CPU does a bunch of steps, one on each tick of the clock. For example, to add two numbers, assembly would look something like:
ADD #FF3812, #FF3825
The first part is the instruction (ADD), the other two are places in RAM where the parameters are stored. The whole thing is itself stored in RAM (this approach is called the von Neumann architecture), so we need three words to store this instruction: one for the instruction op-code and one for each address. There is a special register called the Instruction Pointer – it tells the CPU where the next instruction is in RAM.
- Read the instruction pointer to find out where the next command is in RAM.
- Fetch the command op-code from RAM.
- Increment the instruction pointer by three to point to the next command for later (because this command is three words long in total).
- Realise that this command is ADD and it needs two numbers.
- Fetch first number from RAM.
- Put that number into a register.
- Fetch next number from RAM.
- Put that number into another register.
- Add the two numbers together.
- Put the result into the first register.
- Done, whatever instruction comes next, it can read the result from the register.
They call this the Fetch-Execute-Write cycle.
There are other instructions (JMP – jump) that would just put a new address into the instruction pointer. This is how GOTO statements work.
There are yet other ones that will check what value is stored in a register: if it is 0 they will do nothing, if it is anything else they will do the same as JMP. This is how IF statements work. (Perform a logic computation, store the result in a register; the next instruction is this conditional jump, so if the computation returned TRUE (anything other than 0), we skip a bit of code.)
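If you want to see the whole loop in one place, here’s a toy “CPU” in C – made-up op-codes, RAM as an array, an instruction pointer, and a conditional jump doing the IF trick described above. Real CPUs are wildly more complicated, but the shape of the loop is the same:

    #include <stdio.h>

    /* Made-up op-codes for a toy machine - not any real instruction set. */
    enum { OP_HALT = 0, OP_ADD = 1, OP_JMP = 2, OP_JNZ = 3 };

    int main(void) {
        /* "RAM": the program and its data live in the same array
           (the von Neumann bit). The data sits in cells 20, 21, 22. */
        int ram[32] = {
            /* cell 0 */ OP_ADD, 20, 21,  /* ram[20] = ram[20] + ram[21]     */
            /* cell 3 */ OP_JNZ, 22, 9,   /* if ram[22] != 0, jump to cell 9 */
            /* cell 6 */ OP_ADD, 20, 21,  /* otherwise add again             */
            /* cell 9 */ OP_HALT,
        };
        ram[20] = 5; ram[21] = 7; ram[22] = 1;  /* 1 = "the condition was TRUE" */

        int ip = 0;                        /* the Instruction Pointer register */
        for (;;) {
            int op = ram[ip];              /* fetch the op-code from RAM       */
            if (op == OP_HALT) break;
            if (op == OP_ADD) {            /* decode and execute               */
                int a = ram[ip + 1], b = ram[ip + 2];
                ram[a] = ram[a] + ram[b];  /* write the result back            */
                ip += 3;                   /* this instruction is 3 words long */
            } else if (op == OP_JMP) {
                ip = ram[ip + 1];          /* GOTO: overwrite the pointer      */
            } else if (op == OP_JNZ) {     /* the IF trick                     */
                ip = (ram[ram[ip + 1]] != 0) ? ram[ip + 2] : ip + 3;
            }
        }
        printf("result: %d\n", ram[20]);   /* 12 - the jump skipped the second ADD */
        return 0;
    }

Change ram[22] to 0 and the jump does nothing, the second ADD runs, and you get 19 instead – that’s your IF/ELSE.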
There is an important concept of a stack in there. It’s just a bit of RAM allocated to your program at the start of execution. It is accessed in a special way – you put a number on it and then that number is on the top of the stack, you push the next one on, that number is then the top of the stack. The special thing is that there is no indexing, you can only ever read the value that is at the top, the last one that went in. (last-in-first-out – LIFO they call it). Once you read the number from the top, you remove it (pop it). Stack only has three operations: push, top, pop. There is a special register on the CPU that keeps track of where the top of your stack is in RAM (called Stack Pointer). Simply the address of the cell at the top. You push something on the stack, it moves up by one; you pop something it moves down by one. The reason this inconvenient thing is used so much is because there are no calculations of indexes involved – you only need one value to be able to read or write it, and you keep that value in a register. It’s super fast while being annoying. There are assembly instructions to deal with this directly, so the whole thing is as fast as it can possibly be. It matters a lot to how programming languages actually work, we’ll come to that.
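Here’s a stack sketched in C – just an array plus one index playing the role of the Stack Pointer register. The real thing is done with push/pop instructions and the SP register directly, but the behaviour is exactly this:

    #include <stdio.h>

    #define STACK_SIZE 1024

    static int stack[STACK_SIZE];  /* the bit of RAM set aside for the stack       */
    static int sp = 0;             /* "stack pointer": index of the next free cell */

    static void push(int value) { stack[sp++] = value; }   /* put on top, move up     */
    static int  top(void)       { return stack[sp - 1]; }  /* peek at the top         */
    static int  pop(void)       { return stack[--sp]; }    /* read the top, move down */

    int main(void) {
        push(10);
        push(20);
        push(30);
        printf("top: %d\n", top());  /* 30 - last in, first out */
        printf("pop: %d\n", pop());  /* 30 */
        printf("pop: %d\n", pop());  /* 20 */
        printf("pop: %d\n", pop());  /* 10 */
        return 0;
    }

(No bounds checking here – a program that pushes past the end of the space it was given is the famous stack overflow.)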
Other commands simply send signals to some of the pins, which then travel via the bus to another component. If that component is connected to the outside world via a wire, the signal travels out along it. This is how you would send colours to the LEDs to make pretty lights – each LED has another small chip that deciphers the colour from the binary number and does its thing.
Other commands read signals from pins; again, if these are connected to something outside, those signals get read and interpreted.
Put these reads and writes together, add a convention about what the signals mean and this is how keyboards work, mice, RAM, hard disks, Bluetooth … everything – importantly, networks too, wires outside are just longer.
It is called information technology because most of the time all these things just encode, decode and interpret this information we send in this way. The trick with digital technology is that no matter what the medium is, all we need to do is figure out how to represent 1s and 0s (power on, power off, radio wave this shape, radio wave that shape, light on, light off … whatever); then we add the interpretation layer, some redundancy (because shit interferes in cables and radio waves), and we can go go go.
The whole thing was practically invented in the late 1940s, mostly by a nutter called Shannon. He gave us the “bit”, the first encodings, ways to figure out the noise and remove it … lots of shit. He was dealing with telephone and radio transmissions, but it makes no difference to the theory.
There are downsides, but we’ll get there.
Speed
So we have a basic framework:
- CPU executes instructions at every tick of its clock.
- Instructions and data are all encoded as numbers (in binary, but who cares).
- They are both stored in the same places (RAM, hard disks whatever).
- Numbers are encoded as electrical impulses and fly around all the time.
- They are encoded and decoded at each endpoint.
- Every component will have some sort of chip to do the encoding and decoding and to act upon it (these are called microcontrollers; each is a tiny computer in itself – one of these drives the lamp).
Whenever you talk about speed, there will be at least one asshole that will point out that the best optimisation is at the algorithm level – make your algorithm smarter so it works faster and you needn’t bother with the complexities of the hardware. No, it is not the best, it is the easiest, and we know that – but we don’t care right now.
Making sure your hardware is as fast as can be will make things run faster independent of how smart your programmers are; but it can be expensive. So we do all sorts of trickery.
One thing to keep in mind is that speed is hard, expensive, and hot. The cooling bills end up being the highest cost by far when things get really fast. You won’t be discussing this often – it ends up being dealt with by some engineers somewhere – but it matters a lot (from aircon in office buildings, where it is the biggest factor of them all, to why Iceland became a data-centre destination: cheap power and a naturally cold climate where the likes of Google could park their servers).
CPU
Two things affect the speed here – the clock speed (how fast can it churn through instructions), and how fast can you give it data to churn through.
You can buy a faster processor. If you are a hacker, you can “overclock” it – basically set it up to run at faster clock speeds. It makes them less reliable: they can generally do better than their advertised speed, but they are only tested up to a point and become unstable if you push too far. They also get hotter, so you need more cooling; at scale it becomes too expensive and unreliable.
You can get a processor that has multiple “cores” – basically on-chip parallelisation. Software has to know how to utilise that, but generally modern operating systems are good at it. It won’t directly multiply your speed because you still need to get it data to work on.
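A rough sketch of what “software has to know how to utilise that” means – splitting one job across threads so the operating system can spread them over the cores. This uses POSIX threads in C (Linux/macOS, compile with cc -pthread; Windows has its own API):

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N 40000000L

    static long results[N_THREADS];

    /* Each thread sums its own slice of 0..N-1; main adds the partial sums. */
    static void *partial_sum(void *arg) {
        long idx  = *(long *)arg;
        long from = idx * (N / N_THREADS);
        long to   = from + (N / N_THREADS);
        long total = 0;
        for (long i = from; i < to; i++)
            total += i;
        results[idx] = total;
        return NULL;
    }

    int main(void) {
        pthread_t threads[N_THREADS];
        long ids[N_THREADS];

        for (long i = 0; i < N_THREADS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, partial_sum, &ids[i]);
        }
        long grand_total = 0;
        for (int i = 0; i < N_THREADS; i++) {
            pthread_join(threads[i], NULL);   /* wait for each worker to finish */
            grand_total += results[i];
        }
        printf("sum: %ld\n", grand_total);
        return 0;
    }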
CPU utilisation is basically what percentage of time the CPU is working and what percentage of time it’s sitting and waiting for instructions or data. You can see it in the task manager – right now mine is at 6%, pathetic but typical.
So when you write code for things that take a long time, you try to maximise this. There are a bunch of ways to do it, but at the hardware level there are things we generally do as standard nowadays.
Memory & pipelines
The biggest bottleneck is getting data to the CPU core for it to churn through – both instructions and the actual data. The physical distance between the RAM chips and the CPU is a factor: signals take time to travel, and that caps how fast it can all be. Also, the speed at which RAM can fetch data internally and put it on the bus is way lower than how fast the CPU can process it. The reason is cost – it’s just expensive to make things fast: they need extra cooling, other things have to be fast to take advantage of it … the cost explodes. Sometimes people do build super fast and super expensive machines, but most often you rely on level 1, level 2 and level 3 cache.
Level 1 cache is a small RAM sitting on the CPU, it’s as fast as memory gets, and it’s right next to the actual processor. So what you do is copy the current chunk of data from RAM into it, then let the CPU churn through that instead. You keep doing this all the time, a separate component of the CPU takes care of it. Can’t make it very big.
Level 2 cache is the same idea, only a little slower but a bit bigger, so you populate that first, then Level 1 from that.
Level 3, again, another slower layer, but slightly bigger, so data goes RAM->L3->L2->L1.
Now, how do you know what data your CPU is gonna need next? There’s algorithms built into hardware to try and guess, sometimes they miss and they waste time so they throw what they loaded out and load what they actually need. These are called … cache misses 🙂
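You can feel the caches from perfectly ordinary code. The classic demonstration (timings will vary wildly by machine, this is purely illustrative) is walking a big 2D array row by row – neighbouring addresses, the cache loves it – versus column by column, where nearly every access is a cache miss:

    #include <stdio.h>
    #include <time.h>

    #define N 4000

    static int grid[N][N];   /* ~64 MB of ints - much bigger than any CPU cache */

    /* Row by row: consecutive addresses, so each cache line fetched from RAM
       gets fully used before the next one is needed. */
    static long long by_rows(void) {
        long long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += grid[i][j];
        return sum;
    }

    /* Column by column: each access lands N*4 bytes away from the previous one,
       so the caches keep fetching lines we barely use. Same work, much slower. */
    static long long by_columns(void) {
        long long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += grid[i][j];
        return sum;
    }

    static void time_it(const char *name, long long (*walk)(void)) {
        clock_t start = clock();
        long long sum = walk();
        printf("%s: sum %lld in %.3f s\n", name, sum,
               (double)(clock() - start) / CLOCKS_PER_SEC);
    }

    int main(void) {
        for (int i = 0; i < N; i++)        /* fill with something non-zero */
            for (int j = 0; j < N; j++)
                grid[i][j] = i + j;
        time_it("by rows   ", by_rows);
        time_it("by columns", by_columns);
        return 0;
    }

Same arithmetic, same number of additions – the only difference is whether the access pattern plays nicely with the cache hierarchy.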
Also, because even a single instruction gets executed in smaller steps (the fetch-execute-write cycle), how do you get these micro-instructions to the CPU as fast as you can? There are hardware solutions – instruction pipelines. Similarly to how caches get populated, these feed the smallest possible chunks of work to the CPU by trying to guess what’s next. You can, if you are insane or have a really good compiler, try to affect this in your code … generally you want your pipeline full as often as possible.
RAM on the other hand can work faster than the bus allows, so it will have its own buffers – faster little caches that it populates with data waiting for the bus to be ready to pick it up.
GPUs
For when you need real macho number crunching speed, you get special processors to do lots of arithmetic in parallel. Anything between dozens and thousands of calculations at the same time. Stuff that does dozens is often built into modern processors, so you can get some parallelisation there; it’s super fiddly but worth it for things like modelling analytics in quant libraries. For more than that, you get graphics cards ….
GPUs are special processors for those hyper-parallelised computations. They were first needed for video games, and still are. Special effects for movies these days rely on them completely. And … Machine Learning. They are basically grids of hundreds of cores with special fast busses, on-board caches, all sorts.
Disks
The slowest part inside a box is the disk. Hard disks are literally disks covered with magnetic material, they spin real fast. SSD disks are not disks at all, more like USB memory sticks. Hard disks are much slower than SSD ones, all way slower than RAM. But cheap as chips and store data even when you switch stuff off.
All the stuff above (caching, pipelines etc) is dealt with at the hardware level; you can have some control over how it works, but basically it’s built in. With disks it’s different. The operating system does the equivalent of L1 cache for you. Your code and data live on the disk, then the parts that are relevant get copied into RAM, and from there it executes as above. The chunks of data copied from the disk are called “pages” and the process is “paging”.
When you hear your hard disk buzz like crazy, it means lots of paging is happening. Either you are going through shitloads of data, or it keeps guessing wrong about what is needed next (page misses). Sadly, SSD drives don’t have moving parts so you can’t hear this any more; all you see is flashing lights at best. Geeks do this on purpose with hard disks to make music 🙂
A common trick that operating systems use is “virtual memory”: basically they assign an area on the disk, a file, that they use as an extension of RAM, so your RAM appears to be much larger than the physical RAM chips. The operating system then moves data between the disk part and the RAM part in the background using paging. It helps code run, because programs grab areas of RAM for themselves and you would quickly run out of the physical stuff. But it takes a lot of paging to make this work, so it is slower.
It matters because this is why you try to make your software not use too much memory – this can slow it down way more than anything else.
Within the confines of a single box, another thing that is slow is screen time – lots of data needs to fly around for that, and humans are slow, so are screens. Basically, interacting with humans slows you down.
Networks
Outside of the box, the slow part is networking. It’s a real mess. You are sending stuff down wires, which is generally slower; the distances are way longer (for L1 cache we are talking millimetres, network cables can be hundreds of metres). And then … drumroll … every wire is an antenna. It picks up everything from big bang noise to TV shows, solar storms, thunderstorms, stuff flying through other cables nearby. The world is just noisy, and this noise can interfere with the pulses you are sending, messing your bits all up.
One way around it is shielding – all network cables have a layer of tin foil or a thin wire mesh around them to catch this noise before it gets to your wires. Another way is making cables shorter, if you can, thicker if you can, better shielded if you can. Or you can use optical cables, thin strands of glass to send impulses of light down. They are faster and more resilient to interference but more expensive and more fragile.
What you always do is build in redundancy and send metadata that indexes your data, to make sure it all arrived. You break all your data into small chunks (packets), then to each you add information about how much data is in there, which packet of how many it is, checksum values (basically a sum of all the bits in the packet), and you do this in a few layers. It makes encoding and decoding much slower, but you can’t do without it – these packets get lost all the time. Then the sender and receiver also talk about what arrived, what didn’t, whether it should be sent again, etc.
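The checksum idea in miniature, in C. This is a deliberately naive version – real protocols use a fancier ones’-complement sum plus CRCs at the lower layers – but the principle of “add it all up, send the total along, re-add at the other end and compare” is the same. The packet layout here is made up for illustration:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Naive checksum: add every byte of the payload into a 16-bit counter.
       The sender stores the result in the packet; the receiver recomputes it
       and compares. A mismatch means the packet got mangled on the way. */
    static uint16_t checksum(const uint8_t *data, size_t len) {
        uint16_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += data[i];
        return sum;
    }

    struct packet {            /* made-up layout, for illustration only */
        uint16_t seq;          /* which packet ...                      */
        uint16_t total;        /* ... out of how many                   */
        uint16_t length;       /* how much data is in there             */
        uint16_t checksum;
        uint8_t  payload[64];
    };

    int main(void) {
        struct packet p = { .seq = 3, .total = 10 };
        const char *msg = "hello over a noisy wire";
        p.length = (uint16_t)strlen(msg);
        memcpy(p.payload, msg, p.length);
        p.checksum = checksum(p.payload, p.length);

        /* ... the packet travels down the wire, picking up who knows what ... */

        if (checksum(p.payload, p.length) == p.checksum)
            printf("packet %d/%d arrived intact\n", p.seq, p.total);
        else
            printf("corrupted - ask the sender for packet %d again\n", p.seq);
        return 0;
    }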
On top of all that, computers always have to find each other, the signal travels through a bunch of servers till the recipient is found. All this information also has to travel with your data, get decoded, interpreted, looked up, sent on to the next hop. The standard convention for this communication is a set of protocols called TCP/IP. There are libraries to deal with this for you, nobody does it manually any more, but it’s there.
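“There are libraries to deal with this for you” bottoms out at the operating system’s socket API. A bare-bones sketch of opening a TCP connection in C on Linux/macOS – error handling mostly skipped, and the address and port are made up:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void) {
        /* Ask the OS for a TCP socket; it handles the packets, checksums and
           retransmissions - all the TCP/IP machinery - on our behalf. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in server = { 0 };
        server.sin_family = AF_INET;
        server.sin_port   = htons(8080);                   /* made-up port    */
        inet_pton(AF_INET, "192.0.2.1", &server.sin_addr); /* made-up address */

        if (connect(fd, (struct sockaddr *)&server, sizeof server) == 0) {
            const char *msg = "hello";
            send(fd, msg, strlen(msg), 0);  /* the OS chops this into packets */

            char reply[256];
            ssize_t n = recv(fd, reply, sizeof reply - 1, 0);
            if (n > 0) {
                reply[n] = '\0';
                printf("got back: %s\n", reply);
            }
        }
        close(fd);
        return 0;
    }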
The medium over which you do this can change – wires (ethernet) are by far the fastest and go long distances. Radio waves can work over shorter distances (Wi-Fi and mobile networks, sometimes satellites) but are a lot more prone to interference. This is why solar storms fuck up mobile networks.
The hardware inside your computer that does the first, most direct part of this communication is the ethernet card if it’s over a wire, or the Wi-Fi adapter if it’s by radio signal.
You do what you can, but it’s still the slowest part by far. We call the time spent in transport “wire time”. It’s your biggest enemy in any distributed computation, clouds and all. Generally you look for software solutions – pre-loading all the data you can on the server, in various ways, is really the crux of it all. Most of the time spent optimising this goes into figuring out what you can preload and how. Another trick is streaming – break your data into small chunks, small enough to give the endpoint something to do while the next chunk arrives, and keep sending chunks all the time. Think YouTube – playing a 10-second chunk takes 10 seconds, while sending it takes milliseconds, or seconds on a slow day, hopefully. It’s actually so slow sometimes that you buffer it – collect enough chunks to play for a while before the rest arrives, and only start then.
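Streaming in its simplest possible form – reading a file in fixed-size chunks instead of loading the whole thing before doing anything. A toy C sketch (the chunk size and the file name are made up; real streaming adds buffering, compression, adaptive quality and so on):

    #include <stdio.h>

    #define CHUNK_SIZE (64 * 1024)   /* 64 KB at a time - an arbitrary choice */

    /* Read a file chunk by chunk; each chunk could be pushed down the wire
       while the next one is being read, so the receiver never has to wait
       for the whole file. Here we just count bytes to keep it self-contained. */
    int main(void) {
        FILE *in = fopen("video.bin", "rb");   /* hypothetical input file */
        if (!in) { perror("fopen"); return 1; }

        char chunk[CHUNK_SIZE];
        size_t n, total = 0, chunks = 0;
        while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
            /* ... this is where you would send the chunk to the client ... */
            total += n;
            chunks++;
        }
        printf("streamed %zu bytes in %zu chunks\n", total, chunks);
        fclose(in);
        return 0;
    }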
The layout of the data really matters too – once wire time is included, every bit of access speed you can get counts. Often you build network caches just because of this. You preload trade data from all over the place to reduce initial load and processing time, then put it on a very fast machine that is hopefully physically close to the servers that need it. You can have a number of servers that take data requests in turn to share the load.
Finally, you can have special purpose built hardware. Typically all sorts of databases run on some version of such hardware. RAIDs are one very common way to deal with it: you have a database server, or a data cache server, something that stores a lot of pre-loaded data for other servers to use. It gets hit by requests for data all the time. But for each request you have to go to the disk, find the data, do what processing you must (SQL) then send it back. What happens is that while you are dealing with the disk time, requests that arrive from the wire wait patiently. But often what you send back is a relatively small amount of stuff (e.g. if you are streaming). So disk access becomes the bottleneck.
What you do with RAID is install loads of disks, all containing exact replicas of the same data. So if you have ten of these, you can process ten requests almost simultaneously. If you really care, you build them to be massive (hundreds of replicas) and you also build special hardware to make the bus times super fast too. FPGAs are a type of hardware sometimes used to make this work – they have lots and lots of parallel pipelines to send data through. It’s exotic and novel right now, but cheaper than GPUs (GPUs also do fast arithmetic, FPGAs just shift data around in parallel really fast).
Essentially you parallelise all the hardware that shifts data around. It’s super expensive though, you really have to care for speed to do it.