Oddly, the part about the actual magic that lets us generate meaning from hardware is the most tedious one.
The fact that we can find ways to represent all the symbols that have meaning to us using sequences of nothing and something is exciting, even inspiring: we can generate meaning out of so little. This section is all about how we do it. The details about how it actually happens are interesting the first time around, I guess; if I were you, I’d try to make a light read of it. Don’t worry about remembering the details, you can always look them up, but try to marvel at how meaning emerges from almost nothing.
What are binary numbers?
They are just a different way to write a number, a different encoding of the same abstract concept.
First, think about how decimal numbers work.
It’s positional: we work with 10 different symbols (digits), and the value each digit represents is a combination of the value we assigned to that digit and its position in the whole number.
So, 2 is two, 20 is twenty, 200 is two hundred and so on – we multiply the value of the digit by 10 each time it moves to the left. 2=2*1, 20 = 2*10, 200=2*100 …
When there’s more, we just add them up:
245 = 2*100 + 4*10 + 5*1
The pattern is: 10 digits, the largest is 9, smallest 0. The positions multiply these values by powers of ten, starting with 10^0 = 1, then growing as you move to the left.
3245 = 3*10^3 + 2*10^2 + 4*10^1 + 5*10^0
We can do positive integers like that, cool cool.
For negative ones we just add a – in front.
What about decimals?
Same thing, just keep reducing the powers as you move towards the right, and remember that a negative power means one over the positive power …
10^-1 = 1/10, 10^-2 = 1/100 ….
So:
3245.134 = 3*10^3 + 2*10^2 + 4*10^1 + 5*10^0 + 1*10^-1 + 3*10^-2 + 4*10^-3
Here it becomes woolly – not every number can really be written down like this. First, obviously, irrational numbers – by definition you can’t write them down as a fraction, and they have an infinite, never-repeating string of digits after the decimal point. They will always be just approximated. Pi is one, root 2, root 3 … there are actually more of them than all the other ones put together. (See Cantor’s diagonal argument for this – you don’t have to, but it’s pretty.)
But it gets worse, there are numbers that are completely knowable, we can write them down as fractions, but not completely in this notation.
For example, ⅓. Dead simple, a third, alas, in decimal notation it will end up looking like:
0.3333333333333333…. Infinitely many 3s.
We have notation for this (a dot above the last three), but we cannot actually write it down in a finite number of digits. We’ll come back to this.
Let’s look at binary numbers now.
The thing is, you can actually use the same pattern for any number of digits – the position multiplies the value of the digit by powers of the base, which is one more than the largest digit.
For example, the octal system has only eight digits – 0,1,2,3,4,5,6,7 and the position multiplies the value by powers of 8.
Octal 234 becomes 2*8^2 + 3*8 + 4*1 = 2*64 + 3*8 + 4 = 128+24+4 = 156 in decimal notation.
Binary only has two digits, 0 and 1, and the position multiplies by powers of 2.
Binary 100101 becomes 1*2^5 + 0*2^4 + 0*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 32+4+1 = 37 decimal.
Woohoo – we can represent all numbers with just two symbols, two states, two somethings; one for something, one for nothing. We cracked digital … ok, there’s more a bit later.
Hexadecimal notation is based on 16, we borrow letters to mean digits: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f. (where a=10, b=11, c=12, d=13, e=14, f=15)
Hexadecimal a15d then becomes 10*16^3 + 1*16^2 + 5*16^1 + 13*16^0 = 41309 decimal
Same thing, different symbols.
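If you want to poke at this yourself, Python’s built-ins will do the conversions for you – a minimal sketch:

n = 37
print(bin(n))          # '0b100101' -> 1*32 + 0 + 0 + 1*4 + 0 + 1*1
print(oct(156))        # '0o234'    -> 2*64 + 3*8 + 4
print(hex(41309))      # '0xa15d'   -> 10*4096 + 1*256 + 5*16 + 13

# And back again - same abstract number, different symbols:
print(int("100101", 2))   # 37
print(int("234", 8))      # 156
print(int("a15d", 16))    # 41309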
Notice how when you have more digits the representation becomes shorter? Handy – encodings like base 64 get used in cryptography and data transfer, because the numbers involved are huge and a bigger base keeps the written-out files and strings shorter.
All rules of arithmetic work in any number system: you add numbers digit by digit, and if you get more than fits in that place you carry 1 over to the next place to the left … and so on (the carry part actually causes all sorts of problems with computers).
Anyway …
Dealing with computers
Ok, so we have a representation of numbers with just 1s and 0s. You dig a little bit more into it and you see that 1111b = 15d (b and d are what you think they are, binary and decimal – there’s loads of different notations for this). But also, #f = 15d (# is often used to mean hexadecimal).
Sooooo, to represent all numbers up to 15 you need four binary digits and one hexadecimal digit. This actually continues up, so for numbers up to 255 you need eight binary digits and only two hexadecimal ones. It’s a super handy mapping (and is because 16 = 2^4). This is why hexadecimal numbers are popular – two digits for a byte, four for a word and so on. And the translation is super easy too, each hex digit maps to four binary digits so just translate them one by one in order and you’re done. Saves typing and lets you think in bits and bytes more easily.
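A quick sketch of that mapping in Python – format() with "04b" pads each value out to four binary digits:

for hex_digit in "f", "a", "1", "5", "d":
    value = int(hex_digit, 16)
    print(hex_digit, "->", format(value, "04b"))   # f -> 1111, a -> 1010, 1 -> 0001 ...

# Two hex digits cover exactly one byte, so 0xff is the largest 8-bit value:
print(0xff == 0b11111111 == 255)   # True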
You’ve gathered by now that “binary digit” is used interchangeably with “bit”, but there is a philosophical subtlety: a bit is the smallest unit of information. That is a very broad statement; it applies to physics, information theory, radio waves … all sorts. A single binary digit can represent one bit of information, but it is tied to humans – it is a representation of a bit, not the bit itself. Not that it matters before you start thinking about the structure of the universe, but good to know.
Why is a byte eight bits long? It’s purely technical – the size of the data we can move around at once along the bus, store in memory, deal with in one go was always going to be some number of bits – one per channel, one per flip-flop gate and so on. You want to minimise this – more width, more cost – but you still need it to be useful.
If you map every letter of the English alphabet to a number, and add digits, basic punctuation and such, you end up being able to represent written words with a number between 0 and 127 for each character. That’s 7 bits. You do need some more stuff in there – special codes for control sequences, extra letters from other alphabets – and 8 is itself a power of two (1000 in binary); you also need to encode CPU opcodes in one number each, and 127 is tight … it all works out pretty nicely with 8 bits.
Later, when we could afford to grow these things, it became convenient to just double the size, so old 8-bit stuff still fits snugly into cells 16 bits long, then 32, and 64 these days.
Ok, we made a call: our memory, CPUs, buses, all of that is going to be 8 bits wide. (64 these days, but I can’t be bothered to type all the big numbers later; it all works the same.)
Nice, but … we can only represent numbers 0-255 with that.
To deal with larger integers, we split them into two or four parts, depending on how big we need them to be. Two bytes goes up to 65535, four up to 4294967295. It becomes a little messy at the assembly level (to get a large number you have to go to RAM twice, arithmetic gets fiddly because the carry has to move between bytes, etc). But it all works out in the end. It’s generally built into hardware – there is a special part of the CPU, the ALU (Arithmetic Logic Unit), that deals with it all.
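Here is a small sketch of that splitting for a 16-bit value – Python’s own integers are arbitrary size, so this is only to show the idea:

n = 1000                      # doesn't fit in one byte (max 255)
high, low = divmod(n, 256)    # split into a high byte and a low byte
print(high, low)              # 3 232, because 3*256 + 232 = 1000
print(high * 256 + low == n)  # True - and back again

# The standard struct module does the same packing for you:
import struct
print(struct.pack(">H", 1000))            # b'\x03\xe8' - two bytes, big-endian
print(struct.unpack(">H", b"\x03\xe8"))   # (1000,)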
Many programming languages have different types for integers of different sizes because of this: short, int, long – the smaller you keep it the faster it goes. Rarely matters on a 64 bit computer, but good to know where it’s from.
Another problem is counting …
First, how does addition work in binary? Same as decimal, just smaller …
Say you have 101b + 1b
You write them one under the other:
101
001
—–
Start from right to left … 1+1 = 2, no digit for 2, so we write 0, carry 1 over
Next place 0+0 = 0, but we have to add the 1 that we carried, so 0+0+1 = 1
Next place 1+0 =1
We get 110.
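You can check this in Python – 0b is the prefix for binary literals:

print(0b101 + 0b1)        # 6
print(bin(0b101 + 0b1))   # '0b110'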
Now, say you start counting on an 8-bit computer: 1, 2, 3 … adding one to the previous number every time. Eventually you hit 255, and adding one more looks like this:
11111111b + 1b
You do your usual arithmetic: add one to the last digit – oh, it’s 2, so we write 0, carry 1 – add that one to the next digit, oops, again, ok, write 0, carry 1, and so on; 0s everywhere and you carry the last one to one more place to the left:
100000000b
Nice BUT omg omg omg – this is 9 bits. Can’t fit into memory, can’t fit into registers, can’t fit on the bus. All that fits into the 8 bits is 00000000b.
This is called overflow.
The output that comes out is 0, because that’s all that fits into the 8 bits. The poor one that’s pushed out is gone 😦
CPUs generally have a special flag to denote that overflow just happened, but programming languages don’t often handle this; you just gotta be careful or you’ll end up counting up and then suddenly be back at 0 again.
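Python’s own integers never overflow (they just keep growing), so to see the wrap-around you have to throw away everything above 8 bits yourself – a rough sketch, masking with 0xff to keep only the lowest 8 bits, which is what an 8-bit register does for free:

counter = 255
counter = (counter + 1) & 0xff   # keep only the lowest 8 bits, like an 8-bit register
print(counter)                   # 0 - the ninth bit fell off the end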
Ok ok, we can live with that, we’ll figure it out … but what about negative numbers? Where do you put a minus?
That’s where it gets really weird. You use something called two’s complement … it’s clever but super boring. I read it a million times, worked it out with pen and paper a thousand times and forgot all about it within an hour each time. Here it is, just for completeness, but I would not worry too much: https://en.wikipedia.org/wiki/Two%27s_complement. In the end, the reason it works is this rolling over when you overflow … Back to that example above, if you ignore the overflow:
11111111b + 1b = 00000000b
So, when you add 1 to 11111111 you get 0: if you close one eye and ignore the last carry bit, 11111111 works like -1 does.
But, if you think of it as an unsigned number, then
11111111b = 255d
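A little sketch of the two readings of the same 8-bit pattern (again masking with 0xff, since Python’s integers aren’t 8 bits wide):

pattern = 0b11111111

print(pattern)                  # 255 - read as an unsigned byte
print((-1) & 0xff == pattern)   # True - minus one squeezed into 8 bits is the same pattern

# Adding 1 and ignoring the overflow gives 0, which is why the pattern behaves like -1:
print((pattern + 1) & 0xff)     # 0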
The truth is, the machine doesn’t care what we call a certain pattern of bits or how we interpret it. We use some properties of these patterns, and of the rules for manipulating them, as best we can to model abstract concepts that have meaning to us. These concepts can be data, but data in isolation means nothing – it is mostly defined by what you can do to it or with it; so we also model operations/behaviour the same way. This point is pervasive; remembering it all the time will change your view of programming computers.
Encoding other data
We now know how to represent integers in a digital computer. There are limitations, but it works most of the time.
With that, we can do soooo much more (and this was the crucial insight by Ada Augusta, long before any of this was made real).
Text
For example … how do you represent text? Agree a convention by which you assign a number to each letter. By far the most common encoding is ASCII (https://www.ascii-code.com/). It is a 7-bit code, but bytes are 8 bits, so there are numbers 0 to 255 to play with. It is an American standard, so only the English alphabet. Capital A is 65, B is 66 etc, then come the lowercase letters. Punctuation, mathematical symbols and digits are in the range 32-64 (space is 32). 0-31 contains non-printable characters: newline, carriage return, tab, vertical tab, and a bunch of weird stuff that made sense for ancient terminals or printer-based computers (like “bel”, backspace etc) – the idea is you could send a stream of these to a terminal or a printer and it would know what to do … we still use some, but they were really instructions for printers back then, more like automated typewriters. 0 is NUL; it is used to denote missing data, the end of strings etc. That covers everything 0-127.
128-255 contains special characters, like letters from a selection of other alphabets, but is far from standardised.
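In Python, ord() gives you the number behind a character and chr() goes the other way:

print(ord("A"), ord("B"), ord("a"))   # 65 66 97
print(chr(65), chr(48), chr(32))      # A, the digit 0, and a space
print(ord("\n"), ord("\t"))           # 10 9 - the newline and tab control characters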
Clearly this is not enough for computers to be sold to other countries; you need more characters even for the European languages alone. The first thing that happened is that IBM and Microsoft pushed this idea of “code pages” – a configuration parameter that changes the mapping of the letters in the range 128-255 to a different set. It’s terribly complicated, non-standard, awful.
Then, when 16-bit computers became standard, a committee of assholes came up with the idea of one giant character set with several encodings, where a single character can take anything between 1 and 4 bytes – this is Unicode. Depending on which encoding you choose (UTF-8, UTF-16, UTF-32), a different mapping of the whole range across multiple bytes is used. Ok, you can now do Chinese and Sanskrit, but the encoding is a total nightmare.
The one used most often by far – most of the web uses it – is an encoding called UTF-8. The first 128 characters are the same as ASCII and take one byte each; above that you get all sorts of foreign letters, symbols and things, using more bytes per character. Basically, most of the world is still using ASCII and just calling it something else.
This will be relevant in Python.
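A tiny sketch of what that looks like in Python:

print("hello".encode("utf-8"))       # b'hello' - one byte per letter, same as ASCII
print("é".encode("utf-8"))           # b'\xc3\xa9' - two bytes
print("漢".encode("utf-8"))          # b'\xe6\xbc\xa2' - three bytes
print(b"\xc3\xa9".decode("utf-8"))   # 'é' - and back again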
Bitwise operations and data packing
Many programming languages don’t let you do this easily, but it is relevant for understanding other things.
A byte contains 8 bits. Sometimes you only need 1, or 2, or 3.
For example all sorts of flags: is a feature turned on or off? A single byte can hold 8 of those, because each bit can be on or off. So you don’t interpret the bit pattern as a single number, but as a collection of boolean values (true or false). Some of the registers on any CPU will use this technique; it’s very popular in C.
More rarely, if you are really desperate for space, you can pack other data in there: something that takes values 0-7 can fit into three bits, something that takes values 0-31 into five, so you can pack one of each of these into a single byte.
How do you set and read individual bits in a byte or a word?
You use properties of boolean algebra, or mathematical logic, or propositional logic … all the same stuff; it goes back to ancient Greek philosophy.
The idea is that you have things that can be true or false, and you can do operations with them. Operations are AND, OR, NOT and often XOR (exclusive OR). If you have two boolean values, x and y, then:
x AND y is only true if both x and y are true
x OR y is true when either of the two values is true (or both are true)
x XOR y is true if either of them is true, but not both
NOT x is only true if x is false
Generally, programming languages have a separate type for boolean values, though many let you just use numbers: 0 for false, anything else for true.
But … you can say that each bit in a byte is a boolean value (0 for false, 1 for true) and you can apply these operations bit by bit across the whole byte. This is bitwise logic, or bitwise arithmetic. Some languages have special syntax for these bitwise operations.
You don’t need to go into the details of this now, but you should just be aware of it. There is terminology there like masks, xoring etc. You can also shift bits left and right within a byte – that gives super fast multiplication or division by powers of two.
People do all sorts of super fast stuff with this, and it is very popular in C, and most operating systems are written in C, so you see these pop up often.
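A rough sketch of the usual tricks in Python – the flag names here are made up for illustration:

FLAG_BOLD      = 0b00000001   # one bit each
FLAG_ITALIC    = 0b00000010
FLAG_UNDERLINE = 0b00000100

style = FLAG_BOLD | FLAG_UNDERLINE    # set two flags in one byte
print(bool(style & FLAG_ITALIC))      # False - the italic bit is not set
print(bool(style & FLAG_UNDERLINE))   # True  - the underline bit is set
style &= ~FLAG_BOLD                   # clear the bold bit again

print(6 << 1)   # 12 - shifting left multiplies by 2
print(6 >> 1)   # 3  - shifting right divides by 2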
Decent guide for doing this in python, read later https://realpython.com/python-bitwise-operators/
Floating point numbers
Then we come to numbers with a decimal point. Now that’s really hard.
We call them floating point numbers. The core idea is based on what calculators call scientific notation. Basically, you can write any such number in a format like this:
123.4557 = 0.1234557 * 10^3
0.000123 = 0.123 * 10^-3
The first part (0.1234557 and 0.123) is called the mantissa, the power of 10 (3 and -3) the exponent.
You can do the equivalent thing with binary numbers, except that the mantissa is binary and the exponent is a power of 2. Now take a few bytes (4 or 8 usually), assign a few bits of them to store the exponent, and the rest is the mantissa. You have a decent representation that covers a huge range of numbers. There are generally two sizes in use, corresponding to two data types in most languages: float and double (most often floats are four bytes and doubles eight, but there are no guarantees – the only thing that is guaranteed is that a double has at least as many bits as a float).
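You can peek at the mantissa/exponent split in Python with math.frexp, which returns (m, e) such that x == m * 2**e, with the mantissa between 0.5 and 1:

import math
print(math.frexp(12.0))   # (0.75, 4)  because 0.75 * 2**4 == 12.0
print(math.frexp(0.1))    # (0.8, -3)  because 0.8 * 2**-3 is (approximately) 0.1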
There are a whole lot of problems. Firstly, because of the limited size, you end up with numbers that you cannot represent exactly (remember ⅓ from a few pages ago? – same thing). And sometimes it is silly; for example, you cannot represent decimal 0.1 exactly in binary. You will end up with something very close to 0.1, but not exactly it.
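The classic demonstration in Python:

print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# The decimal module shows what 0.1 actually got stored as:
from decimal import Decimal
print(Decimal(0.1))   # 0.1000000000000000055511151231257827021181583404541015625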
Secondly, rounding errors – again, because of the limited size of the mantissa, you will not have enough room for all the digits, so you have to round the numbers. The trouble is, once you start adding or multiplying numbers that were rounded, the size of the error increases. Sometimes real fast. And you simply cannot get rid of this. You just have to live with it.
Thirdly, the rounding part is not consistently implemented across all processors, nor across all programming languages. So you will end up with slightly different numbers when you run the same code on Windows or on Linux, or even on different versions of Intel CPUs, or in Python and C++ and Excel and so on. These differences are just a fact of life – the last few digits are always just assumed to be junk.
How many of the digits are junk? In theory you can work it out if you go through every computation that happens, but in practice there are rules of thumb and nothing much else. Because almost anything that needs floating point numbers will do lots of calculations with them.
Then, what do you do if your numbers are of very different scales? Say you want to add 0.00000000000000000000000123 to 12324098204982098409840198324.1 … you’re basically fucked: the small number will end up being treated as 0. This is because the digits of the large number take up all the room in the mantissa of the result; the small number, even though you can represent it on its own, only has an effect way off to the right.
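A sketch of that absorption, with two made-up numbers of wildly different scales:

big = 1.0e28
small = 1.0e-24
print(big + small == big)   # True - adding the small number changed nothing at all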
Finally, the one that I want to research – chaos. A chaotic system is, by definition, extremely sensitive to its initial conditions. This means that every single digit matters, but we can’t ever represent all the digits; we always lose information.
Having said all of that … it does work for most things, but you have to be careful.
For example, in the early days of GPUs they could only deal with floats, not doubles – simply not enough precision for trading models – and until their architecture was upgraded to handle doubles, banks couldn’t use them.
Also, in your own code, you never compare floating point values for equality directly. You never do x==y if x and y are floating point numbers. Because rounding and all else means that the last bits are essentially random. You always do something like: abs(x-y) < epsilon , where epsilon is some small number, like 1e-12, or 1e-8 often.
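In Python that looks something like this (the epsilon here is picked just for illustration; the standard library also has a helper for it):

x = 0.1 + 0.2
y = 0.3
print(x == y)               # False - don't do this
print(abs(x - y) < 1e-12)   # True  - compare against a small epsilon instead

import math
print(math.isclose(x, y))   # True - the standard library way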
There is a seminal paper on this, but I don’t recommend it for a while yet https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf .