The only important things when programming computers are thinking and the communication about thinking.
Not a very strong statement, I know; it applies to everything, so why is it helpful?
Because we built all this knowledge and the machines themselves in a furious search for meaning, in a time when meaning was crumbling all around us. Like we do every couple of thousand years or so, we then turned inwards, looking to understand why we do what we do. We were naïve, and instead of embracing the freedom to love and create beauty that the automation of the industrial revolution brought us, we tried desperately to use the mechanics of automation to rein chaos in, through simple, strict, and automatable rules.
The greatest minds of the time, true giants of thought, wasted away their sanity in a global effort to find a way to formalise and ultimately automate thinking. They failed, of course they did, but along the way we learned exactly how far we can automate it (up to the point where we need imagination and abstraction). Their efforts were valiant, even heroic: through them we were first freed of life-sapping labour, and then of soul-destroying tedium. It is the stuff of legends, legends of a promethean kind; legends of such power and utility that the utter failure they ultimately faced is, if not forgotten, simply ignored.
Books, lectures, people, research … the entire discourse is flooded with accepted “wisdoms”, advice on “good practices”, procedures and processes, methodologies and recipes of ever increasing complexity that are all meant to ensure production of “good code”, by liberating you of the burdens of insight and creativity.
It’s bullshit, and it is overwhelming: the safety of unbreakable rules is too seductive, the promise of advancement based on algorithmic certainty too lucrative.
In the face of that, it is hard work to be convincing and to even remain convinced that if you want to build things that are good and beautiful, you have to understand what you are building; and that if you want it to make sense, you have to communicate about it (sense being made is an intimate experience of a communal event).
And so, it is actually helpful to keep in mind always that the only important things when programming computers are thinking and the communication about thinking.
We discovered programming languages as tools to translate those communiqués into a physical, executable form and into actions, in so far as we can do it.
Along the way, we learned a whole lot about abstractions that we deal with when modelling the world in our minds; over time we came up with a set of core concepts that work really well when we talk about those thoughts that can be automated. It was hard, and we talked a lot about it; as a result, all programming languages share the same core concepts.
The only exceptions are when a concept is tied closely to a particular paradigm – for example, the pure functional programming paradigm has no concept of objects, so some functional languages won’t have support for classes or inheritance. There will always be an equivalent concept, even when it is not obvious; these are all equivalent variations of the mathematical ideal of an algorithm.
The concepts themselves can be viewed as either abstractions or ideals. When viewed as abstractions, each of them is a simplification of operations performed commonly on a digital computer. When viewed as ideals, each of them is a semantic form used to describe an algorithm. Whether this duality is an artefact of the history of development of computers, or of the existence of platonic ideals of algorithms (which is why computers work like they do), would be a matter of a heated philosophical debate in better times than the ones we live in. As is, allusions to the possibility of the platonic context often get laughed at (e.g. Penrose); but ideals are eternal, there is time. We will be paying attention to both interpretations, it helps dialogue to be aware of them, even when most participants are not even aware of the distinction.
Context
Computer code always exists within the context of a problem it is trying to solve, or a feature it is providing, or the section of the physical world it is trying to model … call it what you like, they are all equivalent. Context is what lends meaning to the content of a block of code.
At the machine code level, context is not in any way visible; naturally, because it takes a mind to assign or perceive meaning.
Programming languages deal with context in various ways in different aspects of abstraction. The most common implementation of it is scope – part of the code where a name is valid (a name of a variable, a function, a class … any name). All languages deal with this in some way: namespaces, modules, libraries, and implicit scopes in blocks of code are all examples of this.
Generally, when describing these, documentation talks about “variable lifetimes”, “grouping of functions”, “modularization”, or something similar; context is rarely, if ever, used directly as a notion. It’s a shame, because much confusion and arrogance stems from the lack of this understanding.
It is true that it is sometimes inconvenient to insist on the importance of context, because it opens doors for breaking conventions, but then, such is the nature of seeking meaning, and, ultimately, creativity.
On your journey through programming computers and, sadly, talking about it with other people, the examples of this will be emerging relentlessly.
Syntax and Semantics
Programming languages are parsed by computers. For this to work, the syntax has to be very strict. Some will insist on semicolons at the end of each line, some will insist on a very particular use of spaces and tabs; brackets, dots, arrows, curly braces, and other symbols will all have special meaning. There is not much room to exercise imagination in using the syntax of a programming language. When you start learning a new one, that’s the first thing you need to master, and then you just have to stick to it. It seems limiting at first, but we quickly find other ways to express the form of our thoughts.
Semantics in programming languages refers to parts that are not as explicit, but have implied meaning in the context of machine level implementation. For example, passing values by value or by reference is called “invocation semantics”. It’s a little harder to learn, and it is limited to only the meaning within the implementation of the language itself, but it is generally well documented and it is about as strict as syntax.
We won’t spend any time on either of them here; we are not learning a particular language, we are diving much deeper than that, we are talking semiotics and meaning of things; metaphors and abstractions. We will be looking at how to tell stories about our domains and what we do in them.
Syntax and semantics of particular languages … you will pick those up from documentation, tutorials, and reading other people’s code. That’s the easiest part.
Data Types
Within any given context, there are categories of things. They are defined by their interactions with the context and with each other, and they carry meaning within the context. Within a computer program, everything that you can name will have a type. Most obviously this includes data values, but you can also name functions, and other, even more abstract, things. Many programming languages deal with types automatically – they deduce them from the context of your code alone (e.g. Python); the deductions do sometimes fail and you have to intervene, but that is rare. In other languages you have to declare the types of everything explicitly; this allows compilers to slap you on the wrist if you didn’t think your abstractions through carefully enough.
In all these cases though, the types do exist, and they are as important to the code as the description of the steps of the algorithm. Often the boundary is not even clear. For example, instead of writing code to sort a list of numbers or words, you can store the numbers (or words) in a red-black tree – a data type that keeps data in sorted order by construction.
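A tiny illustration of that in C++ (std::set is typically implemented as a red-black tree, so iterating over it already gives you the elements in sorted order):

    #include <iostream>
    #include <set>

    int main() {
        // Insert numbers in arbitrary order; the tree keeps them sorted by construction.
        std::set<int> numbers = {42, 7, 19, 3};
        numbers.insert(11);

        for (int n : numbers) {
            std::cout << n << " ";   // prints: 3 7 11 19 42
        }
        std::cout << "\n";
    }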
The choice and the definition of the type of things in your code is the core of the design part of programming. This is how we model the concepts of the domain at hand, and this is the key problem with most poorly designed code: to do this part well, you have to understand the domain well.
This is not to say that there is always a single best choice of abstractions: understanding a domain and modelling it with abstractions in a mind will depend on a point of view and on personal experience. Much like no two painters will paint light in the same way, yet all great painters make it clear that they are painting light by how they depict its effect on the scene, the shadows and reflections, a good choice of data types will be complete and consistent within its context. Elegance and simplicity then flow naturally.
Sometimes choices are obvious given a prevailing discourse in the domain, and since code is text, and therefore intended to be read, these most often do get preference instinctively. You wouldn’t generally model humans with a type called Pigs, but then, Orwell did.
Object oriented programming is a paradigm that makes modelling of data types as the core activity completely explicit, and prescribes a set of rules that must be followed. On the other extreme, pure functional programming makes this completely implicit by pretending to ignore it. The difference is just a matter of formulation though: functions act upon data, and not all functions can act upon all data. Your code will only work, and your program will only make sense, if you apply suitable functions to suitable data. Whether you make it explicit or not, this groups things into categories; and there is, for that reason, a concept of a type of a function (it is called a signature).
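A small C++ sketch of the idea that a function has a type – its signature – which you can name and pass around like any other type:

    #include <iostream>

    // Both functions have the same type: double(double, double).
    double add(double a, double b)      { return a + b; }
    double multiply(double a, double b) { return a * b; }

    int main() {
        double (*operation)(double, double) = add;   // a name for "any function of this signature"
        std::cout << operation(2.0, 3.0) << "\n";    // prints 5
        operation = multiply;
        std::cout << operation(2.0, 3.0) << "\n";    // prints 6
    }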
Mercifully, most programming languages provide features to switch between the two views and mix them at will.
There are attempts to formalise this (Category Theory), but they can’t deal with meaning and end up terribly convoluted trying to make up for that (and failing). Somewhere in there is a formal definition of an “abstract data type” that is useful, though: “An abstract data type is a set defined by the set of functions that can be performed upon its elements”.
Some languages implement this idea for type deduction and call it implicit typing. There, you never declare the types of your variables or functions; they are implied by the code. A common way to implement this is duck-typing: if it walks like a duck, and it quacks like a duck, it’s a duck. Python does this; it will deduce the types of your variables from what you do to them, without any need for declarations. In Python, types are also dynamic: the type of a variable can change, just assign something else to it. Newer versions of Python allow you to declare types if you really want to; few people use it, and those who do are considered geeks by other geeks.
Most other prominent languages are much stricter about it, and you always have to declare the types of things and stick to them. C++ doesn’t let you change the type of a variable once it is known, but does allow for some type deduction (sometimes you have to declare types; most often you can just say the type is “auto” and it gets deduced for you from the type of the first value assigned to it). Compilers use this to check your thinking, by checking that you are using types in your code consistently.
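A small C++ sketch of explicit declarations, “auto” deduction, and the compiler checking consistency:

    #include <iostream>
    #include <string>

    int main() {
        int count = 3;                   // explicit declaration
        auto price = 19.99;              // deduced as double from the first value assigned
        auto name = std::string("apples");

        count = 4;                       // fine, still an int
        // price = name;                 // compiler error: a std::string is not a double

        std::cout << name << " x" << count << " at " << price << "\n";
    }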
The argument for implicit typing is that it makes coding easier, because declaring types is not how we normally think about things or talk about them. It’s a sweeping generalisation; Plato and Wittgenstein would disagree vehemently. It does make coding faster, and when it works it’s cool.
The argument for explicit typing is that writing and reading code are both hard enough as it is, and that implicit typing will never be perfect or efficient, because it can do some simple deductions but can’t know the meaning of the value; so you should be forced to think about these things and adhere to strict form. It’s a sweeping generalisation; Rushdie and Cohen would disagree vehemently. It does often help clarify the implementation details, sometimes even the meaning, and when it does it’s cool.
Sometimes sub-contexts change, and it makes sense to change the types of values, to even convert a value from one type to another. This is called casting. For example, while reading a text file, “123” is a string, but if you want to double it, you would convert it to a number. (Some languages have a stricter definition of casting; C++ has four different types of casts, explicitly named. It gets annoying, so you can call them type conversions instead.)
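A minimal sketch of the “123” example in C++:

    #include <iostream>
    #include <string>

    int main() {
        std::string text = "123";             // read from a file, it is just text
        int value = std::stoi(text);          // explicit conversion from string to number
        std::cout << value * 2 << "\n";       // prints 246

        double d = 3.9;
        int truncated = static_cast<int>(d);  // one of C++'s explicitly named casts
        std::cout << truncated << "\n";       // prints 3
    }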
Because of the dependence on the context and meaning, it is impossible to make languages do this automatically in full generality. Some conversions are automatic, but only the simplest ones, or the ones you specify yourself as being sensible. Often, having to cast in your code is a warning that you didn’t model your domain correctly to start with – it may mean that the original type is off; it is good to keep this in mind for sure. However, there will be times when it is sensible, even though casting is often frowned upon by default… a cast in the code means you have to think about why it is being done, and thinking scares people. I wouldn’t worry about it too much, as long as you are clear about why you’re doing it.
Data Schemas
As with everything else, describing one thing in isolation is rarely meaningful. The ability to describe all data types relevant to a context, and their relationships, is so useful that it is the most common starting point for modelling a domain. It makes sense too: before thinking about what we want to do with things, of course it helps to describe what the things are and what can even be done with them.
These descriptions of the entire relevant set of types within a domain are called data schemas.
Within programming languages they are implicit and generally not easy to separate from code that implements them; it’s a shame, but a fact of life so far.
There are languages that are specifically designed to describe data schemas, like XML schema, JSON schema, UML etc. Some can be used to generate graphical representations of the schemas, or even code that implements them in other languages; some can be used to verify that data complies with the schema automatically. They are very useful, but ultimately a schema is a pure abstraction and often described in tables, pictures, or human language.
The key point about them is that they pin down the terminology and metaphors that are to be used when communicating about a domain. They are never unique – describing a thing never is – so settling on one helps clarity of communication hugely. If you are careful when putting one together, it will also be helpful: it will make sense to most people involved in the conversation.
Sometimes they are also called “formats”, and this is intuitive; I prefer to think of a format as a set of syntactic rules though, and of a schema as an abstract concept that can be implemented in many formats. XML is a format, but you can implement any data schema in XML; and on the other hand, you can implement the same schema in JSON, an SQL database, Python …
Implementation of Data Types
At a machine level, everything is a bit pattern; at a push you can argue that there are only a few types in most assembly languages – smaller integers, bigger integers, and floating point numbers. Some provide a bit of special handling for limited arrays or stacks too. But that’s it, once at the level of assembly, your beautiful data types are all gone.
From that perspective, a data type defines a layout of bytes or words in memory that is applicable when instances of objects of a certain type are created or stored. Taking this into consideration can improve code efficiency significantly – the closer the structure of the data within a type is to something that maps easily onto machine concepts, the more efficiently it can be stored and dealt with.
This creates a grouping of types.
The first group is PODs (Plain Old Data) – these are representable within CPUs without any translation: integers (signed or unsigned), characters (because of the ASCII mapping they are numbers), floating point values, Booleans (0 or 1), and pointers. Languages often add a layer of abstraction on top of PODs in order to optimise speed or storage. For example, to represent a very large integer you need 64 bits, but on a 32 bit machine this is not directly representable – you need two words to store it (a “word” is a somewhat flexible unit of memory size; most often it means as many bytes as you can fit on the bus, so two bytes on a 16 bit machine, four on a 32 bit one, etc.). Similarly, you only need one byte for an ASCII character, but you can fit four of those into a single memory cell on a 32 bit machine; even worse with Booleans, you only need one bit for those. Languages hide the details of this packing and interpretation from you and often provide either a single type for all integers, or different types for short and long integers. You try to use the most appropriate one, and the compiler does what it can to optimise it for the target machine by generating assembly that uses the most efficient storage. Many languages don’t bother with these kinds of optimisations, making them less efficient, but then you often don’t care. Some give you a lot of control over this kind of detail, for when you really, really need it. Generally, it is a matter of discipline – if you can, and have the headspace, and the language provides the feature, you choose the type of POD that will fit your requirements and no larger. When you really care, you choose a language that lets you do this easily (C++ most often, or C).
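In C++, for example, you can pick the exact width of a POD and check what it costs; a small sketch:

    #include <cstdint>
    #include <iostream>

    int main() {
        std::int8_t  tiny  = 100;               // one byte is plenty for this value
        std::int64_t large = 5'000'000'000;     // needs 64 bits
        bool flag = true;                       // conceptually one bit, usually stored as one byte

        std::cout << sizeof(tiny) << " " << sizeof(large) << " " << sizeof(flag) << "\n";
        // typically prints: 1 8 1
    }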
The second group is generally referred to as arrays, but it is really anything that can be stored in RAM as a contiguous block of memory cells; they are all arrays of PODs: vectors of integers or floats, strings, arrays of Booleans.
This group is often considered together with slightly more complex data structures, like lists, maps, queues and so on; there is a set of these that is so common and so well researched that the implementations are taken as standard, taught in school, and are part of the culture. Using them is a necessary part of the toolkit. We will deal with them in a separate section.
The last group is everything else.
Languages that target efficiency provide direct access and special treatment for PODs and arrays as part of the language definition. C and C++ are superb at this. Many languages ignore this and implement all types in the same way, through some internal higher level abstraction, and then translate them internally. They are said to use “boxed types”, it slows them down, a lot. Python does this. Java provides both – it has a boxed and an unboxed version for each POD type, you can pick which to choose and confuse your audience at will.
Modern compilers and interpreters are generally pretty good at figuring this out for you and generating quite efficient representations for even very complex data types; but this is often context dependent and you can optimise it a lot by arranging your code carefully. This is fun, and a responsible thing to think about, a step towards beauty. In isolation though, it can be done quite well without much thinking or any imagination: complex as they may seem, the technicalities are far easier to deal with than semantic idealism. Because we live in times when proficiency is all, the lazy of mind will far too often find a source of self-importance in this sort of exercise.
Enumerations
Often a data type represents an enumerable set of values. Typically small in size, but what’s important is that we can list every single possible value. For example colours, or a set of possible events in a trade lifecycle in an investment bank.
Some languages provide special syntax for specifying these types. In others you typically specify them as names for numbers; for example Purple is 1, Pink is 2 and so on.
Even in the languages where these are supported directly by syntax, compilers eventually always translate them to such a mapping to integers. It is very efficient.
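A minimal C++ sketch; underneath, each name is just an integer:

    enum class Colour { Purple = 1, Pink = 2, Teal = 3 };

    enum class TradeEvent { Booked, Amended, Settled, Cancelled };  // numbered 0, 1, 2, 3 automatically

    int main() {
        Colour c = Colour::Pink;
        int raw = static_cast<int>(c);   // raw == 2; the compiler maps names to numbers
        return raw == 2 ? 0 : 1;
    }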
Pointers, References, and Iterators
A pointer is an abstraction of an address of a cell in RAM. When no operating system is present, assembly code is littered with actual values of memory addresses used to refer to data or code (both are stored in RAM). We need an abstraction for this because of how operating systems work. Because operating systems manage internally where program instructions and data will be loaded, and because of heap memory allocation (including dealing with virtual memory), we don’t really know where instructions or data will end up during the execution of a program. We need pointers to keep track of this. The data content of a pointer itself is just a number, a value that represents a memory address. However, a pointer type is not the same as a number type.
The first difference is that in most cases, programming languages provide distinct types depending on the type of data that the pointer refers to within the current context; a pointer to an integer is not the same type as a pointer to a string. This helps maintain the link to abstractions within the context in which the code is written, and compilers use this extensively to check code for mistakes.
The other difference is arithmetic: adding one to an integer means getting the next number; adding one to a pointer means making it point to the next element, which is not the same thing. A small integer may be represented by (for example) 16 bits, so it occupies two bytes, and incrementing a pointer to it actually adds two to the underlying number. It can get even more complicated because of memory fragmentation or virtual memory. Because RAM is a busy place, blocks of it get allocated and deallocated all the time; in the end the layout of free memory vs occupied memory is terribly patchy and unpredictable. This is memory fragmentation. Both compilers and operating systems hide this from you. They try to allocate memory in contiguous blocks of the size you need, but if they can’t, they sometimes break it into smaller pieces and then inject code into your code that will correctly move pointer values from one cell to the next one. Iterating through blocks of memory is the most common thing computers do (more on this later), and because of this all CPUs implement, in hardware, very fast instructions to increment or decrement a number by one. C introduced special operators, ++ and --, to invoke these instructions; they are still very popular in many languages. Similarly, adding a constant value to a pointer is done all the time (to hop to an indexed cell in an array); the operators += and -= deal with this.
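A small C++ illustration of that pointer arithmetic; the exact addresses differ on every run, the point is the step size:

    #include <cstdint>
    #include <iostream>

    int main() {
        std::int16_t values[4] = {10, 20, 30, 40};  // 16-bit integers, two bytes each

        std::int16_t* p = &values[0];
        std::cout << *p << "\n";   // 10
        ++p;                       // moves by sizeof(std::int16_t) == 2 bytes
        std::cout << *p << "\n";   // 20
        p += 2;                    // hop two elements further, i.e. four more bytes
        std::cout << *p << "\n";   // 40
    }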
When dealing with pointers, it is very helpful to have a value that denotes an invalid pointer. It will become clearer when we talk about data structures, but when working with pointers a lot, it is important to know whether the address a pointer holds is sensible or not. Zero is a perfect value for it: every RAM has a zero cell, so the value exists on every machine, and Boolean values are pretty much always encoded as zero for false, anything else for true. Then the meaning of a “null” pointer is also “false”, as well as “nothing”. It feels right, and it caters for idioms in many languages where it is possible to treat a pointer as a Boolean and say things like “if pointer”, meaning “if the pointer makes sense”. Succinct nihilism. This is called a null pointer. Languages provide thin abstractions on top, keywords like “null”, “nullptr”, or “none”, to distance you from the hardware and to allow for different implementations; but the concept was born out of the use of pointers and memory addresses.
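The idiom, sketched in C++ (find_even is a made-up helper, just for illustration):

    #include <iostream>

    // Return a pointer to the first even number, or nullptr when there is none.
    int* find_even(int* begin, int* end) {
        for (int* p = begin; p != end; ++p)
            if (*p % 2 == 0) return p;
        return nullptr;                              // "nothing found", also reads as false
    }

    int main() {
        int data[3] = {1, 3, 8};
        if (int* hit = find_even(data, data + 3))    // "if the pointer makes sense"
            std::cout << "found " << *hit << "\n";
        else
            std::cout << "nothing\n";
    }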
As abstractions go, this is all very low level. It is all about the storage of data in digital computers, and has nothing to do with the meaning of algorithms or their expression. In fact, having to deal with pointers forces you to jump between contexts – the one you are actually working in and the one concerned purely with the operation of the machine. They almost always pollute the relevant context. So why do we even have them?
Because of efficiency. We are not talking subtle improvements here either; these considerations too often make the difference between usable and unusable applications. Data needs to be shifted between sub-contexts all the time (most commonly between functions, but not always). Because humans deal with the complexity of algorithms by breaking contexts down into smaller ones, this passing of data around ends up being by far the most common single task performed. The natural approach is to simply create a new copy of the data for the new context; but heap memory allocation is complex, and copying data from one place in RAM to another often means sending it down the bus to the CPU and from there back to RAM, or even to disk. This uses up energy and time that have nothing to do with our algorithms, not to mention doubling the amount of memory used. In the end, we are talking about orders of magnitude savings in energy, speed, and memory if, instead of passing around the entire block of data that we need, we pass around a pointer to it.
The two methods are called passing by value and passing by reference. In most cases, passing by reference actually means passing a pointer around, even though many languages do this under wraps.
References
A reference is a more general term, though. It is an attempt to abstract pointers away. It is only partially successful – you still need to think about the fact that this is far more efficient than passing by value, but you don’t need to worry about the mechanics of it. In most cases you can think of a reference as simply a new name given to the same data in a new context. The compiler then adds the assembly code to your code that deals with the pointer arithmetic.
It is a relatively new concept. C, being old, does not implement the concept of a reference explicitly; you only ever deal with pointers directly. It’s not so bad really, but you need a lot of training not to be distracted by the constant context switching you have to do while writing or reading code. C++ is an extension of C, so it still supports pointers, but it adds references too. Java is intended to run on any architecture you can think of, and even across machines. There you cannot use pointers, because data could be stored in the memory of different boxes and a memory address is not enough to find it (there is potentially more than one RAM). Java has no pointers at all, just references, and they are largely implicit: there is a set of rules in Java about which things are passed by value and which by reference, and it just happens. If you want to pass something by value, you have to (quite manually) create a copy yourself. You can still create references explicitly though. Python goes a step further: it simplifies the rules about what is passed how, and you never see any of this. You still have to remember it though, both because of efficiency and because of constness.
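A C++ sketch of the two ways of handing the same data to a new context:

    #include <iostream>
    #include <vector>

    // Pass by value: the whole vector is copied into this context.
    double sum_by_value(std::vector<double> samples) {
        double total = 0;
        for (double s : samples) total += s;
        return total;
    }

    // Pass by reference: only a reference travels; the data stays where it is.
    double sum_by_reference(const std::vector<double>& samples) {
        double total = 0;
        for (double s : samples) total += s;
        return total;
    }

    int main() {
        std::vector<double> prices(1'000'000, 1.0);
        // Same answer both times; the first call copies a million doubles to get it.
        std::cout << sum_by_value(prices) << " " << sum_by_reference(prices) << "\n";
    }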
Constness
Passing data by reference means that the new context has access to the actual data from the original context, which in turn means that it can change it. Life is a lot easier if this doesn’t happen; it is hard to keep track of what got changed where when there is a lot going on. Mostly you don’t really want this to happen anyway: you split contexts by the parts of the computation you want them to do, you are interested in the result of that computation, and in most cases the data used as input is not part of the result. This is why such changes to data are known as side-effects. Assholes smirk when they spot them in code, but like everything else, they are not always bad, just hard to think about.
Most languages introduce a notion of constness because of this. You can declare variables or function parameters as “const” to make this explicit. Compilers can then enforce it; writing code that attempts to change a constant value is an error. We need this because we forget, and we want to forget, one context while dealing with another. In addition, knowing that something is constant allows compilers to make many optimisations. Semantically, making things const simplifies thinking because it clearly separates the inputs of an algorithm from its outputs. It is not always feasible or desired; there are cases where you actually want your data to keep track of the changes it’s been through. A whole lot of algorithms work like this, and there is a beautiful part of the theory of computation that studies these and abstracts them as finite state machines. Regular expressions are implemented like this, and so are most parsers of text, including compilers; all are examples of finite state machines.
When it can be done though, constness does make things a lot easier to work with and think about, and it lets compilers optimise a lot more. To be able to forget about it better, a good rule of thumb, in languages that allow you to specify constness yourself, is to always start by declaring everything as a constant. Then you only remove constness as a conscious decision; it sticks out in code like a sore thumb, and that makes it harder to forget. I try to remember to add a comment explaining why I did it, whenever it is not completely obvious why something is not constant.
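The rule of thumb, sketched in C++: everything starts out const, and anything mutable is a visible, deliberate exception (scale_and_sum is a made-up example function):

    #include <vector>

    double scale_and_sum(const std::vector<double>& values, const double factor) {
        // total is deliberately NOT const: accumulating is the whole point of it.
        double total = 0.0;
        for (const double v : values) {
            total += v * factor;
        }
        return total;
    }

    int main() {
        const std::vector<double> cashflows = {100.0, 250.0, -40.0};
        const double fx = 1.1;
        return scale_and_sum(cashflows, fx) > 0 ? 0 : 1;
    }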
There is also often deeper meaning implied by constness; I once implemented a beautiful algorithm to deduce if a traded product payoff depends on intermediate time points or just the values observed on the settlement date by exploring the constness of values in the code that describes it.
Some compilers do this sort of analysis too; they use graph based algorithms to deduce whether things in your code are constant even if you didn’t declare them to be, and then apply their optimisations. In fact, some languages dispense with the concept of constness entirely and rely on this sort of analysis (e.g. Python, all functional languages). It makes code semantics closer to storytelling about the algorithm (because constness is ultimately an artefact of either hardware implementation or personal discipline). You do give up the compiler’s help when you don’t want your data to change, and this does make things harder sometimes; you forget, and bugs creep in because your data changed when you didn’t expect it to. But your code is prettier to read.
Iterators
Dealing with pointers directly in early languages introduced the idea of traversing data easily by simply using the ++ and -- operators: to get to the next element in an array, all you need to do is increment the pointer to the current one. Many idioms in programming languages were developed on the back of this. Iterators are an abstraction of this. Most languages provide either built-in iterators or a way to implement your own. These are objects that you “point” at a data structure; you can then access the contained data through the iterator and traverse the container with some version of a “next” or “previous” operator. They are very standard, and further abstractions are often built on top of them.
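A small C++ sketch; the iterator is “pointed” at the container and advanced with ++, just like a pointer, without you knowing how the list is laid out in memory:

    #include <iostream>
    #include <list>

    int main() {
        std::list<int> values = {5, 10, 15};

        for (std::list<int>::iterator it = values.begin(); it != values.end(); ++it) {
            std::cout << *it << " ";   // prints: 5 10 15
        }
        std::cout << "\n";
    }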
Are pointers bad for you?
I intended to make the section on pointers very short, but while writing it I realised just how much they affect the storytelling power of programming languages; and I had to explain why and how and what is the best we can do about it.
The most interesting part is: why didn’t I realise quite how bad it is before I even started? Because it is not actually that bad. It breaks the purity of languages and adds additional demands on imagination, but all it really does is make it clear that the context of what you are modelling with code does not exist in isolation. It extends the context to include all the relevant aspects of it. You never really implement just trade pricing, you implement trade pricing on a digital computer using algorithms.
Much like good storytellers would extend the context of the story to include the concerns about audience, occasion, or medium they are using. The context is not made impure, it is enriched. All is good.
Commonly Used Data Structures
Data structures are a further abstraction of data types. Historically they came first though, and they were studied in great detail, taking into consideration the efficiency of implementation. They were early patterns noticed in coding, and most were first implemented in assembly. They generalise techniques for organising data in the memory of a typical digital computer, and they are part of the lore – most algorithms will be described in terms of some of them. There are good reasons for it: on the one hand, choosing these wisely will allow you to exercise greater control over the efficiency of your algorithms, and on the other, through experience we learned that they are very good building blocks for very many use cases of abstractions. Partly, this is because these choices often do not actually affect the higher order abstractions in your mind – a bag of apples is a bag of apples; so you might as well pick the shape of bag that is easiest to use and has the most efficient representation. Partly it is because they in fact model abstractions that we seem to use anyway when thinking about piles of things; like pure ideals of structure.
There is a lot written about it, much of it very dense because it is based on research, or because people try to explain it in terms of pure technology, and that is not really possible; their meaning is why we use them. There is a very famous book “Algorithms + Data Structures = Programs”, by Niklaus Wirth (considered an influential genius, but I find him very boring and his work derivative). I gave up reading the book, and those who read it didn’t have much to say about the content. The title of the book though is a gem. The realisation, while not his own, is enlightening for everyone who writes code, and the formulation is succinct, the whole thing is much like a minimalist Zen koan. The title itself was far more influential than any of his other work, or to be fair, work of most others. So, while I see him as a one-hit wonder, let’s give him credit for that one hit: it was really good, and it is a very good thing to keep in mind whenever you start designing your programs.
Data structures are, strictly speaking, abstractions of a higher order than data types, because to be fully specified the contained types need to be specified (an array of integers is not the same type as an array of strings).
Literature often talks about the efficiency of operations performed upon containers in terms of “O” notation. This is nomenclature for expressing how the speed of an operation changes with the size of the container. For example, arrays provide constant access time: regardless of the size of the array, accessing the i-th element always takes the same time. Changing the size of the array, on the other hand, takes time proportional to n, the size of the array, because the contents have to be copied. In a list, on the other hand, access to the i-th element grows linearly with the size of the list on average, but changing the size takes constant time. It very quickly gets tedious; but it is used, for this and for algorithms too (the efficiency of algorithms is measured in the same way, by how much slower they become as the input size grows).
Arrays (aka Vectors)
These are indexable containers of things. They are a kind of random access container. You can always write or read any member of it that you like, as long as you know how far it is from the beginning – its index.
As long as you know how long they are before you start using them, they can be implemented to be incredibly efficient. You allocate a contiguous block of RAM the same size as your array and you offset every index from the first allocated cell. Then reading and writing data in an array is a simple matter of reading a known place in memory; the only cost is a single addition (first cell + index). Iterating over them is even faster: start from the first cell and keep adding one to the address. It’s all just pointer arithmetic in the end, and computers are built to make that as fast as possible. They also only use as much memory as you need; nothing extra is required.
The trouble is, if you realise during execution that you need more room in your array, chances are there is something else already using the RAM cells behind your last cell. You then have to find a new place in RAM for the whole thing at the new size (and RAM gets very fragmented, it ain’t easy fitting things in), then copy the stuff that was already there from the old place to the new place, and then release the old memory. This is called re-allocation and it is slow(ish).
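A small C++ sketch of avoiding re-allocation when you can guess the size in advance (std::vector is the standard growable array):

    #include <vector>

    int main() {
        std::vector<int> samples;
        samples.reserve(1000);        // one allocation up front, because we can guess the size

        for (int i = 0; i < 1000; ++i) {
            samples.push_back(i);     // no re-allocations while we stay within the reserved room
        }

        samples.push_back(1000);      // may trigger one: find new space, copy everything, free the old block
        return 0;
    }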
You would use them when you need random access to things and lots of traversals, and when you can make a good guess about the size in advance, or at least you know you won’t be changing the size often.
Strings
Strings are text, and are always given special consideration, because we need words, a lot.
We process words differently than anything else, and we think about them differently – individual letters of text rarely have a meaning or purpose on their own. Mostly we deal with them as a whole. The only exception is when we need to make sense of them in the code, and we do need to do that when we are parsing programming languages, for example. Even then, it is only the first step – making sense of what we parsed is going to be slow anyway, it’s just hard. We need to be able to store them efficiently and traverse them quickly; they are worth it.
Strings will always be implemented as arrays of characters, and a character is always encoded as a number. The only difference in the implementation between them and any other array is dealing with the length of it.
We need to know the length because even though we think of strings as a whole, machines have to process them one letter at a time, they have no choice. But having to always keep track of the length of text would break the beauty of the abstractions in the text of the code.
What happens in pretty much all languages when you create a string is that they allocate an array to hold it and hold on to the pointer (address in RAM) to where the start of the array is. There are two ways to mark the end of the string.
Many languages (C and C++ included) simply allocate the array one larger than the text you supplied and stick the null character at the end of the text (ASCII code 0). The null character is assumed not to be valid in any context, so all algorithms can iterate from the start until they hit 0.
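A sketch of the idea in C++ (this is essentially what the standard strlen does):

    #include <iostream>

    // Count characters by walking the array until we hit the terminating 0.
    int length(const char* text) {
        int count = 0;
        while (text[count] != '\0') {
            ++count;
        }
        return count;
    }

    int main() {
        const char* word = "hello";         // stored as 'h','e','l','l','o',0
        std::cout << length(word) << "\n";  // prints 5
    }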
While null terminated strings are quite beautiful and flow wonderfully, there are cases when you want to store zeros in a string container (for example encrypted data may contain zero bytes, there are also some text formats on the internet that use two zeros as line separators). This breaks the paradigm, and while it is still possible to deal with it, it’s additional work and you have to rewrite standard functions yourself … it’s a pain and it makes code ugly in those cases.
Another way is to store the length of the string somewhere and provide functions, or even built-in commands in the language, that access this length for you; you mostly never see this, it happens in the background. Languages that use boxed types will have an internal data structure to represent anything anyway, so it’s easy for them – one more field in the record for the length. Other languages allocate a couple of extra bytes and store the length in the bytes before the text starts. Basic does this, so do C# and some other Microsoft technologies. Nice try, but it turns out that in many cases it actually makes code even uglier (because, again, we don’t think of words as having length, we want the meaning).
Stacks
You can only do two things with a stack: push and pop. To add a thing to a stack, you “push” it on the stack. To get a thing from the stack, you “pop” it off the stack. When you pop, you get back the “top” of the stack – the last thing that was pushed; and that thing is then removed. So if you pop again, you get the second last thing that was pushed. You can keep stacking things up, popping them off the stack to your heart’s content. For some reason, all the books explain that it is a LIFO data structure – last-in-first-out, last thing that goes in will be the first thing that comes out. No idea why anyone really bothers with that mnemonic, I never heard it used, or used it, but it’s in every book, taking up space just like it does here.
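In C++, the standard library provides this directly as std::stack; a minimal usage sketch:

    #include <iostream>
    #include <stack>

    int main() {
        std::stack<int> plates;
        plates.push(1);
        plates.push(2);
        plates.push(3);

        while (!plates.empty()) {
            std::cout << plates.top() << " ";  // prints 3 2 1: last in, first out
            plates.pop();
        }
        std::cout << "\n";
    }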
Most of the time, stacks are terribly inconvenient. To use your data, you have to know the order in which it was stored. In the daily life of a programmer, you use it so rarely that when an occasion actually calls for it, it is a happy day; like spotting a mythical beast.
Why do we even bother then, and why is it such a prominent spot here?
Because they are so incredibly useful at assembly level that almost nothing computers do is done without them. Every CPU supports them directly in its machine language. There are entire computation models called stack machines that use nothing but stacks; people have even built CPUs with a different architecture that only work with stacks (they didn’t go very far, but they weren’t bad, just too idiosyncratic). Much of this seeps into daily usage and you use it all the time (the call stack, stack memory, etc.). They are a crucial part of how programming languages are implemented and executed at the machine level, and you need them whenever you start thinking about the execution model of your program, which you do whenever you debug things.
Why is this then?
The answer is dense and at times boring, rarely covered, but so worth it – this is how all functions and most variables are made to work.
Firstly, they are incredibly easy to implement at the machine level. Every CPU has a special register called the “Stack Pointer” (SP). When you want to use a stack, you allocate a piece of RAM and choose a starting point (either the first or the last cell). You store the address of the starting point in SP. Every CPU also implements a number of “general purpose” registers, which are just super fast memory cells on the chip itself that can be used as variables in assembly code (normally named A, B, C …). Most CPU instructions act on registers (for example, INC increments a register by one).
CPUs also implement, in hardware, instructions PUSH and POP.
PUSH 234235 will:
- Look up SP, and put 234235 in the cell that SP points to
- Increment SP by 1 (so that it points to the next cell)
POP will:
- Look up SP, get the number from the cell that SP points to and put it into register A.
- Decrement SP by 1 (so that it points to the previous cell)
If you really want to push something that is bigger than a single memory cell onto the stack, you break it down into parts and push it onto the stack in a number of pushes, then you recompose it when you pop it.
That’s it, and it’s very fast. CPU architects even optimise hardware to make this even faster.
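A minimal sketch of that implementation in C++ terms, with an ordinary array standing in for RAM and an index standing in for the SP register (no overflow checks, just the shape of it):

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    std::uint32_t ram[1024];   // a block of "RAM" set aside for the stack
    std::size_t   sp = 0;      // the stack pointer: index of the next free cell

    void push(std::uint32_t value) {
        ram[sp] = value;   // put the value in the cell SP points to
        ++sp;              // move SP to the next cell
    }

    std::uint32_t pop() {
        --sp;              // move SP back to the last used cell
        return ram[sp];    // read the value out
    }

    int main() {
        push(234235);
        push(42);
        std::cout << pop() << " " << pop() << "\n";  // prints 42 234235
    }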
Secondly, the implementation described above is so standard that you can cheat, to make it more convenient and faster to use. Even though the stack abstraction doesn’t allow you to access anything other than the top of the stack, in the implementation it is just a block of memory, and you always have a reference point into it. So if you know you want the thing that was pushed onto the stack three pushes ago, you just get the thing at address SP-3. Kind of like an array, but with a moving start element. This is most often how local variables in functions are stored (hence the name “stack allocation”) – more details later. It ain’t pretty or easy to read and write code like this, but it makes things execute much faster. On the other hand, very few people ever read or write assembly code, and there are always just enough autotelic obsessives who sacrifice their lives writing compilers and low level machine code that the rest of us then use blissfully. In the community they have a saintly status, in recognition of both their sacrifice and their isolation; doing this for a living does things to people that make it hard for them to relate to other humans.
Thirdly, simple as it is, it is an abstraction layer over RAM: your assembly code accesses memory through SP, PUSH and POP, and it doesn’t need to know where all this is exactly in RAM. This is a lot easier than writing code that always has to keep track of individual RAM cells. To be clear, you can use general purpose registers and whatnot to implement other abstractions, and, of course, people do (most data structures discussed here precede programming languages); but this one comes pre-packaged and optimised, so it’s used all the time.
Finally, whenever there are many changes of context that happen in an ordered way, and then again in the reverse order, stacks are a perfect data structure to use. Functions are an example of this: one function calls another, which calls a third one, which then returns the value to the second one, which then does something with that value and returns a value back to the first one, which does something to this return value … and this can easily go hundreds deep. We will talk about this more when we talk about functions, but storing the context of the current function on a stack while calling the next one and then popping it back on return is a wonderful way to do it. This is how functions are (almost) always implemented at machine code level, for any programming language.
In summary: stacks are the only abstraction you can always expect to see at the assembly level out of the box, they are insanely fast, and they are the go-to solution for implementing variables and functions in all programming languages. What’s not to like?
Lists (AKA Linked Lists)
Lists are, like arrays, homogeneous linear data containers (homogeneous means that all elements are of the same type). The difference is that they do not provide random access.
Every element in a list is contained in an intermediate data structure, a simple record that contains the element data itself, and a pointer (or a reference) to the next record. This way you chain as many as you like, and make the last one’s pointer a null one, to mark the end.
You can traverse this easily: start from the first one; to get to the next one, follow the pointer in the record, and so on until the pointer in the record is null (you have reached the end). Adding a new record is also easy and relatively fast: create a new record somewhere in memory, make its pointer null, and then just change the pointer in your list’s last record to point to this new record. It’s even easier to prepend things: create the new record and make it point to the first element in the list; the new one is now the first, and nothing else changed. You can even insert in the middle: create a new record, then slide down your list to where you want to insert it and arrange the pointers accordingly. Dropping too: find the record you want to drop, free its memory, and make the pointer in the record before it point to the record after it. Smaller records are easier to fit into RAM, and for all these operations you never need to move data, just change the values of a couple of pointers.
To access the n-th element though, you have to start from the first one and count your hops. The longer the list, the longer your average access time. Also, accessing elements requires a few operations for each step.
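A minimal C++ sketch of the record and of the two cheapest operations, prepending and traversal (memory clean-up is left out to keep the shape visible):

    #include <iostream>

    struct Node {
        int   value;
        Node* next;   // pointer to the next record; nullptr marks the end
    };

    // Prepend: the new record points at the old head and becomes the new head.
    Node* prepend(Node* head, int value) {
        return new Node{value, head};
    }

    int main() {
        Node* head = nullptr;              // an empty list
        head = prepend(head, 3);
        head = prepend(head, 2);
        head = prepend(head, 1);

        // Traversal: follow the pointers until you hit the null one.
        for (Node* p = head; p != nullptr; p = p->next) {
            std::cout << p->value << " ";  // prints 1 2 3
        }
        std::cout << "\n";
    }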
All that being said, traversing data is so much more frequent than everything else that there are languages that implement no other data structure at all (LISP), and those where lists play such a central role that you rarely touch anything else (Python).
There is also specialised terminology because of this: the first element of the list is called the head, and everything other than the first element the tail. Because a list is like a snake, funny haha. Ok, ok, it’s because we needed names for these parts, because you can write algorithms as recursive functions using lists: to process data, you pass a list to a function, apply the processing to the head, then invoke the same function on the tail, until the tail is empty.
It turns out you can write all algorithms using just this and a couple of basic arithmetic operations. LISP doesn’t have any constructs for loops, just recursion like this. Neither do any of the pure functional programming languages.
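Sketched in C++ (not how you would normally write C++, but it shows the head/tail shape of the recursion):

    struct Node { int value; Node* next; };

    // Sum a list recursively: process the head, then call the same function on the tail.
    int sum(const Node* list) {
        if (list == nullptr) return 0;          // empty tail: nothing left to do
        return list->value + sum(list->next);   // head + (the same function applied to the tail)
    }

    int main() {
        Node c{3, nullptr}, b{2, &c}, a{1, &b};  // the list 1 -> 2 -> 3
        return sum(&a) == 6 ? 0 : 1;
    }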
An important variant is the doubly linked list: the same thing, but with two pointers per record, one pointing to the previous element and one to the next. You can then quickly traverse it back and forth.
You would use them when you are dealing with data that you will mostly traverse or pass around as a whole, and when you know that the size of the data will be changing a lot. Or when it makes things prettier. There is elegance in the simplicity of lists, and many languages provide special syntax for dealing with them; they become an important part of idiomatic expression in those. Python does this a lot.
Tuples
Tuples are collections of elements of different types, of known size. The types of the elements in a tuple are fixed.
A tuple of size two is known as a pair. A tuple of size n is known as an n-tuple.
The type of a tuple is the tuple of the types of its elements. For example, a 3-tuple that contains a string, a floating point number, and an integer has the type (string, float, integer). This is not the same type as (float, string, integer).
You access elements of a tuple by position – first, second, third, etc., like you would in a vector; but you have to know the type of each element. Because the type of a tuple depends on the types of its members, you cannot change the size of a tuple; it would become a different kind of thing.
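In C++ terms, a small sketch using std::tuple (the trade-like values are made up for illustration):

    #include <string>
    #include <tuple>

    int main() {
        // The type of this tuple is (string, double, int); the order is part of the type.
        std::tuple<std::string, double, int> position{"AAPL", 189.5, 100};

        std::string ticker = std::get<0>(position);  // access is by position...
        double      price  = std::get<1>(position);  // ...and you must know each element's type
        int         amount = std::get<2>(position);

        return ticker == "AAPL" && amount == 100 && price > 0 ? 0 : 1;
    }

std::tuple is just C++’s particular way of doing it; other languages provide their own.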
They are high level abstractions, there aren’t any simple or universal ways to implement them. Languages that provide them all deal with it in different ways.
This makes them hard to generalise, and while they are useful, they are not used very often. Generally there are better abstractions to use instead: ones where you name the elements of different types instead of relying on their order, and where you can provide more of your own semantics and operations to deal with them. These are classes and structures; we will talk about them later.
Tuples are far more useful for thinking about things than for talking about things, or implementing things. When they are used, it is most often for convenience, meaning for lack of time or will to do something better. The only situations where they are probably justified are as a means of implementing other, higher order abstractions where truly all you care about is types and their relative position; I am not convinced, but that is where they pop up the most.
Trees and Graphs
Graphs are as general as data structure abstractions get. They are just some nodes connected by lines (edges). Data can be stored in the nodes or in the edges; it doesn’t matter (one can always be transformed into the other). All other data structures can be viewed as graphs with some specific restrictions. They are studied in detail; there is an entire branch of mathematics that deals with graphs (erm, Graph Theory). Phenomenally, even though they are so generic that you can use them to represent anything that contains concepts or data somehow related to each other, we can still reason about them a great deal, establish lots of information about their properties, and many algorithms deal with them. They are incredibly useful whenever there are a lot of data points connected in complex networks. For example, Excel is as efficient as it is in large part because it keeps track of the dependencies between cells in a large graph (a dependency graph) and has ways to traverse it quickly to only evaluate those cells whose values would change. All GPS software represents maps by storing locations in nodes and connecting them with “weighted” edges (each edge has a value associated with it, distance in this case). Compilers and build systems like make and ant use dependency graphs to track dependencies between files and modules, so that they don’t have to re-compile thousands of files in a large project when you change one line in one of them.
Trees
Trees are a subset of graphs; the nicer ones amongst all graphs. The key properties are that they have a single root node, and no cycles exist (you can’t start from one node, follow the links/edges and end up at the same node again). This means that there will be nodes at the end of the tree that are not connected to anything other than their parent node. These are called leaf nodes. Because they have this simpler structure, they are used a lot more than graphs.
Trees are also simple to implement, much like a linked list, except that every node can point to a number of nodes, not just one.
Traversing Trees
The reason the single root and the absence of cycles matter is that you can freely traverse a tree, visit each node in turn, and know you have visited every one and didn’t end up chasing your tail in a loop. Chasing your tail in a loop is really bad – so bad that this was at the core of Turing’s most important revelation: the halting problem. We’ll talk about that later, but the bottom line is – you, provably, cannot develop an algorithm that will tell you, in general, whether you are running in a loop or not; you are destined to end up just running around forever. It is one of the core results in the theory of computation, and a likely reason why general AI ain’t gonna work.
With trees, you just don’t get that, phew. There are two ways you can traverse a tree: depth first and breadth first.
Here’s a stolen picture about it:
Breadth first means:
- Start from the top.
- Go down one level and visit all the nodes at that level, left to right.
- Go to the next level, repeat.
- Keep doing it till you run out of nodes.
Depth first means:
- Start from the top.
- Go all the way down the leftmost path you can find. You’ll reach the leftmost leaf.
- There are different conventions, but most often you actually assume your traversal starts at this first leaf, you will visit the parents again anyway.
- Then go one level up, and down the next edge from the node you’re at; back up, back down, left to right, going all the way down each time, until you have visited all the leaves you can reach from that level.
- Then go back up and repeat, and again and again, until you run out of nodes.
Breadth first is easier to explain, but because you naturally implement trees as a node that contains pointers to its child nodes, depth first is much easier to implement; too early to talk details, but a few lines of code in a recursive function.
Flattening a tree is a term sometimes used to mean doing a depth first traversal, but along the way storing each node in a list as you visit it. The list will then contain the nodes in the order in which they were visited.
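A minimal C++ sketch of a depth first traversal that flattens a tree into a list as it goes; this version visits a node before its children, the other common convention simply records the node after the loop:

    #include <iostream>
    #include <vector>

    struct Node {
        int value;
        std::vector<Node*> children;   // a node simply holds pointers to its children
    };

    // Depth first: visit a node, then recurse into each of its children in turn.
    void flatten(const Node* node, std::vector<int>& out) {
        if (node == nullptr) return;
        out.push_back(node->value);
        for (const Node* child : node->children) {
            flatten(child, out);
        }
    }

    int main() {
        Node d{4, {}}, e{5, {}}, c{3, {}};
        Node b{2, {&d, &e}};
        Node a{1, {&b, &c}};   // a is the root

        std::vector<int> visited;
        flatten(&a, visited);
        for (int v : visited) std::cout << v << " ";  // prints 1 2 4 5 3
        std::cout << "\n";
    }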
Why would you want to do any of this? It will become much clearer once we talk about some special trees.
Types of Trees
There are many subtypes of trees and associated algorithms, and many other data structures are implemented as trees with certain restrictions.
The most commonly used type of a tree is a binary tree, a tree in which each node has at most two sub-nodes.
A heap is a binary tree where each node contains a value that is less than or equal to the values of its child nodes. There are efficient algorithms describing how to add a new node to a heap (or remove one from it) so that this condition always stays true.
Heaps are used to implement priority queues: take a list of tasks, assign a priority to each one, and add them to a heap one by one. The condition above guarantees that the root always holds the smallest value, i.e. the most urgent task if smaller numbers mean higher priority. To execute the tasks in order, you repeatedly take the root and then re-arrange the tree so that the condition holds again; the standard algorithms for doing this are fast, they only touch a single path from the root down. Voila. Also, in general there is no faster way to do this, someone proved it.
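In C++ this comes ready-made as std::priority_queue, which is backed by a heap; a small sketch where smaller numbers mean more urgent tasks:

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    int main() {
        // std::greater<int> turns it into a min-heap: the smallest number sits at the root.
        std::priority_queue<int, std::vector<int>, std::greater<int>> tasks;
        tasks.push(3);
        tasks.push(1);
        tasks.push(2);

        while (!tasks.empty()) {
            std::cout << tasks.top() << " ";  // always the root: prints 1 2 3
            tasks.pop();                      // removing the root re-arranges the heap
        }
        std::cout << "\n";
    }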
A binary search tree adds another restriction: nodes also have to be sorted left to right. So the leftmost leaf is guaranteed to be the node with the smallest value in the tree. Again, there are efficient algorithms for building such trees. Package data into one of these and then flatten the tree, and you have sorted your data (again, it can be done with less memory, but not faster, and this way is easy). Also, if you are looking for a particular value, you don’t have to visit everything: at each node you compare the value you are looking for with the node’s value and go left or right accordingly, stopping when you find it or run out of nodes. This is equivalent to the binary search (aka bi-section search) algorithm. Our first example of a beautiful fact: data structures can, in a way, embody algorithms, or at least some of their properties (Algorithms + Data Structures = Programs, yeah!)
Red-black trees are binary search trees, but the algorithms to build them also do re-balancing; these trees are sorted and balanced. Balanced means that you don’t end up with one branch 1000 levels deep and the other one two levels deep. These are nicely flat and boring. It helps with a binary search – you won’t have to end up going down a thousand levels and back up only to realise the value you were looking for is the first one on the next branch. (There is one case when you would prefer to go a thousand levels deep, Cohen has a song about it.) Maps, aka dictionaries, are often implemented like this.
There’s more, but these are by far the most important ones (other than DAG, but we’ll look at that separately). There are also all sorts of theorems about them, algorithms to transform any tree into a binary tree, ways to check if they are equivalent and so on. They are very generic too, for example, a list is a special case of a tree (unary tree?). Most things are trees.
They are the subject of almost an entire massive tome of the most famous work in computer science, “The Art of Computer Programming”, by Donald Knuth. This is a work planned in seven tomes; he started writing it in 1962, published Volume 4B in 2022, and is still working on the rest. Proper weirdo, but sweet and very smart. His other work includes TeX (the foundation LaTeX is built on), lots of curious mathematics, and works on theology (his drug of choice is his Christianity, he is a devout Lutheran). He has an entire chapter on whether you should draw trees with the root node at the bottom or at the top. He also documented most of what we know about trees and how we still use them today.
DAGs
There is a related notion to trees, the DAG (Directed Acyclic Graph). These are like trees, except that every edge has a direction and a node is allowed to have more than one parent; what you keep is the guarantee that there are no cycles. The direction is often not even implemented, just assumed. DAGs are also often called “dependency graphs” and are used all over the place. They don’t have conditions on the priority and sorting of nodes. You can add them, but it’s rarely done. The trick with them is that they are used to represent dependencies of tasks – each parent depends on its children (hence the direction, from the task that needs something to the tasks it needs). Once you arrange your tasks into one of these, you traverse it depth first and execute tasks in the order of traversal; this way you are always executing children tasks before parent tasks, and therefore in the order of dependency.
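A tiny Python sketch of the idea, with a made-up build graph (each task mapped to the tasks it needs built first):

depends_on = {
    "app": ["parser", "ui"],
    "parser": ["lexer"],
    "ui": ["lexer"],
    "lexer": [],
}

def run(task, done=None):
    if done is None:
        done = set()
    if task in done:               # a node can be reached twice; only run it once
        return done
    for dependency in depends_on[task]:
        run(dependency, done)      # children (prerequisites) first
    print("building", task)
    done.add(task)
    return done

run("app")   # lexer, parser, ui, app – always dependencies before dependants

Note that “lexer” has two parents; that is what makes this a DAG rather than a tree, and it still only gets built once.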
A common example is compilation: in a large library there will be thousands of files that depend on each other. Build tools like make or Ant first build a dependency graph of files, then compile them in that order.
Another example is Excel – it keeps track of the functions you type in cells and how they introduce dependencies on other cells. Then when it evaluates a sheet, it looks at which cells changed, and only evaluates those cells and the cells that depend on them. It makes it orders of magnitude faster.
Dependency Graphs in Banks
DAGs are also used in large American banks. They build entire languages and frameworks to make use of the same idea that Excel implements.
Here, each function in a program is taken to represent a node on a graph. Functions are restricted in their behaviour in that they have to be “referentially transparent” – this means that if you call them twice with the same parameters, you are guaranteed to get the same result.
Then within a specially built (or adapted) compiler, a dependency graph is built, connecting function calls to each other: when a function calls another function, an edge is added to connect them, with a direction from the first one to the second.
It is a proven result in graph theory that you can condense any directed graph into a DAG – you collapse each cycle (each strongly connected component) into a single node – and there is an efficient algorithm for it.
What it means is that you can, using the above method, represent any computer program on a DAG; at the end of the day, all programs can be thought of as functions calling other functions, we know this from lambda calculus and functional programming languages.
Once you have this structure in place, you also build a large table that you populate with the result whenever a function is executed with a certain set of arguments. Next time you call the same function with the same arguments, you don’t execute the function, you just look up the value in the table. This is called memoization, you memoize function calls.
When you put memoization together with the dependency graph, you have the ability to execute the same code over and over again, applying different inputs to it, and only ever executing those functions that are affected by the change in the input; the rest of them are looked up in the table.
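Python gives you memoization for free from the standard library; a minimal sketch, with a made-up pricing function standing in for something expensive and referentially transparent:

from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)           # memoize: remember the result per set of arguments
def price(trade_id, quote):
    global calls
    calls += 1                     # stand-in for an expensive, referentially transparent calculation
    return trade_id * quote

price(1, 100.0)
price(2, 100.0)
price(1, 100.0)                    # same arguments: looked up in the table, not executed
print(calls)                       # 2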
In financial risk applications, there are, daily, many thousands of runs like this for which only one value in many thousands of inputs changes (when you shock a single quote). This can mean skipping millions of function executions – models don’t get recalibrated if the quotes they use are changed, products don’t get repriced and so on. And all this is built in: you don’t have to do anything special for it, once you write the code for your risk calculation within this system, it looks like any other code but you get this massive optimisation for free.
Or so they say.
Billions upon billions of dollars have been spent on implementing these systems; it is very hard to do.
Firstly, for the dependency graph to work, referential transparency is essential (because otherwise memoization doesn’t work). But that means that all data flowing into the graph must be strictly controlled, we have to know when it’s changed.
In a business as data driven as financial investments, and with millions of lines of legacy code that was developed over decades to deliver quick solutions, in various languages with various technologies, it is impossible to guarantee that someone didn’t write a function 20 years ago that is still used and that loads data internally from a file, or a database, or whatnot. This is a much stricter requirement than is usual, and it means that huge swathes of code have to be rewritten – everything that reads data has to somehow be controlled, and in practice it means that it has to be rewritten in the new language.
In addition, even then, the data flowing in needs to be tracked for changes, and comparing the values bit by bit is prohibitively slow (by the time you’re done, most calculations would have been done anyway). So where people land is that they have to have special data sources, tightly integrated into the language where they can quickly check data for changes. Bi-temporal databases were either invented or highly developed for this specifically. And this becomes the only valid and approved data source. The only one. Has to cater for all speed, all volume, all data. The architecture of it becomes insanely complicated.
Then, if you even manage to do all this, the dependency graphs themselves end up being so massive, memory explodes. Many compromises have to be made that in turn compromise the simple ideas that the execution model is based on; this makes code horribly unreadable and difficult to work with, because you effectively end up constantly switching paradigms.
Finally, the execution model is so different from what you usually see, that debugging it takes a whole new set of skills and tools; and the tools are never developed for it (it’s all already super expensive). This makes developers hate them, and often users too (because the code ends up being slow); not to mention all the people who either built or invested their careers into building, the code that is being replaced by this new stuff. The cost of an unhappy workforce is immense, in money, code quality, and karma.
In the end, a neat and elegant idea, much interesting stuff got developed along the way; but the size and cost of these projects means they are always horribly burdened by a brutal power struggle and the required amount of planning never happens.
Dictionaries/Maps
Dictionaries and maps are synonyms. C++ calls them maps, Python calls them dictionaries; you can tell who learned which first by which term they use when they’re not careful.
They are a generalisation of arrays. In an array, you have a list of data that you can access by index, which is always a non-negative integer. In maps, instead of an index you can have any type.
Think of them as tables with two columns, key and value. A key can be of almost any type, most often it’s a string, but as long as the type satisfies some minimal conditions, it can be used.
In practice there are only two implementations of maps that are ever used. You rarely see this level of implementation detail in daily life, every language has an implementation of maps for you to just use. In Python they are called dictionaries and are very important. Python uses them internally to implement modules, classes, all sorts of things, and you can access these directly to find metadata about a class, for example.
Red-Black Trees
The first one is red-black trees (see above). Data is arranged in a red-black tree structure, and every node contains two values: key and value. The key is used for building the tree. Then when you want to access a value by a key, you walk down the tree, comparing your key with the one in each node to decide whether to go left or right; once you find the node that contains your key, you read the value it contains and you are done.
For the red-black tree to work, the type of the key has to be ordered – this means that it must be possible to compare two values. You need this to know which node in the tree is “greater” than the other one. For numbers, this is easy, for strings too (though often you need to think about case sensitivity). For other types, it may be hard: how do you compare two trees? Or two cars? Mostly it can be done, but if you go down that route you often have to implement these comparisons yourself.
Another thing that needs to be defined is equality: how do you know that you found your key? There has to be a way to check that they are equal. To minimise the requirements, often implementations are happy with just being able to establish which value is smaller; then out of two values, if neither is smaller than the other one, they must be equal.
Most often this works just fine.
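A tiny sketch of that trick in Python – equality derived purely from “less than” (the names are mine):

def equivalent(a, b, less):
    # if neither is smaller than the other, treat them as equal
    return not less(a, b) and not less(b, a)

# e.g. case-insensitive string keys
print(equivalent("Apple", "apple", lambda a, b: a.lower() < b.lower()))   # True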
But often, all these comparisons of values that you have to do end up being much slower than the tree traversal itself. Even with strings, once they get long, you have to compare them character by character, it’s just slow.
Then there is memory – you are storing the keys in the tree only so that you can compare them with other things in order to get to the things you really want.
Most often: who cares, speed doesn’t have to matter that much, and data rarely gets big enough for the memory to matter that much.
But sometimes it does.
Hash Maps
The other common implementation of maps is the hash-map. There are algorithms out there that can convert any bit pattern into a number. In fact, there are infinitely many of those; but there are some that were developed for cryptography that do it very, very cleverly. They convert any bit pattern into a very large number, with four important properties:
- The size of the number is always the same (a certain number of bits).
- Reversing the calculation to go from that number back to the original input would take a really long time (millions of years).
- The probability of getting the same number for two different inputs is miniscule (one in many billions or trillions).
- They are very fast.
These numbers are called hash values, or hashes. They are great for cryptography because of properties 2 and 3.
This is how passwords are stored and transported across the internet. What is stored in databases is the hash value of your password. When you are asked to type your password in, a hash value is generated from it. Then this hash value is sent to wherever the database is, and compared with the stored hash value. If those are the same, you passed. Because of property 2, even if someone intercepts the internet message or hacks the database, they can’t reverse it to get your password. Because of property 3, the chances that someone would accidentally match your input are so small, we just take the risk – it never really happens.
The most used algorithms are SHA-256 and SHA-512; they are part of the SHA-2 family of algorithms, which all produce values between 224 and 512 bits long. These are enormous numbers, but when encoded using base-64 or similar encodings (see the chapter on binary numbers), they end up being relatively small strings (SHA-512 ends up as an 88 character long string, others are much shorter). The smaller the number, the easier it is to crack and the larger the probability of clashes (two different values producing the same hash) – though still something like one in millions.
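A quick sketch in Python using the standard hashlib module (the password is made up; real systems also add a salt and deliberately slow the hashing down, but the shape is the same):

import hashlib

password = "correct horse battery staple"
stored = hashlib.sha256(password.encode("utf-8")).hexdigest()    # this is what the database keeps

attempt = hashlib.sha256("correct horse battery staple".encode("utf-8")).hexdigest()
print(len(stored), stored == attempt)   # 64 hex characters (256 bits), True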
Now – as far as cryptography goes, people have found weaknesses in some of these algorithms in the past that let them be broken; those algorithms get retired and replaced. Also, with the amount of raw computational power at hand with clouds and GPUs, it is becoming viable to break weak inputs by simple probing – try out a very long list of words and word/letter combinations and see if one works. This is why captcha questions are there to annoy us (amongst other reasons). Most interestingly, quantum computers are expected, in theory, to weaken a lot of today’s cryptography at some point. A storm is brewing.
Back to hash-maps though.
Now that we have these smart algorithms, we have a way to convert any pattern of bits, any chunk of data into an (almost) unique number. And these are all of the same size that we can pick (normally we pick something small, but it depends on how much data we are looking at). Numbers are great. You can easily sort them, quickly compare them, you don’t really even need to bother with trees, we can sort numbers real fast in a list or an array. On top of it all, hashing algorithms are so well established, most languages will provide them in one library or another.
To build a hash-map, you make a list, or an array of records. Each record contains a hash-value of a key, and the actual value. To add a key-value pair to your map, you calculate the hash of the key, find a place for it in the list, in order of size, and store it there. To find a value for which you know a key, you compute the hash for the key you are looking for, and look for it in your list. There are very quick ways to search through a sorted list of numbers. If you have lots of data (and with these you often do), the chances of clashes increase, so almost always, instead of storing just the value in your record, you make it an option to store a list of them, together with their keys – then even if you have a clash, there will be only a few of those, so you store those duplicates in the table. When you are looking up a value, you first match the hash, then quickly zoom down a short list to find the exact match for your key.
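A toy sketch of the idea in Python: an array of buckets, Python’s built-in hash() standing in for the fancier algorithms, and clashes handled by keeping the few duplicates in a short list, as described above. A common refinement, which the sketch uses, is to pick the bucket directly from the hash (modulo the number of buckets) rather than keeping a sorted list:

class HashMap:
    def __init__(self, capacity=64):
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # a clash just means a slightly longer bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

m = HashMap()
m.put("name", "Ada")
m.put("job", "first programmer")
print(m.get("name"))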
Turns out, this is often quicker than the red-black tree, because these keys are much faster to compare. Sometimes it’s slower, of course, all those calculations take time. For smaller amounts of data and not very unusual keys, the tree based map is probably a better choice.
Where hash-maps really shine is in special use cases. First, when your keys are of a type that is complicated to compare; computing a hash for it may then be much faster (this is quite unusual, but it does happen). Much more often, you do this when you have lots of data – the searching becomes much faster then.
Distributed Network Caches
And then there are distributed caches. Hash values are (practically) unique, so it doesn’t really matter if instead of having a single table indexed by them, you have many. Instead of a single one you can have a hundred, each containing thousands of values. Then when you are searching for a key, you go through them in some order, one after another, till you find your value. It all still works because the uniqueness of hashes is universal.
If you have a lot of data, more than you can fit in your RAM, you can store each one of these lists into a separate file and load them into RAM one by one. With that much data you can’t avoid using files on the disk, but at least once you get to the file you need, you can find your stuff quickly.
You can get even smarter and figure out ways to find the correct file faster. I don’t know the details, there is much science to it, but for example you can put all the values where the first 100 bits of the hash key are in a certain range in one file, the next range in the second file and so on; then as soon as you see the hash you are looking for, you know which file it is in. Stuff like that, but it gets much smarter.
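The simplest version of that trick, sketched in Python: let the hash itself decide which file (or server) a key lives in, here just by taking a remainder (the constants are made up):

import hashlib

NUM_SHARDS = 16   # say, 16 files or 16 servers

def shard_for(key):
    h = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
    return h % NUM_SHARDS   # every key always lands in the same shard

print(shard_for("some key"))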
You can go wild too – you can have separate parts of your massive table running on different servers, each accessible through some sort of network API (e.g. REST). Then when you need a value, you broadcast a message to all these servers, each quickly looks up their own table and only one comes back with a result. You pay for the wire time, but your lookup is very fast across data of potentially vast size. If you have lots of demand, you can even create many duplicates, who cares, overall the hashes are unique, throw as many servers as you got at it, use the first response you get back.
This is how distributed network caches work.
Most kinds of network caches and file stores use some variant of this. Large databases use it all the time. Providers like Spotify or Netflix depend on it, Google too.
There can be very useful novel applications too. I designed a large-scale risk optimisation architecture based on this for a large Canadian bank (everyone loved it, but never found time to implement it 😦 ). The idea is that a large risk run, EOD, VAR, Stress will contain thousands of repeated pricing requests as a part of bump and reval calculations – these are the basic building blocks of every risk run. I found a way to build a hash value for a complete set of inputs for every one of these computations (not hard), and then the core of the design was to store the result of each of these individual computations in a distributed cache the first time it happens. There are very efficient implementations of such caches available and already used a lot. Then every time you need one of these calculations, you first look up the cache, if it’s there use it, if not compute it and add it to the cache. At scale, when you run millions of these all the time, it would save orders of magnitude in time. We know this because we know how it all works, but it is also proven in practice through all the work on development of dependency graphs in banks (even though this is done at a very different granularity). Crucially, it wouldn’t require changes to the analytics code – it’s a pure technology solution. Once you have the technology in place, it would be easy to play with the granularity of what you store for different use cases (entire risk runs, all the way down to any part of it per trade or per book). It would cost, but not all that much, the infrastructure to support it almost always already exists, all this technology is used extensively anyway. And it would open many doors – once you have this data, you can store it long term, reruns become way cheaper. Quite a lot of inputs don’t even change day to day, find them and reuse the caches over time too.
Ok, enough bragging; the larger point is: these techniques are rarely visible through layers of documentation and marketing, but they are well established, very well developed, and all over the place. You can build amazing things with them.
Sets
Sets are things that contain other things.
I know, I know, it’s obvious, but there’s no better definition, really. It is the starting point of all the mathematical formalism we have. It’s all based on sets, it’s the starting point, and therefore we have nothing to define it in terms of. Just the intuitive understanding that you can put things together somehow. The “axiom of choice” is the most famous reminder of how much rests on such intuitions; it grates people still that all of our mathematical formalism depends on assumptions like this, on innate intuition; but such is life.
Anyway, in algorithms you use them to keep track of the presence of values that may or may not appear in data – user options being set or not, instances of data that you only want to process once etc
The things you can do with them are: add an element, check if an element is there, and the usual set operations (unions, intersections and such).
Most languages provide an implementation to use, and in most cases they are implemented as maps with no values, just keys.
For sets of limited size, there is a very efficient implementation using bits in a word. The values are then often called flags, and sets are then often called bitsets. Checkout the section on binary numbers and bitwise logic to see how this works.
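A minimal sketch in Python (the flag names are made up):

FLAG_VERBOSE = 1 << 0   # bit 0
FLAG_DRY_RUN = 1 << 1   # bit 1
FLAG_FORCE   = 1 << 2   # bit 2

options = FLAG_VERBOSE | FLAG_FORCE   # "add" two elements to the set
if options & FLAG_FORCE:              # membership check
    print("force is on")
options &= ~FLAG_DRY_RUN              # "remove" an element (it wasn't there anyway)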
Queues
Queues are like lists, but you can only add things at the end and only read the first element. So your elements get queued as you add them.
The terminology is like for stacks, you push something into the queue (it gets appended to the end of it), and you pop something from it (read and then remove the front of it).
Much like stacks are uselessly described as LIFO data structures, queues are described as FIFO data structures (first in first out).
They are pretty much indispensable in event driven programming. This is a model of situations where things happen and you have to react to them. Like GUIs for example. Here, things happen in order, and even if you process them much later, you need to process them in order. Queues preserve the order of insertion, that’s all they do.
Most often they are implemented as linked lists.
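In Python the standard library gives you collections.deque, which behaves exactly like this; a tiny sketch with made-up events:

from collections import deque

events = deque()
events.append("button clicked")   # push to the back
events.append("key pressed")
print(events.popleft())           # pop from the front: "button clicked" – order preserved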
Priority Queues
These are actually very different from other queues. Reading them is the same, you can only read one thing at a time; but here it is the thing with the highest priority.
As you add things to a priority queue, you have to declare their priority. Then when they are read, you always read things with the highest priority first.
They are called queues, but they are in fact most often implemented as heaps (see the section on trees above).
Your Own Data Structures
There are many more examples of data structures around, and many variants of the ones described. They get created all the time. The set we described is the core one and will work in most cases you would ever encounter, you should understand them well, and be able to pick amongst them with joy and confidence.
Do keep an open mind though. There will be times when you can do better.
The key point to remember: even though we talked a lot about implementation details, that was because we believe knowledge matters. Implementation is a secondary concern; an abstract data structure is fully defined by how you add data to it and how it is retrieved, or traversed. The efficiency of both additions and access matters, and this will always be a concern when choosing which one to use.
The actual implementation comes second to this. All data structures can be defined in many ways (infinitely many? probably); once you add constraints on the efficiency of access and additions, the options become much reduced (you are limited by how machines work); once you add memory considerations to it all, like use as little as you can, it most often becomes obvious what to do.
This is why understanding the implementations of the common ones helps – decades of work went into making those as optimal as possible; lots of knowledge is encapsulated in what we came up with at the end of all that. Even when you implement your own ones, you will be relying on those results.
People will not like it that you are implementing your own data structures, it means they have to think more when reading your code. Even without looking they will call it “re-inventing the wheel”. Don’t worry about that too much. You own your abstractions, it’s your call. Besides, re-inventing the wheel is not intrinsically bad, it only matters if you are really pressed for time.
Algorithms
An algorithm is a description of how to do something.
Once we put together a good enough description, other people can do it too. If the description is precise enough, even machines can do it. For this to work, the description has to be very precise, because machines don’t have the imagination, or the theory of mind, to figure out what you meant; they just have to be told what to do exactly, without relying on the common understanding of the motivation or context. All these abstractions that we are talking about here, and programming languages themselves, are there to make this precise expression easier.
Not everything can be described by algorithms. How do you love? How do you write a poem that moves people? How do you invent a new thing? Some of this can be mimicked, but the spark of a new idea, or a new way to express beauty or pain, will be missing; and we feel it. We don’t know why. We don’t even really know how the mimicry of it works, or how it is that we can eventually tell that it’s mimicry, even when it is very successful.
Those things, however, that we can describe as a sequence of steps that can be repeated are described by algorithms. There are many well known and researched ones around, because we have a good toolkit of abstractions to be able to apply them over and over again. When you write a program, you often use them, and you come up with your own ones.
Sorting and Searching
Every single course and book on programming will have at least a chapter on this; yet, in my entire career I implemented either of these only a few times, if that, and mostly because I was bored, or doing homework. The algorithms for these are so well understood that they are always available in any language. Even if not, you can implement them yourself by following a very precise recipe you can google.
It always annoyed me that I had to read about these over and over again; but there is a good reason, alas, to suffer understanding the commonly used algorithms: almost all the time used by computers is spent iterating through data or searching through data. Iteration (traversals) we talked about when discussing data structures, and will talk more about separately; and searching – well it is much faster if you are searching through sorted data.
If your data is not sorted, all you can do is go through all of it, until you either run out of it or find what you are looking for. This is called linear search, and the larger your data, the longer it takes on average (it is of order n, where n is the size of your data).
If your data is sorted though: check the value in the middle, if it’s bigger than what you’re after, check the value in the middle of the left half, if it’s smaller, the value in the middle of the right half. Keep doing this till you find it, or realise it ain’t there. This is called binary search (or bisection).
Binary search is of order lg(n) (logarithm base two of the size of your data). That means that if you have 8 values, it will take at most 3 lookups, for 256 values, at most 8, for 16,000 values, at most 14 … it gets better and better compared to linear search the more data you have.
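A minimal Python sketch of it (in real life the standard bisect module does this kind of thing for you):

def binary_search(sorted_values, target):
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid                 # found it, here is its position
        if sorted_values[mid] < target:
            low = mid + 1              # look in the right half
        else:
            high = mid - 1             # look in the left half
    return None                        # it ain't there

print(binary_search([1, 3, 5, 8, 13, 21], 13))   # 4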
So … it’s well worth sorting your data whenever you can. But if you don’t think about it, sorting itself can be really slow. The standard algorithms get clever, and like with data structures, we spent ages figuring them out. It is well worth studying them in detail, even implementing them, just because they contain a whole lot of knowledge embedded in them. The techniques used in them are used again and again to make pretty algorithms.
I will not explain them – so much has been written about them, I stand no chance in making the best explanation. Google: bubble sort, selection sort, quick sort, and heap sort. Quick sort in particular is one of the prettiest algorithms around, and the central trick in it is a powerful technique that you will end up using over and over again in your own algorithms.
Also Google “searching algorithms” and “sorting algorithms”, just to see what else is there.
No, really do. And implement them yourself, and test them. It will hurt, but you will learn so much.
Variables
Variables are named things that have a value, a type, and a scope.
They let us talk about algorithms without having to refer to the entire domain of values, or having to constantly refer to the meaning of a particular value that we are using. Instead of constantly repeating “the integer value that is used as an index to access a particular value in an array of strings”, you declare variables index and string_array and off you go. They document the meaning of values in the context we are dealing with.
Scope and Type
The context is usually limited, a variable’s name is only meaningful within a function, or a certain block of code, or while some larger object that the name is a part of exists. This is called the scope of a variable; a variable is only meaningful within its scope. Meaning is life, so people often talk about life-time of variables; what they mean by that is the span between the start and end of the scope of a variable.
Programming languages have strict rules about scopes of names of things, and compilers implement algorithms to keep checking it. It helps a lot, because switching contexts is hard work for our minds, and mistakes are very common.
There are, in most languages, such things as global variables, they are just variables whose scope is the entire application. You are always told not to use them, because it is easy to get confused and have some piece of code change their values in ways that change behaviour of some other part of code (this is another kind of side-effect). What it really means is “it’s too hard to think about clear distinctions between contexts when they are within other contexts”. It is for some people, so I generally try to be kind to them and not use global variables unless it really makes a lot of sense.
Most often it makes sense to make the scope of your variables as small as possible; there are always a lot of them and it becomes hard to keep track. It’s a stylistic guideline only though, you own the text of your code, you decide what is the best style to convey the meaning to the reader.
The type of a variable is the type of the values it takes. It is always the same as the type of values, except in two cases: references and constants.
A variable can be a reference to another variable, a new name for the same thing; then its type is “reference to the type of value” (reference to integer, reference to string). It helps to use this when the same values are used in different contexts, you rename things so that they make more sense within each context. A special case of references are pointers. A variable that contains a pointer to a value is of the type “pointer to the type of the value pointed to” (pointer to integer, pointer to string). A pointer itself is just a number, so in the implementation on the machine these are always integers, but in code the meaning changes significantly depending on what the value pointed to is. The distinction matters because you can do things to pointers themselves that do not affect the values pointed to. For example you can “increment” a pointer variable to make it point to the next value in memory. So a pointer variable is two abstractions at very different levels packed into one. Takes some getting used to.
A variable can also be constant (I know, an oxymoron), this is then just a name for some constant value you use in your algorithm. The most common example in books is PI, instead of writing 3.14159 every time. Far more generally, you want to name things to convey their meaning better. Not doing it means that you will have numbers or other constants in your code that your reader will have to understand every time – people call them magic numbers, sarcastically. The type of those variables is “constant type of the value they name” (constant integer, constant string…).
In most languages the type of the variable is static: once declared or deduced it cannot be changed. In others, you are allowed to assign values of different types to the same variable and this changes the type of the variable. This is called dynamic typing and is used in Python.
A variable comes into existence when you declare it, when you mention it for the first time. In most languages you have to do this explicitly and you have to specify the type of the variable at the same time. Some languages let you just start by assigning a value to a new name and they take this to be the declaration; they deduce the type of the variable from this first assignment (the type is the same as the type of the value assigned). Python does this. Some languages insist on type as a part of the declaration, but do let you explicitly tell them to deduce it for you like this (C++).
Once declared, the variable has a name, a type, and a scope. Assigning the first value to it is called initialization. This is sometimes implicit, if you don’t do anything about it, a default value is assigned to it. Most often you will want to assign your own value to it, even if it is the same as the default value; it is most often easier to read stuff when you don’t have to remember the rules about default values for various types.
There is often more to do when a variable is created, and this depends on the type of the variable. If it is of a complex type, components of it need to all be initialised, often some storage needs to be allocated for it on the heap too. This process is called construction and many languages provide special syntax for writing functions that deal with construction for your own types – constructors.
When the code execution moves beyond the scope of your variable, the variable is said to go out of scope. It is then destroyed. There may be things to do when this happens, mostly release memory (or other resources) that you acquired at construction time. This is called destruction and some languages have special syntax to write functions that deal with this – destructors.
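A minimal Python sketch of the idea (the class is made up, and exactly when __del__ runs is up to the interpreter, so don’t take the timing too literally):

class TempBuffer:
    def __init__(self, size):
        # construction: acquire the resources the object needs
        self.data = bytearray(size)

    def __del__(self):
        # destruction: release whatever was acquired
        print("buffer released")

buf = TempBuffer(1024)
del buf   # the last name for it is gone; in CPython the buffer is released about here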
Assignments vs Equality
There’s an annoying detail to do with syntax in most languages.
Assigning a value to a variable is very different from checking if the variable value is equal to something. Assignment is a complicated thing, allocate memory, check the types, maybe do some type conversions, copy data into the memory. Equality checking is just comparing some bits. But more importantly, the meaning is very different.
Alas, when writing this sort of thing on a piece of paper in mathematics, it is always completely clear from the context which one it is, and we just use ‘=’ for both.
In programming languages we have to rely on compilers to figure out the meaning, and they just can’t always do that.
Especially since some bright spark while designing C decided that assignment operation can also be considered to be an expression, just so that you can write a=b=0 (assigning 0 to both), and now suddenly you can check if an assignment is true or not (!). Ok, ok, it is a little deeper than that: an assignment is conceptually a function, it does things, and if you are viewing programming from a functional point of view, it is reasonable for functions to be returning values; and it is reasonable for the value of the assignment to be the value of the variable; and on the other hand, it is reasonable to be able to check any value for equality with another value; in the end, the problem is actually with our conventions in mathematics being ambiguous, we just didn’t care before, because we could always add a note in a human language to clarify it to a human reader.
Anyway, in most programming languages you assign values to variables using ‘=’ and you check if things are equal using ‘==’. In Javascript there is also ‘===’, because there is more than one way to compare things. If you ask me, that’s just rude.
Implementation of Variables
At the machine level, a variable is just some memory that machine code manipulates. If possible, compilers will try to generate code that keeps all the memory that contains data for a variable in a contiguous block. This helps reduce the amount of paging, because once you access a variable, you’re likely to do it again soon. The pointer to the address where the block of memory starts will end up in a register on the CPU whenever needed, to make access faster.
Where is this memory allocated?
Stack vs Heap
As often as possible, compilers will try to allocate memory for your variables on the stack (this is called stack allocation, and these variables are said to be on the stack). Because both allocation and deallocation on the stack are very very fast.
Rule of thumb is that local variables within a function will be allocated on the stack, but different programming languages have different rules about the details. Compilers will make sure that once your variable goes out of scope, stack memory is deallocated automatically.
However, stack is limited in size, bigger things need to go to the heap memory. In some languages anything that is big or complex gets allocated on the heap automatically (Java, Python), in others you decide (C, C++).
When you allocate something on the heap, you only ever get a pointer back. You access the value of your variable through the pointer, syntax will be provided for you to do that in the language. In languages where you do this yourself, you are also responsible for deallocating the memory you got yourself, when you are done with your variable. Generally you invoke a special function provided to do this. This is a pain. In all other cases, the memory gets deallocated automatically, so you forget, and then you have this memory that is allocated to you, but the variable is gone out of scope and you can’t even access it to get rid of it. This is called a memory leak – eventually this lost memory piles up and you just run out of RAM, your next attempt to allocate memory fails and your program crashes.
Almost all languages provide features to help with this: C++ has smart pointers, Java has garbage collection and very strict scoping rules to help it deduce when it is ok to free memory, Python just deals with all of this for you behind the curtains, there you can only tell what’s on heap and what’s on stack by reading documentation.
Heap memory allocation and deallocation is much slower than stack; you avoid it if you can, but very often you just can’t: data types are too complicated, or too demanding. You just have to live with this, or trust a language to do it for you. When languages do it automatically, they are slower than when you do it yourself, because they have to play it safe to avoid releasing memory that you are still using.
It is burdensome, but it keeps you real; like we talked about when we were talking about pointers: the context within which you are modelling the domain you are looking at matters, and in this case it is a physical machine. When you are painting a landscape, you will do it differently in watercolours and in oil; same thing.
Naming of Variables
So much is written about this, it’s sickening.
It matters how you name things, because the name you give them carries the meaning. The names are a part of the story telling. You want them to be expressive enough so that once you are familiar with the domain, you can just read code top to bottom. It is hard, because it is not about a single name, all of them together make up the story.
That’s really all there is to it. If you want your code to be beautiful, you have to think about how you write it.
There are a million complicated conventions that promise to make this easier by means of syntax. It’s bullshit, you don’t get good style by following strict rules.
Anyway, it is worth learning about all these conventions, people insist you are a bad person if you mix them or ignore them, and will at times revert your code if you don’t follow the agreed upon conventions in their code base (assholes, AssHoles, ass_holes, strAssHoles, or ass-holes, depending on which convention you use – flatcase, CamelCase, snake_case, hungarian, kebab-case, respectively). Read about it, it does help to know about them, lest you are accused of not following them because of ignorance rather than thoughtfulness.
Choose one if you like it, try to be consistent, but above all, make it beautiful.
Logic and Branching
To make anything interesting happen, we need to be able to make decisions.
In algorithms, this means that based on something happening or not, we execute a different part of the algorithm. This is called branching; algorithms look like graphs when you draw the flow of execution, and when we make decisions, we change the branch of the algorithm being executed.
All programming languages provide means of doing this, almost all of them by means of IF – THEN – ELSE constructs. The exact syntax differs between languages, though not by much. Some provide shortcuts and special forms for when you are choosing one of a number of options (CASE statements). We’ll mention them all.
There is a deep result in theory of computation (within Church’s lambda calculus) that shows that you can do all branching with nothing but plain functions: a Boolean can be encoded as a function that simply picks one of two arguments, and with the 0/1 encoding the logic operations become min and max. Pure functional languages sometimes use this. It works, but it is very hard to implement some important optimisations, and it is harder to read. The point is – there has to be a way to branch code if your language is to be able to express all algorithms; even if the mechanics of it are not obvious.
Conditions for branching are often complex and consist of independent parts that are combined to form an entire expression of the condition. This has been studied for millennia, and we call it propositional logic.
In the 19th century, a guy called George Boole formulated a way to work with propositional logic using numbers, 0 for false, 1 for true. We call that Boolean algebra. 0s and 1s are great for digital computers, so mostly this is all referred to as Boolean logic, Boolean values and so on. Most people don’t even know what propositional logic means, they just use the term Booleans or Boolean logic. That’s fine of course, but there is much beauty and depth in understanding where it all comes from.
Propositional Logic
The core principles of propositional logic are very simple, and were formulated in almost the same form by Aristotle and later the Stoics in Greece, and in several places in India and China. There were likely more instances, lost in history.
We built up a lot of formalism around it; these are the very foundations of mathematical formalism, but roughly it goes like this:
Assume you have statements (propositions) A and B. Each can be either true or false. Then we define a small set of operations to combine them and transform them:
A AND B is only true if both A and B are true.
A OR B is true if either of them is true.
A XOR B is true only if exactly one of them, but not both, is true.
NOT A is true if A is false and vice versa.
You can then combine them into longer propositions, and if you add brackets and a few simple rules about distribution and commutativity (much like for addition and multiplication), you are all set. The top three are commutative, A AND B = B AND A etc
They are also distributive: (A AND B) OR (C AND D) = (A OR C) AND (A OR D) AND (B OR C) AND (B OR D).
AND and OR are idempotent, A AND A = A, A OR A = A.
NOT works like a minus sign for negative numbers: NOT A AND B = (NOT A) AND B (you negate first).
Also, there are De Morgan’s laws, just like a minus in front of brackets for addition: NOT (A AND B) = NOT A OR NOT B, NOT (A OR B) = NOT A AND NOT B.
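All of these rules are easy to check by brute force; a few lines of Python that just try every combination:

from itertools import product

for A, B in product([True, False], repeat=2):
    assert (not (A and B)) == ((not A) or (not B))   # De Morgan, first law
    assert (not (A or B)) == ((not A) and (not B))   # De Morgan, second law
print("holds for all four combinations")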
Curiously, and importantly, these operations are isomorphic to basic set operations too: AND acts like an intersection, OR like a union, NOT like a complement. You can think of sets of all things that are true in a given context, and logic operations as ways to find subsets of values that are true across contexts, or domains. When you have lots of data, you can start building statistics about values being in these sets or not and from there make inferences about them. This is why machine learning works. Conversely, when your domain is not simple, and you are not always sure if your data belongs to the set of true values, you can express the measure of your belief in the truth of it as a number between 0 and 1, and apply the rules of probabilities to compute values of logic operators. This is how fuzzy logic works and how logical inference would be implemented in analogue computers.
The truthfulness of a proposition is not assumed to be objective, all that matters for this to work is that the context is consistent; we agree on what we think is true and false for the starting points. After that we can apply the rules to check the truthfulness of propositions of increasing complexity, but only within the given context. The meaning of truth stays with us, as does the interpretation of it.
Because of this relationship with the meaning, these rules actually help simplify expressions and make code much more readable.
One of AND, OR, and XOR is superfluous, you can express each in terms of the other two, for example: A XOR B = (A OR B) AND NOT (A AND B).
In mathematics, propositional logic doesn’t usually bother with XOR, because it’s weird, it’s not idempotent: A XOR A = FALSE , always. The reason we have it around is because it is sometimes handy, but primarily because it is very easy to implement in electronics, and gives a super fast way to set a value of a register to 0. Initialising variables to 0 happens very, very often; and this makes things a lot faster.
Since we are there already …
Boolean Algebra
While it is at its core just another encoding of the same principles above, Boolean algebra provides a phenomenal insight that makes logic gates so effective.
If you represent true as 1 and false as 0, then:
x AND y = xy = min(x,y)
x OR y = x + y - xy = max(x,y)
x XOR y = x + y, if you ignore the carry bit (addition modulo 2)
NOT x = 1 - x
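Again, easy to check by brute force in Python (^ is Python’s XOR on integers):

for x in (0, 1):
    for y in (0, 1):
        assert (x and y) == x * y == min(x, y)          # AND
        assert (x or y) == x + y - x * y == max(x, y)   # OR
        assert (x ^ y) == (x + y) % 2                   # XOR: addition without the carry
        assert (not x) == 1 - x                         # NOT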
So, whether you find a way to compute a few basic logic operations, or a few basic arithmetic operations, or a few basic functions (min, max), it’s all equivalent and you can move between these as simply different interpretations of the same basic building blocks of computation. Figure out how to automate one and you have automated all three. Logic, arithmetic, functions. This could not have been expected a priori, we developed these branches of knowledge completely separately. There is more detail needed to make it work completely, but it is all well known and understood, and it does work: all these algebras are isomorphic.
The core units of electronics in microprocessor design are logic gates. These are small circuits that implement the four operations listed above. AND, OR, and XOR gates take two inputs, each either has voltage on or off (high or low, for 1 or 0, true or false), and one output whose voltage corresponds to the output of the operation when the two inputs are combined. NOT gate just inverts its single input. But these also implement arithmetic. There are millions of these gates combined in various ways in a CPU. The fact that they all deal with only two voltages is why they are digital computers. They are super fast. Functional languages make use of the fact that this is also an implementation of a minimal function set.
There is a little bit more needed to be able to express any algorithm there is, we will talk about this later, but the complete set of operations is very small.
You don’t generally see this, or have to think about it, when writing programs in anything other than assembly, but this is where our abstractions ultimately meet physics.
Most programmers take all this beauty of propositional logic for granted, it looks simple and intuitive because it is ingrained in our thinking: we have been relying on this explicitly or implicitly since it was first formulated. It is a mistake: all logic, including all we know about algorithms and thinking, and all the details of how computers work ultimately stem from this.
Short-circuiting
Because logic expressions are evaluated so often, programming languages very often implement an optimisation called short-circuiting.
AND is only true if both propositions are true, so if the first one is false, you don’t even have to check the second one. Similarly, OR is only false if both propositions are false, so if the first one is true, you don’t have to check the second one.
Often in code the calculation of the truth value of something is a long and complex operation (for example, one of the propositions could be that an output of a computationally expensive function is 0). Very often there is an imbalance, one of the propositions is much faster to compute than the other; for example “user ticked the box to check this AND the result of the function is 0”. You can then make code faster by simply typing the faster-to-check proposition first. Then you automatically do the slow stuff only if you must.
Another common use of this is when a part of the proposition only makes sense if something else is first satisfied. You write the pre-condition as a first part of an AND operation, and the condition you are actually checking as the second one; if the precondition is false, condition computation never happens.
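A small Python sketch of both patterns (the functions and flags are made up):

def expensive_check(data):
    return sum(data) == 0        # imagine this takes seconds

user_ticked_box = False
items = []

# imbalance: the cheap flag goes first, so the slow function never runs here
if user_ticked_box and expensive_check(range(10_000_000)):
    print("doing the expensive thing")

# precondition: the second part would blow up on an empty list, but it is never evaluated
if len(items) > 0 and items[0] == "first":
    print("found it")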
Branching
With propositional logic we have the means to compose conditions of any complexity we like. We can use this to make decisions about which part of an algorithm or computation to apply next.
The most common construct by far to do this is IF-THEN-ELSE. Every procedural programming language will provide this. The way it works:
IF Some Condition is true THEN
    Do one thing
ELSE
    Do another thing
You can nest them, and most often you do:
IF Some Condition is true THEN
    Do one thing
ELSE IF Some Other Condition is true THEN
    Do another thing
ELSE IF Yet Another Condition is true THEN
    Do yet another thing
ELSE
    Do what you can when nothing is true
The “is true” part is tiresome to type, so you can always skip it, it’s assumed:
IF Something THEN
    Do one thing
ELSE IF Something else THEN
    Do another thing
ELSE IF Something different THEN
    Do yet another thing
ELSE
    Do what you can when nothing is true
Some languages skip the THEN part, most have some way to separate the block of code that does things (most often by putting it in curly brackets, though in Python it’s just the indentation that separates it, like above), some have a keyword ELSEIF, or ELIF (Python) to make the relationship with the first IF more obvious.
The conditions themselves are always implemented as expressions that evaluate to a value of a Boolean type. It is cool and helpful, because these can be complicated and you can have variables of Boolean type to store them if it takes a complex algorithm to decide, or you can have functions that return Boolean values.
Boolean Type
In other words, there is an abstraction of a Boolean type: the type of a value that can be used in a branching statement. Because of the importance of making decisions in algorithms, it is one of the core abstractions, if not the single most important one. You can express any algorithm as an evaluation of the Boolean value of an expression (the programming language Prolog works like that).
Most languages will provide a Boolean type as one of the PODs, simplest types built into the language itself. Some will use integers instead, 0 for false, anything else for true (C does this), but the meaning doesn’t change.
Because of how ubiquitous all this is, many languages will provide shorthands for common conditions – in Python, you can view a string or a list as a Boolean, if they are empty, they are false, otherwise they are true; in C++ if a pointer is NULL it is false; Javascript goes crazy and does this with almost any type (and has the concepts of truthy and falsy to describe this). So in Python if you have a string variable called user_input, you can write:
if user_input:
    print("you said: " + user_input)
else:
    print("you typed nothing, asshole")
Other Forms
Very often the conditions in a series of if – else if statements are simply checking which of a range of values a variable takes, something like:
IF colour is red THEN …
ELSE IF colour is blue THEN …
ELSE IF colour is purple THEN …
…
ELSE …
All the else ifs and thens look like fluff, they distract from the fact that we are switching between a limited number of options. Many languages provide switch-case statements for this, along the lines of:
SWITCH colour:
    CASE red: …
    CASE blue: …
    CASE purple: …
    ...
    DEFAULT: …
They are most often used with enumeration types, and compilers would then check and warn if you missed a value in your list of cases. At the machine level, case statements can be implemented even more efficiently than if – else if chains.
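Python grew one of these fairly recently, the match statement (Python 3.10 and later); a sketch:

def describe(colour):
    match colour:
        case "red":
            return "warm"
        case "blue":
            return "cold"
        case "purple":
            return "confused"
        case _:                  # the DEFAULT branch
            return "no idea"

print(describe("blue"))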
Languages that support lists and functional programming constructs provide ways to use Boolean conditions as filters. Create a new list from an existing one by picking only those elements that satisfy a condition. In fact, filtering like this is one of the basic constructs of functional programming. (Python and C++ also support this).
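In Python that looks something like this (the condition is made up):

numbers = [3, -1, 4, -1, 5, -9]
positives = [n for n in numbers if n > 0]                  # comprehension with a Boolean filter
also_positives = list(filter(lambda n: n > 0, numbers))    # the functional-style spelling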
Many languages provide shorthand for assigning one of two values to a variable depending on a single condition.
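Python’s version of that shorthand:

items = []
label = "empty" if len(items) == 0 else "has stuff"   # a one-line IF-THEN-ELSE that produces a value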
And so on … branching based on results of Boolean expressions is everywhere.
Implementation of Branching
Branching is one of the core requirements for the ability to express all algorithms. CPUs implement direct support for it and many optimisations are built into them for this.
At a machine level, there is no concept of structure of code in terms of code blocks between THEN and ELSE, or even functions. It’s all just instructions, one after another, each at a particular memory address. The address of the next instruction to be executed is stored in a special register (the Instruction Pointer). The CPU reads the Instruction Pointer and executes the instruction at the address it read there. Every instruction, at the end of its execution, updates the Instruction Pointer to point to the next instruction to be executed. Normally it is just the next instruction in code. There are special instructions though, that change the value of the Instruction Pointer to jump to something else.
Unconditional jump instructions just put in some constant value; conditional ones are explained below.
The way branching is mostly implemented is using these jump instructions.
This relies on the insight that the entire IF-THEN-ELSE construct consists of four parts: the condition (stuff between IF and THEN), the block of code to execute if the condition is true (stuff between THEN and ELSE), the block of code to execute if the condition is false (stuff between ELSE and the end of the entire construct), and whatever comes next in your code (let’s call it the rest of code).
First, all the code is laid out like this:
- First come the instructions that compute the condition.
- Then the instructions that put that result into one of the general purpose registers (say RA).
- These are the instructions generated by the code between IF and THEN in your program.
- Immediately after that goes a conditional jump instruction. (explained in a minute).
- Then all the instructions created by your code between THEN and ELSE.
- Then an unconditional jump instruction that makes the CPU skip the next block and land on the first instruction of the rest of the code.
- Then all the instructions in the block after ELSE.
The execution then goes like this:
- The condition is evaluated and the result is stored in RA.
- Conditional jump instruction happens next. It looks up RA.
- If the value in it is one, it just increments the instruction pointer by one, and goes away. The execution continues in the block between THEN and ELSE. Just before ELSE, it hits the unconditional jump instruction we put there, and skips the ELSE block.
- If the value in RA is zero, the condition is false, and the conditional jump instruction puts into the instruction pointer the address of the first instruction after the ELSE, the first one after the unconditional jump. The execution then jumps there and executes the ELSE block.
- Either way, once all the instructions in the construct are executed, the execution continues where the rest of the code starts.
It is a little involved to follow, I know. The key point is that there is very little special that has to be done on the CPU level – the FETCH-EXECUTE-WRITE execution cycle and a little bit of Instruction Pointer manipulation just make it work. All the instructions end up in a sequence in RAM, the next instruction is always the one pointed to by Instruction Pointer, and it’s all very, very fast.
It all relies on the layout of the code in memory, and code encoded as pure data – everything is just numbers. The structure of the data is as important as the values of the data and the interpretation of the data. We need all three to describe algorithms, and the thing that makes them a whole is the meaning. Whatever the meaning is, wherever it comes from, it is not directly encoded, it emerges from the interplay of the three components. We feel it, know it, and rely on the ability of others to feel it and know it in order to convey it. That is the part that we cannot automate, and that is where beauty lives.
You can be, up to a point, an efficient programmer without ever recognising this, but you cannot build good, beautiful, or elegant things without relying on it. It cannot be described by rules; we keep trying to, only to discover more evidence that we will always fail.
To be a good programmer you must be brave and embrace chaos, with nothing but what you feel to be good to rely on for guidance.
Iterations\Loops\Recursion
Like branching, the ability to perform the same task many times over, with some parameters changed every time, is another core requirement for the ability to express algorithms.
Amongst the many ways to do this, eventually we settled on the few most common and convenient abstractions.
The first task that we wanted to (and did) automate with computers was solving mathematical equations using numerical methods. We knew about these for a long time, way before we had any chance of building a machine to do it. The way these methods generally work is that for a particular equation you figure out a formula into which you feed a guess of a solution, and it gives you back a better guess. You feed this better one in, and get an even better one. You keep doing this until the difference between what goes in and what goes out becomes really small, and then you know that you are close.
These repetitions are called iterations, from Latin iterare – to repeat; we came up with the important ones while most science was still being written in Latin (Newton & co). Later we started putting solutions to important equations for a series of parameters into tables that we then used when computing statics and dynamics for bridges and locomotives. The tables were entire books of numbers, entire industries relied on them. The people calculating the table values were called computers. It's a horrifically boring and repetitive (ha!) job, and the tables were full of errors. From around halfway through the 19th century we were trying very seriously to automate these iterations. Babbage was obsessive about this. Even then, famously, Ada Augusta Lovelace, being a daughter of a poet, realised that the potential extends far beyond just that.
We now use the term in a more general way: any repetition of a task, often, but not always, with changing inputs.
When we started building computers that use von Neumann architecture, we started thinking about an execution of an algorithm as being linear; a series of instructions executed in order, like following a thread. Repeating a task there is easily implemented by changing values of registers that are used as inputs in a single iteration and then jumping back to where the iteration starts … like making a loop in a thread. So we started calling these loops.
Recursion
We will talk more about it later, but for a machine to be able to execute all algorithms, iterating has to be possible. The view of computation taken by the functional programming paradigm is divorced from the details of implementation, and within it, it makes little sense to think or talk about sequences of instructions. There, the only unit of execution is a function. But we still need to be able to iterate. To do it there, we write a function that performs a single iteration and takes the iteration parameters as arguments. Then within the function, once we have the result, we check if it is time to end the iteration. If it is, we just return. If not, the function invokes itself again, but with the parameters for the next iteration. It keeps invoking itself until it's done. This is called recursion and such functions are called recursive functions. We'll talk more when we talk about functions. The important point now is that it is a proven result that the computational power of loops and recursion is the same; anything that you can write as a loop, you can write using recursive functions, and vice versa.
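To make the equivalence concrete, here is a small sketch in Python – the same sum written once as a loop and once as a recursive function that passes the changing parameters as arguments (the names are made up for the example):

def sum_loop(xs):
    total = 0
    for x in xs:          # the loop changes total on every pass
        total += x
    return total

def sum_recursive(xs, total=0):
    if not xs:                                   # time to end the iteration
        return total
    return sum_recursive(xs[1:], total + xs[0])  # invoke itself with the next parameters

print(sum_loop([1, 2, 3]), sum_recursive([1, 2, 3]))   # 6 6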
Recursion is rarely the clearest expression of an algorithm; when it is, you will know it. Most often you use one of the other forms provided in most languages. In functional languages you don't get much choice – recursion is the primary approach, often the only one. That is a large part of the reason why they don't get used much.
Loops
There are variations to how loops are supported in programming languages, but generally they come in two groups; most languages support both.
The first one is some variant of WHILE loops.
They are very generic, and have two components: the loop body, the part of the code that gets executed over and over again, and a condition, a Boolean expression that is evaluated before each repetition. If the condition is true, the body gets executed, and we come back to the condition again; otherwise the body is skipped and execution continues after the loop.
Very much like the IF statement, except that there are no else blocks, and every time you execute the body, you try again. At CPU level that is how they are implemented – just like an if statement, except that the jump at the end of the first block makes the code jump back to where the condition is being tested.
In this form, the condition is often called pre-condition (because it is checked before the body of the loop is executed).
Syntax varies hugely from language to language, but looks something like this:
WHILE condition
loop body
Sometimes languages provide a way to check the condition at the end of the loop, often with keyword until, something like this:
DO
loop body
UNTIL condition
It's not as useful as it seems and it never really caught on; you see it very rarely. The reason is that it is clearer for humans to see the condition at the top of the loop – it is like a subtitle that explains the meaning of the loop.
The condition can be any Boolean expression, but you will generally write it in terms of some variables that change their value in the body of the loop, or some function that checks the current state of the world.
By far the most common use case in the early days of programming languages was the one where you simply wanted a thing repeated a certain number of times. For this, you would always declare an integer variable, initialise it to 0, then increment it on every pass through the loop, and stop when it reaches the count you want. Something like:
i = 0
WHILE i < 100
do your thing
i = i +1
This was so common that we came up with a shorthand for it even before we had WHILE loops (in FORTRAN) – FOR loops.
These original ones looked something like this:
FOR i = 0 TO 99
do your thing
NEXT i
It turns out, this is very readable: it tells you all you need to know about the looping part in the first line, and you even get a variable that keeps the count. You can nest them easily too (the body of the loop can be another loop). For a bit more flexibility, languages let you specify the range to be anything you like, and even change the increment from 1 to something else. To print every 10th integer between 100 and 200:
FOR i = 100 TO 200 STEP 10
print i
NEXT i
When they were designing C, they realised that you can generalise this a whole lot. You have some parameters that you declare and initialise at the start, a condition that is checked at every entry into the loop, and some code that updates the state of the system every time you go through it. FOR loops in C, C++, Java … the whole slew of languages are very generic like that. In C, the above would look like:
for(int i = 100; i < 201; i += 10){
    printf("%i\n", i);
}
The entire specification of any loop you can come up with is all in one line at the top, and it has three parts: initialisation of parameters, the condition, and the iterative step updates. Each can be empty if you don't need it.
It is a good abstraction. In most languages some variant of this is the most common way to write loops.
Traversing Data Structures/Iteration Over Containers
You can express any algorithm as organising your inputs into the appropriate data structure and then applying a function to each element. On its own, this doesn’t really say much new, after all, this is what CPUs do; but this is a very powerful technique and very frequently this is what you end up doing.
What it means is that by far most of the loops you write are there to traverse some container.
Even in the simplest FOR loops with a single integer counter, that counter is in fact most often used as an index into an array.
Most programming languages provide special syntax for doing this; in many they are called “FOR EACH” loops. Looks like:
FOR EACH element IN list
print element
This prints every element of the list called list. This is another generalisation of FOR loops, of course.
In this abstraction, every loop is just an application of some code to each element of some data structure, and in each step the next value from the container is assigned to the loop variable.
We call this iteration over a container.
Python doesn't even have the counting FOR loop; all FOR loops iterate over containers. It has a data structure called range to do the counting version:
for i in range(100, 201, 10):
    print(i)
is the Python version of the loop in the previous section.
Many languages let you extend this further, by allowing you to implement your own "iterators" – these are objects or functions that tell the language how to move between elements of a container. For example, for trees you could implement different iterators to traverse them breadth first and depth first. I don't think I ever implemented one, most people never do. It's hard and fiddly, but also the existing abstractions that you get included in a language are a pretty good set. I can think of examples where I would want to do it, I just never came across one yet; there aren't many. (E.g. if I wanted to iterate through the bytes of a number of files in order, without having to know that my data is split across a number of files, I might write an iterator that goes through one file as if it was an array of bytes, and when it gets to its end, automatically opens the next file and starts going through that one – see the sketch below.)
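For what it's worth, in Python the idiomatic way to write that hypothetical files-as-one-stream iterator would be a generator function; a rough sketch, with made up file names, could look like this:

def bytes_of(*paths):
    for path in paths:                 # when one file runs out, move to the next
        with open(path, "rb") as f:
            while True:
                chunk = f.read(4096)   # read in chunks, hand out byte by byte
                if not chunk:
                    break
                yield from chunk

# iterate over two files as if they were one long array of bytes
total = sum(1 for _ in bytes_of("part1.bin", "part2.bin"))
print(total)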
Functional View of Container Traversals
In functional languages, there are no loops like those above. You can always implement them using recursion, but that is quite long winded and error prone. It's great when you start, but quickly you end up thinking more about how to get the recursion right than about your own context.
So they all provide ways to iterate over data structures using functions, and these techniques are by now available in most programming languages (Python loves them, C++ provides them in libraries that are always used, Java too).
The terminology in functional programming languages always refers to lists as the only data structure; you can implement everything else using lists and this minimalism helps prove theorems about the completeness of the language, speed of algorithms etc. So we will use that too, but assume that it is all easily adaptable to other data structures.
Ok, why would you want to traverse a list?
Turns out, there are only a few basic reasons for it; everything else is some combination of those. They are generally described as creating new lists from old ones; all of them can be optimised by applying the operations to a list in place if memory is precious, but that doesn't change anything relevant here.
- (accumulation) To aggregate the results of some function being applied to every element.
For example, to sum all elements of a list, or multiply them, or any other operation that you can apply cumulatively. In general: choose a binary operation (say '+') and a default value (say '0'), and set a variable to the default value. Then go through the list, apply the binary operation to your variable and to the current element, and store the result back into your variable. Move to the next element. By the time you are done, your variable contains the result of the cumulative application of your binary function to the list.
- (filtering) To create a new list that contains only those elements from your list that satisfy some Boolean condition.
You create an empty list. You go through your original list, check for each element if it satisfies the Boolean condition, and if it does you copy the element to your new list. Generally, you implement the Boolean condition as a function that takes an argument of the type of the element and returns a Boolean value.
- (appending) To append one list to the end of another.
You create an empty list. You go through the first list, appending a copy of each element to your new list. Then you do the same with the second list.
- (mapping or transforming) To create a new list by applying some transformation to every element of the original list.
You implement a function that takes an argument of the type of the element of the list, and returns a value of the same type. For example, to double every element in a list of numbers: f(x) = 2*x. Then you create an empty list. Finally you go through the original list, apply your function to the current element, and append the result to your empty list.
- (zipping) To create a new list by combining elements of two (or more) lists element by element.
This is very similar to mapping, except that your function takes two arguments and returns one. For example, to add corresponding elements of two lists, you would use the function: f(x,y) = x+y
I don’t think I am missing anything, though look it up if you need very formal results to rely on.
The point is, with this set of operations, you can implement any algorithm you like, and all iterations across containers will be either one of these or a combination of them. They can all be implemented as functions that take as arguments an original list or two and a function to apply. For example:
accumulate(input_list, accumulated_value, accumulator_function):
    if input_list is empty return accumulated_value
    split input_list into head (first element) and tail (the rest of the list)
    accumulated_value = accumulator_function(accumulated_value, head)
    return accumulate(tail, accumulated_value, accumulator_function)
Now, to sum all values in a list
f(x,y) = x+y
accumulate(input_list, 0, f)
To multiply them:
f(x,y) = x*y
accumulate(input_list, 1, f)
You implement your actions in functions, and you compose various ways of applying them till you are done. Functional programming languages do everything like that – using function composition as the core algorithmic construct. The splitting of a list into a head and a tail is a general technique to implement iteration using recursion.
Function composition, and the higher order functions that implement iteration over containers using it, are available in most languages though, and they often make code very beautiful.
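In Python, for example, the traversals listed above map onto built-in higher order functions (reduce lives in the functools module); a quick sketch:

from functools import reduce

xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]

total    = reduce(lambda a, b: a + b, xs, 0)       # accumulation: 10
product  = reduce(lambda a, b: a * b, xs, 1)       # accumulation, default value 1: 24
evens    = list(filter(lambda x: x % 2 == 0, xs))  # filtering: [2, 4]
doubled  = list(map(lambda x: 2 * x, xs))          # mapping: [2, 4, 6, 8]
sums     = list(map(lambda x, y: x + y, xs, ys))   # zipping: [11, 22, 33, 44]
appended = xs + ys                                 # appending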
Not all languages let you pass functions to other functions as parameters (this is called treating functions as a first-class type, just to sound clever). Those languages that do not let you do that are generally crap.
On the other extreme, there are purists who think you should do everything this way. These are people who think there is one best way to do something. Nuff said.
Python uses lists a lot, and provides special syntax that lets you implement the combinations of filtering and mapping in a very pretty and succinct format, called list comprehensions. Google “Python List Comprehensions”, they are really pretty.
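For a taste, here is a comprehension that combines filtering and mapping – keep the even numbers, double them:

xs = [1, 2, 3, 4, 5, 6]
print([2 * x for x in xs if x % 2 == 0])   # [4, 8, 12]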
Infinite Loops
Amazingly, and beautifully, you can’t remove this danger from a programming language: to be able to express all algorithms, a language must provide for the possibility of infinite loops. In practice, you run out of either time or memory and that stops your loop, but the infinity must be available in principle.
Some loops are infinite on purpose. This is very useful, for example, when implementing event driven applications. There you write code that runs in a loop, checking in each pass if a new event has arrived, handling it if it has, and then continuing to wait. You always add a "sleep" period to this – CPUs let you put your code to sleep for a while to not waste energy, and for a loop that only checks if a message queue is empty that makes a lot of sense; otherwise you'd end up keeping your computer very busy checking millions or billions of times a second for something that happens once or twice a second. Loops used like this are often called message loops, event loops, or message pumps. They are at the heart of anything that has a GUI – you draw your stuff on the screen, then keep checking if something happened, till they click a button or move a mouse; handle it, then wait again. You only stop when your program stops; to your program that is the end of time, so it may as well be considered infinite.
Most often they get implemented as while loops with the condition always true:
WHILE true
loop body
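A rough sketch of such a message loop in Python (the queue and the handler here are stand-ins, not a real GUI framework):

import time
from collections import deque

events = deque()               # something else pushes events onto this queue

def handle(event):
    print("handling", event)

while True:                    # runs until the program is stopped
    if events:
        handle(events.popleft())
    else:
        time.sleep(0.01)       # sleep a little so we don't burn CPU spinning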
Functions
Functions were our first abstractions of algorithmic computations. We used them long before we knew this is what we were using them for, or that this is what they are to us. They go all the way back to Descartes at least.
The reason we are only talking about them now is that while trying to find ways to automate them we learned so much about them that we now have a rich set of other abstractions to use in the conversation about them; we had to cover those first.
For centuries we only talked about functions as implementing computations involving numbers. This is because a number is a very generic abstraction that we can apply to so many things; and genericity together with succinct expression helps when there is a lot of work and few people equipped to do it, with communication spanning centuries. There were exceptions, for example, probabilities are functions that operate on sets. It took us a very long time to first understand that there are underlying ideas in common and then formulate abstractions to converse about them.
Now that we are ready, let’s finally start.
Function Type
A function is a pure abstraction, the stuff of thought and mind.
It is a thing that assigns a value to an ordered set of other values. Most often it assigns the same value to the same set of values every time; when that is the case, we call the function deterministic, or referentially transparent. In the context of programming, when this is not guaranteed we often call functions volatile. For example, functions using random numbers or the current time are volatile.
The members of the ordered set of values that a function maps to something are called arguments or parameters. The ordering is not strictly necessary, but we do need some way to keep track of the meaning of arguments. Most often we also name them, and a definition of a function will have named arguments in a strict order. When using the functions, some languages let you use just the names, most insist on the order too.
The value that a function maps to is called the return value.
Both the return value and the arguments can be of any type we can think of. The type of the return value is often called the return type.
The type of a function is called a signature, and it is composed of the return type and the types of the arguments in the order in which they are declared. Brackets are often used to separate the argument types from the return type. For example, in an invented notation, the signature of a function that adds two integers together and returns an integer would be: int(int, int); a signature of a function that does the same, but also converts the result to a string, would be string(int, int). For fun, the signature of a function that takes a list of integers and a function that computes one integer from another, and returns a list of integers transformed by the function that was passed in as the argument, would be: list<int>(list<int>, int(int)).
Some languages do not treat functions the same as other data types: you cannot always pass a function as an argument to a function, or return a function as a return value from another function. There is no real reason for this kind of restriction; it's only there sometimes because it is not easy to implement this functionality if you don't plan for it from the start, and this understanding of functions as objects is relatively new. The languages that do allow it and support it are said to treat functions as first-class data types.
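In Python's type-hint notation, for example, the invented signatures above come out roughly like this (a sketch; the function names are made up):

from typing import Callable, List

def add(x: int, y: int) -> int:            # int(int, int) in the invented notation
    return x + y

def apply_to_all(xs: List[int], f: Callable[[int], int]) -> List[int]:
    # list<int>(list<int>, int(int)) in the invented notation
    return [f(x) for x in xs]

print(apply_to_all([1, 2, 3], lambda x: 2 * x))   # passing a function as an argument: [2, 4, 6]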
Function Meaning and Overloading
Functions also have names. Two functions of the same type can be doing very different things, and have a different meaning. So we name them. For example, a function that adds two integers and a function that multiplies two integers have the same signature but different meaning.
Meaning is independent of types. Two functions can perform conceptually the same task, with the same meaning to us, but have different signatures. For example, adding two integers is in most contexts indistinguishable to us from adding two floating point numbers. We can also extend this if it makes sense in a context; often we think about concatenation of strings as addition of strings.
Declaring functions to have the same names but different signatures is called overloading. It is a powerful story telling technique. For example, it is easy to assign a meaning to one thing being less than another, and to implement functions that check that; and that is all we need to be able to talk about ordering, sorting, and searching through things, no matter what their type is.
Function Body
To implement a function, to make it do what it is meant to do, we need to provide code that describes an algorithm that computes the return value from the arguments. This is called the body of the function, or the implementation. The body of the function will refer to arguments by the names that we give them. Within the context of the function body, the arguments have a type and a name, and once the function executes they will take space in memory. Within the context of the function implementation, arguments are variables. The text in the code that provides the function body is called the function definition. The function body will contain code to do what the function does, and compute the return value. Most languages provide a built-in statement "return" which takes the return value as a parameter. When execution hits that statement, the function exits and returns the return value. There could be more than one exit point, more than one return statement, because the algorithm for computing the return value could be branching.
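For example, a small Python function whose body has more than one exit point, because the algorithm branches:

def sign(x):
    if x > 0:
        return 1    # first exit point
    if x < 0:
        return -1   # second exit point
    return 0        # third exit point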
Function Invocation
You can also think of a function as a description of a procedure to compute a single value from a number of other values. You can think of arguments as inputs to that procedure and the return value as output, or outputs. To use a function, we invoke it and we pass values of arguments to it, and we get back the return value. That’s the terminology; the syntax is usually using brackets – function name followed by argument values in brackets.
Almost all languages provide something called “infix notation” or operators, a special syntax to invoke some functions so that you can write code that looks prettier. For example, all arithmetic functions are written like this, using +,-,*,/ in between arguments instead of writing things like add(x,y). This is a kind of syntactic sugar; the first thing a compiler does is convert infix notation to function invocations, it’s just syntax for beauty.
Uniqueness and Encapsulation
A function is uniquely identified by its name and signature.
To use a function, or to talk about it, this is all we need to know. Many languages allow you to declare the name and the signature of the function separately, and just provide hints to the compiler about where the definition is. This is called a function declaration. There can be many of those, any context that uses the function needs the declaration. In contrast, there can be only one definition for each function in a program.
The definition (the implementation of the function) could be in a different file, or even in a different binary library – often you don’t even get to see the code that implements the function; someone provided it in a library and you just trust them. This helps keep the context abstractions purer. Often, the part of the compilation that finds and hooks up definitions with function invocations in your code is done separately, and often it is even a separate application called the linker. The linking step can either merge the compiled instructions that are the implementation of the function into your program, or add instructions that invoke the function from a separate dynamic library. Because linkers need to be able to find a function definition, the definition will always also contain the declaration.
Notice that names of the arguments are not necessary to uniquely identify a function – their types and the order in which they appear are enough. Most often we still provide them in a function declaration, because their names explain their meaning.
Neither is the body of the function necessary to identify a function: a function is a pure abstraction, the same results could be achieved in many ways. We don’t need to know how a function does what it does, as long as we know what it does.
This is an important point. This is why we use functions in the first place. There are many names for it: encapsulation, modularisation, divide-and-conquer approach, hiding the implementation details, and more. Ultimately, they are a tool to introduce new abstractions into a context.
Side Effects
Sometimes we don’t explicitly list all the arguments to the function, and sometimes we don’t explicitly declare all the return values or even make it visible who or how will use some of the return values. For example, a function may rely on user input, a global variable, or on the content of a file, a database, an internet connection, or anything else that is not considered a part of the context we are in. Equally, it could send a message to the outside world, write data to a file or a global variable, on the screen, and so on. These are called side-effects. Functions whose inputs depend on side effects are not referentially transparent. They are also harder to keep track of; they are a pain, because you cannot know about them just by looking at the function declaration. Generally you guess from the function name (function write_to_database is likely to have side effects), read the documentation, or find out when things start breaking.
On the other hand, a program without side effects is an artistic exercise in using electricity to generate heat. If it doesn’t produce outputs, that is really the only thing it can do. So, you don’t want to not have them.
People spend lots of time coming up with patterns and recipes about how to minimise the number of side effects, call them evil and bad practice, smirk and grumble. Oh well, people do people. In the end, you should only try to avoid side-effects if you think they make your code harder to comprehend.
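A small illustration in Python – the first function depends only on its arguments, the second reads and writes things that live outside its own context (a global counter and the screen), which is to say it has side effects:

call_count = 0

def add(x, y):
    return x + y                                # no side effects, referentially transparent

def add_and_log(x, y):
    global call_count
    call_count += 1                             # writes to a global variable
    print("add called", call_count, "times")    # writes to the screen
    return x + y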
Composing Functions, Lambdas
The most important thing, other than invocation, that you can do with functions is compose them. This is often made into a big deal, and I am not sure why.
Functions invoke other functions, or take the return values from other functions as arguments. Sure, obviously. We compose abstractions into more complex abstractions.
I guess it is important to recognise that you can express any algorithm using functional composition, as long as you allow recursion.
It is useful to remember function composition as a stylistic guideline. When the text of your code becomes long, it's harder to keep track of all that's happening, so you break it up into functions. When you find yourself repeating a passage (a block of code), you name that process and you wrap it into a function. Then it's often easier to read, but also if you have a bug in that block, or just want to change what it does, you only have to do it in one place.
Many people believe that purgatory awaits as soon as you write a function longer than some small number of lines (20-50 usually). To me that is the same as insisting that in a novel your passages and sentences must be short or it’s a bad novel; then again, some people think that too, so…
Anyway, back to abstractions. Function composition happens all the time. You can build abstractions upon abstractions using it, and most languages allow you to pass a function as an argument to another function to do that. We talked about this in terms of iteration, but it is a very general technique. There are very often patterns of function executions that are applicable to many functions, we can write quite generic algorithms and implement beautiful patterns to deal with those.
Often, as it turns out, in these situations we end up writing very small functions only to apply them once, using one of these patterns.
For example, say your program holds a long list of unsorted things. There isn’t an obvious way to sort them, so when you search through them, you always have to do a linear search. Then, as often happens, say that the data in the list is of some complex type, each element with many properties. You can think of it as a database table with many columns. Now, in different situations you would want to be able to query this table in very different ways. Find all rows where the fifth and seventh column are the same, and the third column has a particular value. Or anything. Whatever your condition is, you would always do the same thing: go through your table row by row, check if it satisfies the query condition, if it does, make it a part of the return value, if not, just move on to the next row. The only thing that changes is the condition. Well, if you write a function that takes as input an entire row and returns a Boolean value that tells you if the row satisfies the condition, you can provide a very generic function for any query people may want to run.
The condition function itself is the part that is specific to the query; you are very unlikely to use it in any other situation, it only makes sense in the context of that particular query, and it is also a very simple one liner. In our ad-hoc invented syntax (because details vary from language to language), this could look something like this:
function match_row(row):
    return row.columns[5] == row.columns[7] AND row.columns[3] == 10
Then if your querying function is called “query”, you would invoke it as:
query(match_row)
Nice, but to understand what this line does, you have to scroll up and look at the definition of match_row.
At some point, Python implemented a notion of lambda functions to deal with this. It's a special syntax that lets you specify a function with no name, and its main purpose is to be passed to higher order functions like the querying one above. You can even specify these functions inline as a part of the line that invokes the higher order function. It's a bit weird till you get used to the syntax, but it does allow for beautifully readable text – the context information is all there in the single line. Nowadays C++ also has this, and many other languages. The idea did not come from Python – it would have been some of the other pure functional languages – but that is what made it popular.
Syntax varies a lot from one language to another, but in our invented syntax, say if the function implementing the query as above was called query, you could write something like:
query (
    row -> row.columns[5] == row.columns[7] AND
           row.columns[3] == 10 )
It's very pretty, and another example of syntactic sugar: the compiler creates a function for you on the fly, gives it some gobbledygook name (like __lambda__0109), and it works just like the function match_row in the example above. It just looks pretty, because you don't need to name this function: its body is so simple, it becomes its name.
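In actual Python, the same pattern could look something like this (the query function and the little table are made up for the example):

def query(table, match_row):
    return [row for row in table if match_row(row)]

table = [
    {"name": "a", "size": 10, "colour": "red"},
    {"name": "b", "size": 20, "colour": "red"},
    {"name": "c", "size": 10, "colour": "blue"},
]

print(query(table, lambda row: row["size"] == 10 and row["colour"] == "red"))
# [{'name': 'a', 'size': 10, 'colour': 'red'}]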
Passing by Value or by Reference, Constant Arguments
We talked about this when discussing variables; it was hard not to, because from the point of view of a function body, arguments are variables.
I guess the best thing to do now would be to read that discussion again, it will make more sense.
In short, arguments can be passed to a function by value, in which case a copy of the value is created as a new variable for the function body; or by reference, in which case it is just a new name for a variable that already exists in the context from which the function is being invoked.
Passing by reference is generally more efficient – no copying of data happens, but it gives the function implementation access to a variable that belongs to a context that is beyond it. In other words, the function can change the values of variables that do not belong to it. It's another kind of side effect, mostly frowned upon, but hell, it can make things a lot faster. To help clarify the intent, many languages introduce the concept of a constant reference: a value is passed to a function by reference, but the function is not allowed to change it, just use it. So we get the efficiency without side effects. It is nice and kind to declare your arguments as constant references, unless you are passing them by value or you actually intend to change them. It makes it clearer what your function is all about.
Languages make choices about how much you can control this. In C++ you can control every aspect of it; in Python, none of it – Python passes references to objects everywhere, and whether a function can actually change the caller's data comes down to whether the object it was handed is mutable (the rules are simple and clear).
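A quick illustration of how that plays out in Python – whether the caller sees a change depends on whether the function mutates the object it was handed or just rebinds its own local name:

def append_one(xs):
    xs.append(1)     # mutates the caller's list

def rebind(xs):
    xs = [99]        # only rebinds the local name; the caller's list is untouched

data = []
append_one(data)
rebind(data)
print(data)          # [1]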
At the machine level, passing values by reference is implemented using pointers – what actually gets passed to a function when using by reference semantics is a pointer to the original variable. Pointers in turn are just integers, so when passing by reference, in the implementation on the machine, you are passing an integer by value; and there is still some copying of data happening, albeit as little as possible. All POD types are as fast to copy as pointers, or faster (because they are all encoded as a few bytes); and so, there is little point in passing POD types by reference, unless you actually want your function to change someone else’s variable.
Why would you ever want to change someone else's data? It seems rude.
The most common examples are situations when the data needs to be changed in some complicated ways, but the change is only one of the steps in a larger computation. So you pack this transformation away into a function to not burden the text of the larger computation with too much detail. If the data you are changing is not of a POD type, you don’t want to have to pay the price of copying it into and out of the function. Modern languages and compilers attempt to implement this sort of optimisation automatically; so sometimes they change your function to work with references even though you wrote it to work by value. They have to err on the side of caution though and detecting intent is not always simple, so they don’t always work and are certainly not guaranteed to work. Modern C++ lets you exercise a lot of control over this kind of optimisation directly, but to do that it introduces ever more complex constructs into the language … I am yet to be convinced that it’s worth it.
Subroutines, void type
In the early days, there was a very clear distinction between functions that return a value and those that only implement side effects, only do things. For example, what sensible value can be returned from a function that prints some text on the screen? Or one that changes the colour of a light?
The functions that do not return a value are called subroutines. There’s also something called co-routines, functions that depend on calling each other.
The term is barely ever used these days, mostly because we like to generalise abstractions, and this distinction feels useless when thinking about algorithms. You invoke a function to do something; if you don't care about the return value, just don't use it.
In C, the way you express this is by declaring the return type of a function to be “void”, it’s a keyword in the language. Later as we learned more about data types, everyone agreed that void is best considered a data type. It represents a place where the values we don’t care for go to die. There are similar concepts in some languages, like none or null; subtly different, they mean that we don’t know the value or a type of a variable, void means that we just don’t care, never will care, and can’t do anything about it even if we did care. It’s a nihilistic type.
These days, if you really must mention it, when you talk about a function that returns no values, you just say that it’s void.
Machine Level Implementation of Functions, Call Stack
Functions are, as I keep saying, the most important abstraction of them all in programming. But CPUs have no concept of them, to a CPU it’s all just some instructions one after another.
All this syntax and semantics surrounding functions has to be translated into that context by compilers.
There are some minor variations, but in general it always works along the same lines. I can’t explain this in a completely linear fashion, because what we are implementing is jumping between contexts – the calling context (the code that invokes a function) and the execution of the function itself. They are made as independent as possible on purpose. You may have to read this in a couple of loops.
First, we agree on the concept of a stack-frame. It is a package of data that contains all the bytes representing values or addresses (if passed by reference) of arguments passed to a function when it is invoked; and the address of the assembly instruction immediately after the one that invokes the function. That last bit will become clearer, I promise.
Now, a compiled function is a list of instructions stored at some address in RAM. This is the address of the first instruction in the body of the function, and we call it the pointer to the function.
We also agree on the following convention: when a function starts executing, it will find all the values of its arguments in a stack-frame at the top of the stack. The compiler will inject code at the start of the function body to unpack the stack frame and get the values of the arguments. Then it will place the instructions of your function body right after that, making sure that these instructions are generated in a way that assumes that the arguments are coming from the stack-frame. We now have a way to pass the arguments to the function and use them.
For the return value, if there is one, the compiler finds some space in RAM and puts the value there. There are two conventions about how to tell the invoking context where that is: the pointer to where the return value lives is placed either into register A (first of the general purpose registers), or pushed on the stack. We have a way to pass return value back to the calling context. This is the penultimate thing the function body code does.
The very last thing it does is to get the address of the next instruction in the invoking context and make the CPU jump there. (yes, this is another kind of branching, or flow-control – the flow of execution jumps about across instructions).
Cool, we can execute a function, it knows where to find its arguments, where to put the result, and how to transfer the flow back to where it was invoked from.
What happens in the calling context, where the function is invoked from?
First, the values that describe the calling context are pushed on the stack. This is how the state of the context is saved. Sometimes this data is called a context frame, mostly it’s also called stack frame. It would include values of all local variables, and if this context is also a body of the function (like it mostly is), then the context frame will also include the function’s stack frame. In short, it’s all the data that is pertinent for the current state of execution.
Then a stack frame for the function being invoked is created and pushed on the stack. Then a jump instruction happens, that makes the code jump to the first instruction of the function being invoked. The return address in the stack frame is the address of the first instruction after this jump instruction.
At this address, first there are instructions to recover the return value of the function from wherever the function put it. Then come the instructions to recover the context frame data from the stack. At this point, the state of the calling context is restored, and it also contains the return value of the function that was invoked, and the execution continues.
I know it is involved, but it is very generic and efficient. At a slightly higher level:
- The calling context pushes all the information about its own states onto the stack.
- The calling context pushes the data needed by the function onto the stack (stack frame).
- The execution jumps to the function body location in memory.
- The function body code recovers the values of its arguments from the stack.
- The function body code does its thing and stores the result somewhere.
- The function body puts the address of the result either on the stack or in a register.
- The execution jumps back to the calling context.
- The calling context gets the pointer to the return value.
- The calling context recovers its state from the stack.
- The calling context continues to do its thing.
Some languages generalise the notion of a stack-frame and call it closure; makes it separate from the fact that a stack is used.
The Call Stack
With this mechanism, as functions keep calling each other, all their relevant data is stored on top of each other on the stack. The stack will also contain all the local variables of the currently executing function (even for the data allocated on the heap, the addresses will be stored on the stack).
In other words, at any one point in time, all the information about the entire chain of functions from the very start of execution till that point in time is stored on the stack; stack frame on top of a stack frame on top of the stack frame. It can be hundreds deep.
This is called the call stack.
What’s great about it is that you can look at the stack at any point and reverse engineer everything that happened in the execution so far. It’s called walking the stack. In life, you end up doing this all the time to debug your code.
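Python will happily show you its own call stack at any point; a tiny example using the standard traceback module:

import traceback

def inner():
    traceback.print_stack()   # prints every frame from <module> down to here

def outer():
    inner()

outer()   # the output shows the chain: <module> -> outer -> inner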
Debugging
There are special applications called debuggers that are built to help with this (I mean, manually deciphering bytes of a stack frame is just too much). CPUs are built to support this, so are operating systems. There are special instructions built into CPUs to pause execution, there is code built into the operating system to then store the data from the stack and allow debuggers to inspect it. They also then provide functions that debuggers can call to control execution of your program completely – execute next instruction, execute all instructions up to a given one then stop (this is called a break-point), and so on.
Originally debuggers allowed you to step through the assembly code step by step and inspect the state of the memory, stack, registers at every step. That was already very helpful, but only if you know assembly and can jump between the context of your program and the machine implementation … it helped but it was slow and painful. These days, between compilers, operating systems, and debuggers we can do way better.
Firstly, while compiling your code, compilers also build symbol files – these map your "symbols" (the names of variables, classes, and functions) to memory addresses in the compiled code. Operating systems provide standard ways to build these files, and ways to map between actual addresses and the ones in the symbol file (because you don't know where your code or variables will actually end up in memory, the OS decides that on the fly). Debuggers use all this, plus the source code you wrote in the programming language, to hook it all up.
What you end up with is the ability to run your code in an application that looks like an editor and displays your code; and the ability to click on a line of your code and set up a breakpoint there. The code will then execute and just stop when it reaches the break point, and other windows will then show you the values of your variables, let you drill through them if they are of complex types and explore their constituent parts, and then step through code line by line, keeping track of what changes, till you spot something wrong. It will also show you the entire call stack in a different window, and you can click on any of the previous functions in the chain and inspect the state of variables there. It will let you run to another breakpoint and stop again, step-over function calls or step into them, sometimes even change the values of variables manually just to see what happens. You will be doing this to check if your code does what you hope it does, to find bugs in your code, and to understand how code written by other people works. You will learn more about programming by doing this than anything else. You should learn how to use debuggers as soon as you learn how to write your first function.
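If you are in Python, the quickest way to try this is the built-in breakpoint() call, which drops you into pdb, the standard command line debugger (the function here is just an example):

def total_price(prices, tax):
    subtotal = sum(prices)
    breakpoint()                  # execution pauses here with a (Pdb) prompt
    return subtotal * (1 + tax)

total_price([10, 20], 0.2)
# at the (Pdb) prompt:  p subtotal   -> prints 30
#                       n            -> execute the next line
#                       c            -> continue running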
The first debugger that allowed this much flexibility was called gdb, written for Unix and working on a command line. You could do all this lovely stuff, but all of it by issuing commands in a terminal window; there was no visual interface. It's still around and many (old) people use it. Then Microsoft built Visual Studio, a complete environment for writing C++ code that included a visual debugger, and it all changed. There are many development environments now that can do this, many of them actually interfacing with gdb to provide the debugging functionality for one or a few languages (Eclipse, IntelliJ, PyCharm …). There is also Visual Studio Code – it looks a lot like Visual Studio, and is written by Microsoft, but it is free, written from scratch, open source, supports any language you can think of through plugins, and runs on Linux as well as Windows. It's also faster and more responsive than most other ones. It's actually great.
To provide all this functionality, your code is first not optimised by compilers (more about this later), and secondly a lot of other stuff happens (symbols mapping and loading, reading the stack, reverse engineering the data on the stack etc); so running your code in a debugger is a lot slower than it would be if it was just out in the wild. A lot needs to be done to your code too to make all this work. It’s all done by compilers, but makes your code bigger and slower. So, for compiled languages, you will generally always build two versions of everything – one for releasing (release build) and one for debugging (debug build).
These methods of debugging are so important and well understood that even languages that are not fully compiled provide support for them, like Python does. Even though Python code is never compiled down to machine instructions with a machine-level call stack, the interpreter keeps its own call stack, so all of the above still works the same. So does Java, and almost every other programming language you come across. There are exceptions; they are mostly either evil, stupid, or so small that it's not worth implementing debuggers (e.g. various small DSLs).
It is always worth keeping in mind that the second most popular method of debugging is adding print statements to your code that output the values of variables you care for during execution. You don’t need a debugger for this and at least half the time it’s as fast or faster – because you know your code, you already know where to look for bugs.
You know where to look so well, that often you will print such information out anyway, just so that you can look at it if your code crashes once released, to try to figure out what went wrong. This is called … logging 🙂
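A minimal sketch with Python's standard logging module – the same print-the-values idea, but it stays in the code and can be switched on and off:

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def divide(a, b):
    logging.info("divide called with a=%s b=%s", a, b)   # the inputs you will want later
    return a / b

divide(10, 2)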
However – when someone calls you to tell you your code is wrong or crashing; none of this helps until you manage to recreate the problem they are seeing, so that you can step through code and see it going wrong. The first thing you always ask them has to be: gimme the inputs you used; or stay on the phone and keep telling me what you did till I make it go wrong myself. When you are implementing logging, this is what you are doing it for: when shit hits the fan, you hope that your logs will have enough information to re-create the problem.
Recursion and Stack Overflows
When it all becomes too deep, you run out of stack space to store all this; this is called stack overflow and your program crashes, there’s no way to recover from it. This happens, for example, when you write a recursive function that doesn’t terminate, but instead keeps calling itself over and over again. In fact, most of the time you see a stack overflow, that’s what it is.
Let’s (finally) play just a little with recursion.
The one example everyone always does is calculating the factorial of a number – the product of all positive integers up to the given one; the notation is '!'. So, 2! = 1*2, 3! = 1*2*3, 4! = 1*2*3*4 …
Now, there’s a pattern here:
For any integer N:
- If N is 1, N! is 1
- Otherwise, N! = N * (N-1)!
Yey, recursion!
A function to do this would look something like:
def factorial(N):
    if N == 1:
        return 1
    else:
        return N * factorial(N - 1)
Nice, simple, and pretty.
But it keeps calling itself for every step, and each invocation uses up stack space. Invoke this for a large enough number and you are guaranteed to see a stack overflow.
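As an aside, CPython guards against exactly this: it caps the depth of its own call stack and raises a RecursionError before the real machine stack overflows.

import sys
print(sys.getrecursionlimit())   # typically 1000
# factorial(100000) would raise: RecursionError: maximum recursion depth exceeded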
If you make a mistake, it could be even worse, imagine if you mistyped it as:
def factorial(N):
    if N == 11:
        return 1
    else:
        return N * factorial(N - 1)
Now invoke it as factorial(10). The numbers keep getting smaller and N is never 11. It will keep invoking itself, missing the termination condition, forever – that is, until the stack overflow happens.
Almost every time you see a stack overflow there’s an infinite recursion. It could be more convoluted, two functions calling each other, or a chain of several functions going in circles. Luckily, this is easy to spot in a debugger – look at the call stack, it will be obvious.
Exceptions
When you write code, you soon become aware that things can fail in ways that you can recover from. You can try something else, or ask users for better inputs, or just tell them it can’t be done, or even just write information about the error to the screen or a log file and die. You start writing code that checks for errors and handles them.
When things are simple, one way of doing this is to make your functions always return some sort of status code. Then when someone invokes your function, if status code is anything other than “OK”, they know it’s an error and can do something about it.
Alas, because call stacks get very deep, the level of context that knows how to handle an error may be 20 functions up the stack. Because functions get written by different people, delivered to you in different libraries, you can’t even insist that the error codes should be propagated up the call stack consistently.
Many languages implement a special mechanism for this situation, called exceptions. They are a feature of a programming language, built into it, and built into how the function invocation is handled.
An exception is an object, of some type (mostly this can be anything), that contains information about an error that was encountered. If you know what to do about an error that could happen during an execution of a function you are invoking, you place the function invocation within a “try-catch” block. Syntax differs between languages, but something like this:
try:
v = function_f(argument1, argument2, … )
catch excptn:
code to handle the error
the rest of your code
If function_f executes without errors, it just returns a value, which gets assigned to v, and then the execution continues from the line "the rest of your code". But if it encounters an error, instead of returning, it constructs an exception object and "throws" it. There would be a statement built into the language, "throw"; it works like "return" but is used for passing exceptions out of the function. You would write something like "throw exception_object". The function execution is done, much like when you return from it. But … upon return, the execution immediately jumps to the first line after the "catch" statement, and the value of the exception object is assigned to excptn. Then the "code to handle the error" is executed, and then the execution continues at "the rest of your code".
So far, it’s just like a special return value with a bit of switching built in. When it becomes interesting is when there’s no try-catch block around the function call.
When that happens, the exception is automatically thrown up to the next function in the call stack. If that function had the try-catch block around the invocation that led to the exception being thrown, it’s handled there. If not, the same thing happens again. The exception propagates up the call-stack until something handles it. If nothing does, the program crashes. This journey of an exception up the call stack is sometimes called stack unwinding.
It’s quite handy really, if something goes wrong, you just throw an exception and let whoever cares deal with it. If you care, you handle it, otherwise you just ignore the possibility that something that your code calls could throw an exception. Normally you can even do both – do what you can to handle errors, but throw them up the call stack anyway if you don’t know how to recover, maybe someone else does. Happy days. Many languages (like Python) will even keep track of the exception’s journey and by default print out the names of all the functions on the entire call stack up to the point where exception was thrown, and the line of code from which it was thrown. Super helpful for debugging. Debuggers can be made to stop when an exception is thrown – as long as you can recreate a problem someone is reporting with your code, you can use this to diagnose it and then fix it at great, even legendary, speed.
This technique came to prominence with C++, but everyone does it now. There are entire hierarchies of exception types provided in languages that are meant to standardise types of errors that happen: you may know how to handle an IO error, but not a floating point overflow, so you handle some, re-throw the others. People better than me learn these in detail and work hard to throw the appropriate exception type, hoping that people who write functions to handle them will learn them in detail too and write code to handle every error appropriately. Life is too short. Almost always, an error is just an error, print information about it and either try again or give up, nothing much more that you can do. For when you really, really care, read what documentation you can find about the function you are using; people of strong moral fiber list what exceptions their functions can possibly throw and when.
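In Python, for example, the whole dance looks like this – you throw with raise, catch the error types you know what to do about, and let everything else keep unwinding the call stack (parse_age is made up for the example):

def parse_age(text):
    value = int(text)     # raises ValueError if text is not a number
    if value < 0:
        raise ValueError(f"age cannot be negative: {value}")
    return value

try:
    age = parse_age("-3")
except ValueError as err:     # the exception object lands in err
    print("bad input:", err)
    age = 0
# any other exception type is not caught here; it keeps propagating up the call stack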
Flow Control, goto Statements, Threads
We used this metaphor before, this is just to consolidate it a bit and discuss a few related bits and pieces.
The code execution by a CPU is generally quite linear, from one instruction to the next, as they are laid out in memory. That is, however, very boring. To do interesting things, like branching, looping, exceptions, or function invocation, the flow of execution needs to be able to sometimes jump around; instead of executing the next instruction, hop to somewhere else in memory and continue execution from there. All these things are called flow control constructs.
goto
There is one we haven't mentioned yet: goto statements. Early languages all provided them. They are the direct equivalent of CPU jump instructions. You can't know the address of the instruction the compiler will generate for the line you want to jump to, so these languages let you add labels to the code – just names for lines that you can jump to using goto statements. You can still do this in most languages, but it becomes really woolly really fast, because the structure of the code is never linear – how do you jump across functions that in your mind are just independent abstractions floating in space, not lists of instructions one after another? The term spaghetti code was first coined to describe the overuse of goto statements: jump here, jump there, back, forth … quickly the text of your program becomes too hard to follow.
If you use them, you risk being immediately ostracised and ridiculed. Why are they even there then?
There are situations where they make code nicer. Say you have a long function that does a lot of processing that can fail in many places, but from which you can recover easily and in the same way, and at the same time you care a lot about speed. For example, imagine an event loop that handles a large number of events. Even though it makes the function long, it really helps to have a single large sequence of case statements (or else-ifs), one for each type of event. Even if all they do is dispatch handling to other functions, you now have a good starting point that lists all your use cases and from which you can follow the thread of execution for each while reading the code or debugging. If your handling of an event fails, though, you typically don't want to just stop execution and leave all the other events that may arrive hanging. So your error handling is going to be simple – record the problem, move on to the next event. Now, event handling often needs to be real time, which is to say microseconds count; and the exception mechanism, pretty as it is, is actually quite slow (unwind the stack, pass data around, unpack data, handle it …). What to do? Stick a label somewhere in your code and when you see an error, jump there, handle it, and continue. It's the fastest way to do it.
Having said all that, I only ever did this twice in my entire career so far.
Execution Pipelines
The model of execution where instructions get executed in order one after another opens a door for a cool hardware optimisation in internal CPU architecture – instruction pipelines.
CPUs waste a lot of time while waiting for the next instruction to be fetched from RAM, and even from level 1 cache, and delivered to a core for execution. Cores are insanely fast, faster than all that. So, if you know what instruction will be executing after the one the core is currently dealing with, you could go and pre-fetch it at the same time, so that it’s there and ready, right next to the logic gates when needed. With a bit of effort, you can even build small pipelines to prefetch some more, like a queue – the core keeps consuming them, you keep pre-fetching them. You try to keep the pipeline full, and all is as fast as can be.
Flow control statements make this real hard. You don’t have much time to spend figuring out what instruction you should fetch, gotta be fast, and if there is a flow control change happening … oh well. CPUs try to guess still, but mostly they just give up, clear the pipeline and start again when they know what’s happening.
Not really visible to you, but when people start talking about the highest levels of code optimisation, they take this into account. A common technique to help pipelining is loop unrolling (you will also hear it called loop unwinding). Instead of writing a loop that does something a number of times, you copy-paste the code that number of times. Ugly, but linear, no jumps, happy pipelines. If your loops do many repetitions, you only unroll them partially – you repeat your code five times inside the loop, and divide the number of loop repetitions by five.
Some compilers can do this automatically as a part of code optimisation, some languages provide ways to wrap this into some higher level constructs. It’s never pretty, but it does help when you really have to care about speed that much. It is often done in libraries that deal with matrix algebra (used a lot in machine learning and model calibration and pricing in finance).
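A rough sketch of partial unrolling (process and data are made-up names, and we assume the number of repetitions divides evenly by five):
/* the plain loop: one jump per repetition */
for (int i = 0; i < n; ++i) process(data[i]);

/* partially unrolled, five at a time: fewer jumps, happier pipelines */
for (int i = 0; i < n; i += 5) {
    process(data[i]);
    process(data[i + 1]);
    process(data[i + 2]);
    process(data[i + 3]);
    process(data[i + 4]);
}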
Multi-threading
It is a fact of life that CPUs spend most of their time idling, executing NOP instructions (meaning no-op, do nothing). This happens so much that on some architectures the opcode for the NOP instruction is all zeros, because that means that when passing the instruction around, all the bits on the bus are at low power; it saves a lot of electricity.
On the other hand, this abstraction of execution happening along a thread of instructions is a nice one.
What if, instead of one thread of execution, we had many? Then we can switch between them and execute them in turn. Because most of the time each thread is idling anyway (we are talking in nanosecond steps here), this would give a really good illusion of the threads being executed simultaneously. Yeah, there is a cost to switching threads, the context changes all the time – but if you consider each thread a function, then this just means moving the instruction pointer to the right place and moving the stack pointer to the right place (each thread keeps track of its own area of the stack). There’s a little bit more, but it is fairly quick to do.
This is how multithreading works.
Operating systems do this all the time – this is how they switch between processes (programs); but they all provide means for programming languages to create their own threads on the fly.
The syntax varies wildly between languages, but in general you write a function that is to be executed on a separate thread, and then pass that function to a special function or a language construct provided for you and it just happens.
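In C++, for example, it looks roughly like this (std::thread is part of the standard library):
#include <thread>

void do_work() {
    // whatever runs here, runs on its own thread
}

int main() {
    std::thread worker(do_work);   // kicks off the new thread
    // ... the main thread carries on with its own business here ...
    worker.join();                 // wait for the worker to finish before we exit
}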
Why would you want to do this?
Most often it is about GUI responsiveness. User clicks a button, you have to perform a very long computation, or have to send a message to a remote server and wait for a response. If nothing happens on the screen, users get bored within a second, antsy within three, by the tenth second they start clicking randomly (real numbers, people research this). You got to keep them entertained. So you kick off a thread to do the work (worker thread), and on your main thread you draw hourglasses, animations, status updates, or something, while waiting for the worker thread to finish. Or you even let them continue with whatever they want to do, and then interrupt and update their screen when the worker thread is done.
Another use case is when you know you have a multi-core CPU to work with (these days you mostly do). Then you can actually parallelise complicated applications: when there are multiple cores, operating systems will distribute threads across them.
It becomes nasty when multiple threads have to write and read the same data during their runs. Because you have no control over when the execution switches from one thread to another, it could happen at the time when one thread is halfway through reading data, and another is halfway through writing it. If you don’t do anything about it, you end up reading some weird mess.
What you do to deal with this is block the execution of all other threads accessing the data while you are working with it, then unblock them (you mark your code as a “critical section”). Operating systems provide functions to do this, and there can be many such blocks to block different parts of execution. But if you are not careful, you can end up in situations where two threads are each waiting on the other to remove a block – a deadlock. It is too easy to do this.
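A minimal sketch of a critical section in C++ (std::mutex and std::lock_guard are standard library types):
#include <mutex>
#include <vector>

std::vector<int> shared_data;   // data that several threads want to touch
std::mutex data_guard;          // the block that protects it

void add_value(int v) {
    std::lock_guard<std::mutex> lock(data_guard);   // blocks other threads that want the same guard
    shared_data.push_back(v);
}                                                   // the guard is released automatically here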
There is a whole lot of detail, terminology, and standard techniques; if you write multithreaded code, you have to read up on it first. It’s considered a specialisation almost, a secret skill – it isn’t, it’s just boring. Google “thread synchronisation”, read the first few blogs quickly, and you will know enough to talk about it to anyone (you should do this).
When coding with threads, the best strategy is to just not share data like that, ever. Pass all the data that is needed to your thread function as constants, then leave the thread alone till it’s done. Make the worker thread function write its outputs somewhere safe of its own, and only read them once the function is done. If you can. If you can’t, read up and test a lot.
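A sketch of that share-nothing strategy: the worker gets its own copy of the inputs and its own result slot, and nobody looks at the result before the join:
#include <thread>
#include <vector>

// hypothetical worker: its own copy of the input, its own output slot
void sum_part(std::vector<int> input, long* result) {
    long total = 0;
    for (int v : input) total += v;
    *result = total;   // only this thread ever writes here
}

int main() {
    std::vector<int> data = {1, 2, 3, 4, 5, 6};
    long result = 0;
    std::thread worker(sum_part, data, &result);   // data is copied in, nothing is shared
    // ... the main thread does other things here ...
    worker.join();                                 // only after this do we read result
    return result == 21 ? 0 : 1;
}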
Or even better, don’t use threads. Restructure your code so that you can use processes instead. Chrome does this. Instead of spinning up threads, you spin up an entire new program. With modern operating systems it’s almost as fast. Different programs can’t share memory space (well, they can, but it’s really hard), and there are different ways to pass data between them that are much safer. Unless you are dying for speed, just use files – write inputs and outputs to files and pass those around.
Basically, use threads only when it’s simple, when it makes things beautiful, or as a last resort. They are a way to fix imperfections in our implementations of computers, and that very rarely maps into things you actually want to do.
Object Oriented Programming
The object oriented paradigm, in its strongest form, is the view that the best way to produce clear analyses of problems – analyses that can be implemented efficiently – is to rely solely on the categorisation of data into types; which somehow always arises through an innate common understanding of the true nature of the problem at hand.
It is wrong in many ways, but it did help provide a slew of useful terminology, techniques, and tools for programming computers.
I spent a lot of time thinking about how to write this section without relying on examples, and still make it clear. I don’t think I can. The ideas, even the core principles, are only half baked; the concepts themselves are not clear or precise. Object oriented programming is as much about the management and delivery of large projects as it is about the beauty of expressing abstract thought. The two are often opposites, and this confusion of intent makes a mess of the domain. It is however in the focus and at the core of most conversations about code, either explicitly or because there is a popular assumption that this is the one true way to think, write, and talk about code. We have to cover it in great detail, or you simply won’t be able to understand what people are talking about, or think they are talking about; as is most often the case.
First Steps
It developed organically. Many of the early programming languages were tightly focused on processing data; most famously COBOL. In such domains, it quickly became clear that grouping individual bits of information in various ways into “records” helps. In a business system, we would consider an employee, not only by their name, but we would think of them as an object, with a name, date of birth, years of service, specialisations, an address, desk location and so on. When dealing with payrolls or holiday allowances, it was easier to deal with collections that would refer to all these parameters all at once. Other languages soon adopted this view and then in C you had a dead simple way to create structures – they call it a “struct”, and it is just a named collection of some named variables. Something like this (syntax is close to C, C doesn’t have a string type, but I don’t want to distract you with details):
struct Employee {
    string name;
    string surname;
    int day_of_birth;
    int month_of_birth;
    int year_of_birth;
    int house_number;
    string street_name;
    string town;
    string job_function;
};
You can now declare variables of type Employee, Employee e. The component variables are called members, and you can access them simply, using a dot: e.name, e.surname …
In memory, these are laid out as a sequence of bytes, the bytes for each member one after another. This makes them easy to use, copy, and push on the stack. Great. In code, copying is also simple, e1 = e2. They look, work, and feel just like any other data type. We have a way to define our own data types at will.
Composing Types
It is also a consistent framework, a data type is a data type, so we can start building more and more abstractions:
struct FullName {
    string name;
    string surname;
};

struct DOB {
    int day_of_birth;
    int month_of_birth;
    int year_of_birth;
};

struct Address {
    int house_number;
    string street_name;
    string town;
};

struct Employee {
    FullName name;
    DOB date_of_birth;
    Address address;
    string job_function;
};
Now we’re talking. I mean literally: we are writing descriptions of things bursting with meaning within the context, we are talking about the concepts in the domain, not about the implementation of them in the machine.
Enumerations
We can now refine it easily. job_function being a string is annoying – it can’t be just anything, there is a limited set of valid ones.
C adds another way to define new types, for just that occasion, enums. When there is only a limited set of meaningful values within a type, you simply list them and that’s a new type:
enum JobFunction {
    DaBoss,
    ManagerOfManagers,
    Manager,
    Minion
};
Compilers translate this to numbers: DaBoss is 0, ManagerOfManagers 1, and so on (you can also set the values explicitly if you want to). Small and easy to move around. The fact that these can only take a certain set of values only matters in the program source code, and compilers can easily check that and report it back to you. We now have:
struct Employee {
    FullName name;
    DOB date_of_birth;
    Address address;
    JobFunction job_function;
};
We can create other structures that use FullName or DOB, we can create lists of employees … anything we like. We can talk about things, free of the details of the machinery. Omg, omg, omg, we so abstract now. We so abstract we can tell stories.
Self-referential Types
C added another little feature that freed us even more, the structures can be self-referential, look:
struct ListNode {
    string data;
    ListNode* next;
};
* in C means “pointer”, so next is a pointer to another ListNode. We just implemented linked lists of strings. Commoooon, this is a playground now. (btw, C does not have a type string, this would not quite compile, but we don’t want to nit-pick here).
There are annoying limitations, but they are not huge. In a self-referential type, you have to refer to the self-type through a pointer. And this is only because of how computers work: once you create an instance of this, you have to allocate memory for it; to do that, you have to know how big it is, but if it contained itself directly, it would be infinite in size. A pointer, on the other hand, is always just a pointer, always the same size (generally, the width of the address bus). The fact that it’s a pointer to ListNode and not a pointer to Address doesn’t change that – that part only matters while the compiler is checking if your code is confusing types; at the machine level all type information is gone. You get used to it, and there is an interesting consequence that opens many doors.
Casting, void*
You can force the compiler to convert between types. It’s called casting. Changes nothing at the machine code level, but it tells whoever is reading your code that from that point on, the same bit pattern is to be interpreted as something else. It’s a brutal way of doing things, later languages added a lot of safeguarding around it (to try and protect you from foolishly casting things into each other when reinterpretation will not work, things are too different), but in C it’s just raw. It is particularly useful when it is used on pointers. C added a special kind of pointer because of this, void* . It means “a pointer to something”. I hear you wonder: how is this useful, it removes information about my beautiful abstractions and reduces them to some numbers? Watch.
struct ListNode {
    void* data;
    ListNode* next;
};
What I have now is a complete abstraction of a linked list. When I am putting the nodes for my list together, I can make data be of any type I like. I can write functions that implement algorithms, like searches and traversals, and whatnot, that work on any type of linked list.
There is a whole lot of annoying details in C syntax that make code that uses these things ugly; for example, to use data in the node as a string, you’d most likely end up writing something like (char*)(node->data); to access members via a pointer variable you have to use the weird arrow instead of dot, and C doesn’t have a real string type, a string in C is an array of characters, you talk about it by talking about the pointer to the first character in the array. There is more, a lot more – C was really written as a better assembly, it focuses on the machine a lot. It’s a low level language, it’s close to the metal (in geekspeak).
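As a sketch, with the same not-quite-compiling caveats as before, a traversal that treats every node’s data as a string ends up looking something like this in C:
#include <stdio.h>

struct ListNode {
    void* data;
    struct ListNode* next;   /* proper C needs the struct keyword here */
};

void print_all(struct ListNode* node) {
    while (node != NULL) {
        printf("%s\n", (char*)node->data);   /* cast back from void* to what we know it really is */
        node = node->next;
    }
}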
But look, even there, and even with these crude and simple implementations of data abstractions, the floodgates were open: we started looking at the structure of data as a crucial part of modelling the world with abstractions. We talked in a new language, and soon enough we developed a whole new terminology and understanding of this type of thinking.
Interfaces
We looked at these abstract data types, and first we realised that with each of them there will be a set of operations that define what they are far better than how we store them.
With a list, we know that it is easy to add things to the end, and to insert things in between two elements; that it is always easy to split it into a head and the tail, and that the tail is also a list, same in all respects other than length; that you can traverse it by starting from the front and moving one step at the time till you reach the end. When you are using a list, this is all you actually care about. You can choose to implement it in other ways, as long as those operations are possible (for example, you can implement a list as an array, with “next” being an index of the element in the array where the next element is stored – some old languages did this).
What we would normally do is write these operations as functions; with an implementation of a list, you would get functions like list_append, list_insert, list_head, list_tail, list_traverse. You would use this without ever caring how the List type is actually implemented.
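In C, such an interface is just a set of function declarations; roughly, with made-up signatures to match the names above:
/* the operations describe the list far better than its storage does */
struct ListNode* list_append(struct ListNode* list, void* data);
struct ListNode* list_insert(struct ListNode* after, void* data);
struct ListNode* list_head(struct ListNode* list);
struct ListNode* list_tail(struct ListNode* list);
void list_traverse(struct ListNode* list, void (*visit)(void* data));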
Encapsulation, again
Eventually we took a real strong view on this and decided that you actually shouldn’t even know this. Because if you do, you’ll start using this information, and then if anyone ever wanted to improve the implementation details, they can’t because of you. Besides, why burden you with unnecessary knowledge, just let the people whose job it is to implement lists do their job; and you care about your abstractions. We called this idea that implementation details should be hidden from users: encapsulation.
It is a decent stylistic guideline; in the world where you import and buy implementations of things from others, it is best to not be exposed to changes they make in the next version, of course. It is however, like so much else in the history of programming, turned into an unquestionable commandment because it helps manage teams of programmers without having to understand their work.
Anyway.
We also gave a name to the set of functions that let us use objects of a particular type: interface.
Polymorphism
Now we got ambitious. We realised that there are aspects to things. A person is somebody’s child, can be somebody’s parent, a partner, a lover, somebody’s destiny, somebody’s friend, or a boss, or a minion … many of these all at once. Depending on the sub-context in which we are considering them, only some of these aspects matter. We ended up often providing multiple interfaces for the same objects; allowing them to morph between data types that those interfaces define, to suit our needs. We call this polymorphism.
Inheritance
After that, we looked deeper and realised that while data types categorise the world into sets, often these are related – some sets are subsets of others. A circle and a square are distinct, but they are both shapes; a list and a tree too, but they are both containers of data. We could decide that a shape is something that has width and height and a position in a coordinate system; a circle is then a shape for which all points on the perimeter are the same distance from the point that specifies the position – in addition to all the properties of the shape, it also has a radius. A triangle is also a shape, but also has three lengths. A circle and a triangle inherit all the properties of a shape, and some more. We can use this to write algorithms that work on any shape, and then some that only work for circles or triangles. This sort of relationship is all over the place. We call it inheritance.
“Fundamental concepts”
And so, we came up with four “fundamental concepts” of object oriented programming: data abstraction, encapsulation, polymorphism, and inheritance.
If “fundamental concepts” sounds fundamentalist, it’s because it is. These are important notions, they help communication a lot, they provide guidance about style, and help avoid some common pitfalls when writing code that will be used by many other people. They speed up delivery and can help manage projects; in large groups and on your own. A clear definition and categorisation of concepts within a context, divorced from implementation details will often do that. It’s great, it really is, a whole new language to talk about abstract concepts and their implementation and relationships. But it is not a religion, nor does it always work the best. Only saying this because the view that “all you gotta do is define your interfaces to look like the real world and the code writes itself” is so prominent it hurts. Yeah right, because what the world looks like is objective, and algorithms are easy, and imagination plays no part in it all, and just learn these four simple rules properly and off we go … sure mate, sure; get me some of what you’re having.
Anyway, again.
A lot of work was done to figure all this out, a few languages were written to provide ways to do various parts of this and still keep code efficient (it is not easy translating things this abstract to efficient machine instructions).
Then Bjarne Stroustrup borrowed from other languages, thought a lot, and started adding features to C to support these concepts directly. He called it C++. By that point, C was by far the most commonly used programming language around. C++ was a huge success, it changed everything. It was so influential that most of the terminology about the technicalities of implementing OO code comes from C++. Many languages do it now, and they all follow the terminology, and patterns, introduced by C++. For a while now, C++ has been refined and updated by an international committee, with Bjarne still deeply involved; they publish C++ standards, definitions of everything in various versions of it, that others then implement in compilers.
We’ll spend a little bit of time on how C++ does it. There is special syntax for most of this, we won’t bother with that, that’s learning C++, we want to learn ideas here.
Member Functions and this pointer
Firstly, it adds the ability to have functions as members of structures. That way, all the functions that constitute an interface to a type are associated directly with the type. So our list would now look something like:
struct ListNode {
    void* data;
    ListNode* next;
    void insert_next_node(void* new_data);
    void remove();
    bool is_last();
};
Don’t worry about the choice of the functions here, just a refined version of what we were talking about.
The functions associated with the type like this are called member functions, or methods.
Once you create something of type ListNode, an instance of it, you can now access the methods directly:
ListNode node;
node.insert_next_node(some_data);
When you are implementing “insert_next_node”, it’s just a function, but it is always passed a secret argument: the “this” pointer. It is a pointer to the instance of the object that we are invoking the method on. Now, even though it is not a part of the function’s signature, within the function you can use it to access the object’s data, e.g. this->data, this->next. In C++ you can in fact drop the mention of it, just type data or next.
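A tiny sketch of what that looks like from inside an implementation:
bool ListNode::is_last() {
    return this->next == NULL;   // or simply: return next == NULL;
}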
Many languages replicate this behaviour of this (hahaha). Python has the same concept, and calls it, more philosophically, self; but there you have to be explicit about it – have to mention it in the function declaration, have to use it explicitly.
Bjarne is a pedantic man, and he also thought about constness – if your variable is constant, then the methods you invoke should not change it. But some methods might want to. Back in the day it was hard to implement compilers that can detect this automatically; besides, he likes stating the intent very explicitly in the code, does Bjarne, so now you can declare a member function to be constant. It means that it can work with the object’s data at will, but will not attempt to change any of it. These methods can be invoked on constant variables, others can’t.
For example:
struct ListNode {
    void* data;
    ListNode* next;
    void insert_next_node(void* new_data);
    void remove();
    bool is_last() const;
};
Insertion and removals change data, but checking if the node is the last one doesn’t, so.
Aliasing
Look at this code – it talks about a list node, but as soon as we get hold of the first node, we actually have the entire list (we can access any element in it by following the next pointer). Further, any node we take is also the first element of a list (the one that starts with it). ListNode is also a List. It’s a trivial type of polymorphism (we’ll talk about it more later), but in C++ you can easily represent this by using typedefs; a typedef is just an alias for another type.
typedef ListNode List;
(in newer versions of C++ the keyword is “using”, so using List = ListNode).
So we can write code that talks about lists and not care that it’s implemented the same way as a ListNode.
Static Members
Another thing is, there will always be functions associated with the type that do not refer to a particular instance, but are a part of the type interface. For example, it would be very handy to have a function that takes an array of data objects and builds a list out of them. Such a function needs to have intimate knowledge of the implementation of the list, and is very closely tied to the idea of a list, but it doesn’t operate on a particular node instance.
For example:
struct ListNode {
    void* data;
    ListNode* next;
    void insert_next_node(void* new_data);
    void remove();
    bool is_last() const;
    static List* create_from_array(void* data_array[]);
};
Don’t worry about the ugly syntax for passing arrays, there are better ways to do this now, but we’ll get there. The key point is that create_from_array is a part of the List interface, but doesn’t need the this pointer, and doesn’t get it either.
All languages that support OO programming provide some variant of static member functions.
Classes, Access Control
Here Bjarne looked at what he did and he wasn’t happy. Ok, we have a way to represent a data type properly now, with functions and all; and if we write a good set of methods and document it all nicely, people can just use those and never touch the data members. He even came up with the concept of accessors, functions whose only purpose is to let you read the data that is relevant to the abstraction; so you don’t have to access the data directly. You can even have two versions of each, for constant and non-constant access.
struct ListNode {
    void* data;
    ListNode* next;
    void insert_next_node(void* new_data);
    void remove();
    bool is_last() const;
    void* get_data();
    ListNode* get_head();
    List* get_tail();
    void* get_data() const;
    ListNode* get_head() const;
    List* get_tail() const;
    static List* create_from_array(void* data_array[]);
};
Better, it’s all there. But Bjarne, even when young, knew that people are not good. They will not read the documentation, they will not consider the big picture, not care about the purity of the interface based design. He had to do more to stop them from sinning.
He himself was good, and he could not start changing the behaviour of struct – he wanted to be compatible with C; he was not inventing a new language, humbly, he was just improving the best language there was.
He then added classes to the language.
That turned out to be such a cool move that “class” became a synonym for data type; in fact most people talk about classes rather than data types, in all languages, programming and human. An object is most often defined as an instance of a class; rather than the philosophical concept that it represents; properties of an object (again, a term from philosophy) are described as data members of a class. A class itself is described as an implementation of objects, rather than what philosophers call sets of objects. It is hard to tell whether the ideas around OO programming came first or if they are essentially a description of how C++ does it; either way, before C++ it was neither popular nor practical.
It’s a good thing. This move made metaphysics real. Yeah, most people have no clue about the millennia of thinking that got us here; but then, only philosophers think that the world in which everyone is a philosopher would be anything but boring.
A class is like a struct, except that all members are private by default. Private means that no functions outside of the class members can use them. To make things accessible to the outside world, you declare them as public. Like with everything in C++, you can be explicit about privacy.
class List {
private:
    void* data;
    List* next;
public:
    using ListNode = List;
    void insert_next_node(void* new_data);
    void remove();
    bool is_last() const;
    void* get_data();
    ListNode* get_head();
    List* get_tail();
    void* get_data() const;
    ListNode* get_head() const;
    List* get_tail() const;
    static List* create_from_array(void* data_array[]);
};
Notice that we now renamed this class to List, and made ListNode an alias for it. This improves encapsulation a bit – to users, this interface represents a list; the fact that a single node shares its implementation with the list is irrelevant to them.
Now, once we implement this class, all the user needs to know is the content of the public section of the class definition. We have encapsulation going strong.
Construction, destruction, and copying
This is nice, thought Bjarne, the ability to stop people from doing bad things to his abstraction pleased him immensely. The explicitness of it felt good too; it is kind to let people see why you are restricting what they can do, and what it is that you are restricting exactly. Not all languages are this understanding.
But there was a problem, and he knew this, his work was nowhere near done.
Classes define types, they are meant to be instantiated, values and variables of that type will be created, and then when they go out of scope, they will be destroyed.
When you create a value of any type, you want to initialise it: allocate any resources that are needed and, more often than not, set the initial value to something meaningful. How exactly you do this depends on the type, but mostly there are only a few ways per type that you care for. For our List, every node needs a little bit of memory for the two pointers, and then it also needs memory for the data object, something allocated on the heap that it will hold on to as long as it lives. It is also very useful to initialise the pointer to the next node to either an actual node, or to null – that way we know that this is either the end of the list (null) or a valid node (not null). There isn’t much more you can come up with really. You can write a couple of functions that implement this behaviour; easy. But! The data these operate on is private, ouch. Ok, make them member functions? No can do, member functions rely on the this pointer, but these are invoked before the object exists, so the this pointer doesn’t even make sense yet. You could write them as static members, that would work – then, whenever you need one of these objects, you invoke one of those static functions. Sometimes we do this, and we call these functions factory methods.
When your values go out of scope, if you are a half-decent person, you have to release the memory and whatever other resources you acquired at the start. Here you do have a this pointer, so just write a function to release stuff and tell people that they must invoke this function just before they stop caring about the value.
Bjarne’s heart sank. He knew that people will not listen. They will forget, or they will change their code and extend the scope of their variables, but forget to move the function call that gives back what they took from the system; or they simply won’t bother, coz they ain’t caring so much. Worse than that – Kernighan and Ritchie who invented C were still working in the same building, and he was just about to tell everyone that he made it even better, and then that they have to keep remembering these weird functions that had nothing to do with their abstractions. He would be laughed at and ridiculed, and rightly so. Bjarne liked computers and did not like being laughed at.
Then he realised – in his compiler code, he always knows when a value is being created, and he already tracks when values go out of scope (so that he can report errors about invalid variable access to programmers). His compiler always knows when these magic functions should be invoked. Not only that, a lot of the time he knows how to initialise member variables sensibly, numbers to zero, booleans to false, pointers to null, etc. For everything else, it is not unreasonable to expect people who are creating new types to write functions that implement such an important part of the type’s behaviour; just that, he can do everything else.
He created constructors and destructors.
They are functions with special purpose and special use. Constructors are functions that are members of a class, have no return type (not even void), and have the same name as the class. Then even when you invoke them explicitly, your code just mentions the class, no new names are added to the context. Mostly they are just invoked automatically when you create a variable or a value. Destructors look almost like constructors when you declare them, except that they have ~ in front of the name, like ~List, and they take no arguments ever. You never need to invoke them directly, C++ will do that for you. The ~ was cheeky, it’s used in C to denote bitwise NOT, so it’s like saying “this is the function after which List is not anymore”, it takes an object from being to non-being. Bjarne squeezed metaphysics in and never told anyone.
class List {
private:
    void* data;
    List* next;
public:
    using ListNode = List;
    List();
    List(void* data_array[]);
    ~List();
    void insert_next_node(void* new_data);
    void remove();
    bool is_last() const;
    void* get_data();
    ListNode* get_head();
    List* get_tail();
    void* get_data() const;
    ListNode* get_head() const;
    List* get_tail() const;
};
We don’t need the static member to create things from an array any more, we have a constructor for that now.
Almost there… the last bit of it is dealing with heap allocations. In C, to allocate memory on the heap you use a function that comes as standard, but it is not exactly a part of the language, it’s just a function someone wrote; and there are a few variations of those around (malloc, calloc, realloc…). It works with void pointers that you then cast to what type you want it to be. Then to release the memory you call another function (free). If you do that in C++, you would then still have to deal with constructors and destructors yourself. Also, these functions are famously messy and annoying, you have to tell them how much memory you need, which means you have to think about what your data looks like in RAM … just breaks all your abstractions. But with classes … it’s all there, the compiler can calculate how much memory it would take just by reading your code.
Bjarne implemented special keywords in the language to do all that, and to invoke constructors and destructors correctly too. He called them new and delete. This was a big deal. A whole new level of freedom of abstractions from the shackles of machine architecture.
It looks pretty good too:
List* l = new List();
delete l;
Later we even figured out how to use classes to get rid of the need to type delete; we’ll get there.
Bjarne was well pleased with himself, and rightly so.
He was on top of this now, he added a little bit more detail to terminology, and squeezed every last drop of deductions within the compiler he could see.
There is a special kind of constructor called the copy constructor; it’s used when copying data between variables of the same type, worth considering because often you can optimise it or do other smart stuff there. There are also special functions to give you the same flexibility when assigning values to variables (not the same as construction, because when assigning, the target object already exists).
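For our List, the declarations would look something like this (just a sketch):
class List {
public:
    List();                              // default constructor
    List(const List& other);             // copy constructor: a new List built from an existing one
    List& operator=(const List& other);  // copy assignment: an existing List is given a new value
    ~List();
};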
People talk about copy-semantics, this is it. How objects get copied can make a difference between usable and unusable code. It is insidious – you pass an object to a function, and if you are not careful data starts getting copied like crazy, you start hitting RAM, it all starts crawling, and all you wanted to do is add two numbers together. Languages provide their own defaults for this; C++ will by default copy entire lists and vectors, Python will just copy references to the same objects. In C++, everyone gets their own copy of data, they can change it and not affect others, but it makes moving things around a lot slower; you can implement a different behaviour if you like. In Python, moving objects around is a lot faster, because everyone just points to the same object. It’s quick, but if anyone changes it, everyone else is affected; side effects galore. In Python, if you actually want to create your own instance of the entire thing, it’s called a deep copy, you have to either do it yourself by traversing and recreating everything, or rely on library functions that do it for you.
Lots of nuance, but at the end the core idea is – if you are implementing a type, you have to be able to describe how objects are to come into being and go out of it, and how they move between contexts. There will always be a default behaviour, it often works, you just need to learn all about it before you start using a language.
Inheritance
Young Bjarne was on a roll now. He looked at this and realised that even with his typedef, people using his code still see too much of a list in a list node, and too much of a node in the list. The first thing he saw was that there are actually two interfaces here, all mixed together, one for the list, one for the node; and that the one for the list adds methods to the one for the node. There was more to it, but this was poking him in the eye. It took a while to figure it all out, but first he thought about this relationship … at least in this implementation, a List is also a Node. This bit is implementation specific, it’s only so because that’s how he chose to implement the list abstraction. So he parked this for a bit and started thinking about relationships between classes.
Maybe he thought of shapes, he probably did, they are simple and we know a lot about them. Maybe he did something like this:
class Point {
public:
    float x();
    float y();
};

class Shape {
public:
    Point centre();
    float width();
    float height();
};

class Circle {
public:
    Point centre();
    float width();
    float height();
    float radius();
};

class Square {
public:
    Point centre();
    float width();
    float height();
    float side_length();
};

class Triangle {
public:
    Point centre();
    float width();
    float height();
    float side_1();
    float side_2();
    float side_3();
};
He didn’t bother with the implementation details, constructors, destructors, or any other methods that may be useful in real life. He was looking at pure abstractions.
There’s patterns here. A shape has a centre, a width, and a height. It contains them. We still call this containment and it models a has-a relationship. Easy.
Now, a circle, a square, and a triangle are all shapes. This is a deeper relationship. It’s an is-a relationship. All of these have the same methods that the shape has, and then some more that are specific to them. This can be made explicit, it is also a very generic thing, it’s all over the place.
He called this inheritance and came up with a notation for it:
class Point {
public:
    float x();
    float y();
};

class Shape {
public:
    Point centre();
    float width();
    float height();
};

class Circle : public Shape {
public:
    float radius();
};

class Square : public Shape {
public:
    float side_length();
};

class Triangle : public Shape {
public:
    float side_1();
    float side_2();
    float side_3();
};
Shape is a base class for circle, triangle, and square; or circle, triangle, and square are derived from shape – same thing, just depends on your point of view.
The “public” in front of the base class … that was Bjarne being very pedantic; derived classes inherit everything from base classes, data too. So you want to be able to control which parts of the base class derived class methods can access. For this he also had to introduce a “protected” level of access (methods and data that can be accessed by derived classes but nothing else); and then to have full and complete control, you can specify the type of inheritance which works together with these specifications in a stupidly complicated set of rules. Ignore it until you really need it, then read the documentation or something. It was all put together when we didn’t really know quite how we will use it, so Bjarne was leaving his options open.
Anyway … a derived class gets everything from a base class; it’s an extension of it. Many languages use “extends” to denote inheritance. This is cool. It also means that these types are now polymorphic – a circle is a shape, you can use it whenever a shape is required. Not every shape is a circle though, so you can only use a shape where a circle is required if that shape actually is a circle. Every shape has an actual type.
Casting
There’s a subtle point here; you’ll understand it now and then forget it, but hopefully remember it again when needed. When you create an instance of a class, an object, you have to do that using its actual type; because memory has to be allocated correctly, the right constructor has to be invoked etc. So once you create a variable like this, it has to be of the actual type. The only way to use polymorphic properties of classes is through references or pointers. A reference (or a pointer) to a shape can actually refer to (or point to) a circle. It’s not as bad as it sounds, there is always this two step process, but there is syntax and tricks to make it barely visible in code. There are also features in C++ to let you go the other way around – if you have a reference to a shape, and it actually refers to a circle, you can cast it to a reference to a circle (same for pointers). You can go up and down the chain like this. It’s called dynamic casting. (because Bjarne was so thoughtful, there are actually four different kinds of casting in C++ and you have to be real precise about what you mean; it’s a nightmare) Other languages let you do it too, but they are generally much simpler to use, just say what type you want to convert your thing to, and if it is possible, they do it, if not they tell you it can’t be done.
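A sketch of what that looks like with our shapes (one caveat I’m glossing over: dynamic_cast only works once the base class has at least one virtual function, which is exactly what the next sections are about):
void use(Shape* s) {
    Circle* c = dynamic_cast<Circle*>(s);   // succeeds only if s actually points at a Circle
    if (c != nullptr) {
        float r = c->radius();              // safe: we now know it really is a circle
    }
}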
If you have to cast, it means that you are jumping between levels of abstractions. It is a red light that your design may not be perfect yet. No big deal, but you should be aware of it. Also that many arrogant pricks see casting as a sign of weakness.
Class Hierarchies
Back to inheritance.
It can get complicated, a circle is a special case of ellipse, which is also a shape, there are many types of triangles, all are triangles. You end up with entire hierarchies. When it gets interesting is when you have multiple inheritance – a right angled isosceles triangle should be derived from both a right angled triangle and an isosceles triangle. But those two are distinct types, they may end up having data and methods with the same names meaning different things – which one ends up in the right angled isosceles triangle?
There’s conventions and workarounds, restrictions and black magic. There isn’t a pretty solution; you have to design the entire hierarchy at once if you need multiple inheritance. Which is a problem, because you can design a nice clean hierarchy, and then someone else (or a future you) realises they need to add something else that derives from multiple leaves of your hierarchy. Most people just give up and don’t do multiple inheritance, many languages simply say, nope, can’t do that. Most have a halfway solution, we’ll get to it. In any case, it is hard to do, and never leaves you satisfied.
In any case, using dynamic casting to move around all these hierarchies is called navigating the class hierarchy.
Virtual Functions, v-tables
Here’s a question: what if I wanted to be able to ask a shape for its area?
Every shape has it, but the shape class can’t know how to calculate it for all the subclasses. It doesn’t even know about its subclasses (bad parenting, but good for keeping abstractions pure).
Bjarne thought about this too. He created virtual functions. These are methods on a class that are declared in a base, but derived classes can override them to provide a version that is meaningful for them. Then, when you invoke these functions, no matter what type the reference or the pointer is, the version of the function that gets invoked is the one from the actual type of the object. So you ask a shape for its area, and no matter what the actual type is, the correct function gets invoked. If you do this right, you almost never have to cast things (or so the theory goes, not true, but a worthy thing to aspire to when designing classes).
The way it works is that when you write your base class, you just mark a function as virtual. From then on, if a derived class implements the same function, it is used as the override. You can either provide a default implementation in your base class, if you have one, or you can mark the function as “= 0”. It is then a pure virtual function, and you cannot create objects of the base class type; you can only create objects of the derived types that implement the function. The base class is only used for polymorphism, you can only have references or pointers to it, but they must refer to instances of types that have all functions implemented. Such classes are called abstract classes.
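A minimal sketch of this with the shapes (the circle’s area formula is the obvious one, everything else is made up to keep it short):
class Shape {
public:
    virtual float area() const = 0;   // pure virtual: Shape is now an abstract class
    virtual ~Shape() {}               // (a virtual destructor is good manners in a base class)
};

class Circle : public Shape {
public:
    Circle(float r) : _radius(r) {}
    float area() const override { return 3.14159f * _radius * _radius; }
private:
    float _radius;
};

float area_of(const Shape& shape) {
    return shape.area();   // whatever the actual type is, its own version of area runs
}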
The way it works at the machine level is actually very simple.
Whenever there is a virtual function in a class, a secret data member is created that becomes a part of the class. You can never see it, but it is there. It’s just a table that maps function names (and signatures) to function pointers. When you create an instance of anything in the class hierarchy, the table gets populated by pointers to the correct functions, the ones relevant to that class. Then when a virtual function is invoked, before anything else happens, there is a table lookup, and the function that gets invoked is the one at the address in the table. Everything else works just like for any other function.
This mechanism is so simple and applicable that v-table became a term used in all languages that implement virtual functions; which is anything that implements classes. In Python, every member function is virtual, you don’t even have to declare it.
It adds a little bit more work to invoking a function, so if you are going to be calling one of these millions of times and it is a small function, you may need to worry about it. But you really have to be sure that’s the reason, it very rarely is. These days compilers detect this and fix it for you automatically most times.
Abstract Classes, Interfaces
Our boy Bjarne was troubled by all this mess with multiple inheritance, and more generally by the details of machinery influencing the design of abstractions. He thought about it a lot, read a lot, thought about it more.
His shapes were beautiful and worked so well as long as he was thinking about them as pure abstractions; each just described by the methods and properties they have. But as soon as he started looking into implementations, the data behind them, he had to start thinking about their interactions at the implementation level; about storage, this and that. It was all becoming tangled up and difficult to separate. He didn’t like that.
True, the inheritance of implementation and all the consequences of it are necessary, but why should the users of his beautiful concepts have to be burdened by that? If he could find a way for them to use all this without any concern for the code behind it, wouldn’t that be nice? Then if he wanted to fix things later to extend his hierarchies, or just make them faster because he has more time to work on it, he can do that, and the users don’t have to change a thing in their code. He should really be the only person suffering the burden of the implementation details.
What if, thought our hero to himself, I make the shape, and the circle, and all of them really pure – make them only contain pure virtual functions. No implementation, no data, just v-tables. Then, I derive a class from shape that contains the implementation, and a class from that that is also derived from the pure circle, and contains the implementation of the circle parts. And I do that for everything. So I have Shape and ShapeImpl, Circle and CircleImpl … ShapeImpl derived from Shape, CircleImpl derived from ShapeImpl and Circle.
Wow, wow, wow. I don’t have to do anything new for this, just lay it out like that and maybe provide some functions to create them properly, like create_circle creates a CircleImpl but returns a pointer to Circle. Like a factory for these objects – a separate class that just has static methods that create pointers to stuff in my purely abstract hierarchy.
Then my users only ever see and use the pure abstractions. I can easily tuck all the implementation stuff away – hell, I’ll add more bits and pieces to the language to help that. I can never be free of the implementation details, realised Bjarne, but people using my abstractions can. A future me also can – once I build this, I can freely think about things I’m doing with it, build on top of it, without ever again caring about the stuff in the other context.
And so, the idea of a pure interface came about and became real. A set of functions fully describing an abstraction, completely detached from the implementation.
Along the way came patterns for building such things: the abstract factory, the pimpl (pointer to implementation); both are terms used all the time now. Implemented in other ways too, but the patterns are there.
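A sketch of the layout, trimmed down to one interface and one implementation class (create_circle is the same factory idea as above):
// the pure abstractions: nothing but pure virtual functions, no data
class Shape {
public:
    virtual float area() const = 0;
    virtual ~Shape() {}
};

class Circle : public Shape {
public:
    virtual float radius() const = 0;
};

// the implementation lives elsewhere; users of Shape and Circle never see it
class CircleImpl : public Circle {
public:
    CircleImpl(float r) : _radius(r) {}
    float area() const override { return 3.14159f * _radius * _radius; }
    float radius() const override { return _radius; }
private:
    float _radius;
};

// the factory: hands back a pointer to the pure abstraction, never the Impl
Circle* create_circle(float radius) { return new CircleImpl(radius); }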
Java formalised this a bit – there you have a separate concept (and a keyword) for interfaces (these pure abstract classes) and you can only do multiple inheritance from multiple interfaces.
Polymorphism
We talked about function overloading already, even in C this was easy, in C++ it is used a lot more. It’s a dead simple mechanism, just define functions with the same name but different parameters. That way, you have the same idea applied to different data types. Bjarne’s overriding of virtual functions has a different implementation, but conceptually it’s a type of overloading – remember the this pointer? Conceptually it is passed in to functions, and it always has the type of pointer to the class that contains the function; so … same name, but the first argument is of a different type. With overriding, you can’t change any other arguments, in fact, you can’t change any, the compiler does it for you. This is because you can combine the two – you can have a whole lot of overloaded virtual functions and then override them in derived classes. People even ask about the distinction in interviews … I don’t like that; it is an implementation detail. You have to remember the distinction when coding, so that you know what overrides what, but really, the important bit here is that you can make functions polymorphic.
With class hierarchies, you can make objects polymorphic.
All this matters because even though the type systems are profoundly useful, thinking about types in your problem is at a meta-level. Certainly needed and useful, but once you have defined the data and their relationships in your domain, you should be able to focus freely on the details of the story. The classes you created are part of the language you tell the story in, not the story itself. Designing things to be polymorphic then lets you write stories without having to convert types just so that you can invoke a function, or to give different names to things that mean the same thing to you, only because they operate on different types.
Object oriented books and courses generally tell you that polymorphism lets you think less. It’s bullshit, it lets you think more. Whenever you switch contexts between different levels of abstraction, you drop out of the flow, it slows you down.
Young Bjarne saw another way to help this, he is autotelic, he cares about the flow. He provided a way to implement your own type conversions as parts of classes. Think back to our ListNode. It carries information about the data structure (the “next” pointer), but that is rarely of interest when you are looking at a single node; what you most often really care for is the data that the node contains. To you, in most cases, the node is the data; why should you care about how your data is organised once you have made that decision already. I need something else to be able to show you code that does it, but what you should be able to do is use the node whenever you would want to use the data that it carries. With a little bit of care, you can specify how to do this when defining your classes.
You have to be careful because you have to be sure – you are providing means for automatic conversions between types (they are actually called implicit conversions); if you do that, but also have an overloaded function that has a version that takes the node, and another one that takes the type of the data that the node contains, it becomes messy and confusing – which of the two functions gets called? Many people hate implicit conversions; relying on abstractions and having to think about it scares them. It’s not their fault, they believe what books promising easy life say; but it is annoying.
Namespaces, Friends
Our boy Bjarne did not set out to change the world. He just wanted to give it something that he thought would help others like him express their minds better, and he knew he could make it efficient too, even though the concepts were very high level. It made sense to him.
But he did so well, so very well, that a whole industry was built around trying to encapsulate (haha) the success into simplistic rules without having to consider all the depth that goes into making things beautiful. That is all that Object Oriented programming is.
Bjarne didn’t worry about that; he did other things to make it all work better, things that matter, but were either too complicated to comprehend, or didn’t fit into the simplified OO narrative, so are rarely mentioned in that context.
He knew, for example, that not every context we think of is a class or an object, and that they can be nested too. He created namespaces; a simple and pure abstraction that groups names of things. Nothing else really, no storage, no prescribed content, just a name. It can contain variables, functions, classes, or other namespaces. Then if you want to access any definition within a namespace, either you include the name of the namespace as a prefix to the name of the thing, or you can just say that within your current context you will be using whatever is in a particular namespace, as a way to merge contexts. I haven’t seen many other languages implement this quite as cleanly, and I have no idea why. Most languages expect you to use classes with static members for this; which works I guess, but it’s not as pretty.
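A sketch of how that reads in code:
namespace geometry {
    class Circle { /* ... */ };
    float pi() { return 3.14159f; }
}

// either prefix names with the namespace:
float p = geometry::pi();

// or merge the contexts and use the names directly:
using namespace geometry;
float p2 = pi();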
He also realised that the encapsulation cuts both ways; sometimes you can make things prettier if you allow classes or functions to know the private details of another class. The fact that you choose to model abstractions in a certain way in their interfaces, and to hide some details from the interfaces, shouldn’t force you to implement them in exactly the same way. Interfaces and implementation are two separate contexts, entirely different worlds; the artistic part of programming is all about making these work together. He called such classes and functions that are allowed to know intimate details of another class friends (awn). Object Oriented fanatics really don’t like this, to them encapsulation is holier than beauty. Few (if any) other languages implement this.
Templates
There was something else grating our hero: void*. Not only does it not have any useful type information, it is really a void for types; once they go there, they aren’t coming back. If you use it, you just have to remember what it was if you want to recover it.
The only reason we use it though, is because we are implementing abstractions so generic that they would work for any type. Other languages try to hide this, for example Java and Python have a base type for all objects (Object), which looks like something but acts much the same as void*.
What can be done about this?
One thing you could do is implement a version of your generic abstractions over and over again for each type that you care for. It doesn’t have to be too bad, you wouldn’t need too many of those in a typical application. If you did that, you would just be copying code over and over again, and only ever change the type declarations in a few places (wherever you had void* before). So in your new implementation everything is “strongly typed”, meaning, everything has an actual type. It’s a good thing, compilers can check for errors a lot better like that.
But it’s really boring.
In C, and some other languages, there is a thing called the macro pre-processor. It’s a special little language that you can use to prepare code for compiling. It does things like include other files into your file, but also you can write macros that create bits of code from the parameters they take. It works, but it is very limited, somewhat awkward to use, and it is in no way related to C or C++, just some little language for generating text from parameters. Anything more complicated than the simplest trivial application and you are bound to introduce bugs into your code.
Having said all that – even though it is an unintended use of them, they give us a way to generate code from templates.
Our boy Bjarne came up with a framework within his language to generate code in exactly the way we need to, to be able to get rid of void*. It sounds petty, but it isn’t; it means now that you can write your abstractions, no matter how generic, as complete descriptions of what they do and the types they generate. This in turn makes it all consistent: a list node that contains a string is not the same type as the list node that contains an integer.
He called this … erm, templates.
There is a lot of nuance to them; he thought it all through in depth. You can, for example, specialise them and provide specialised implementations for certain types, because that may make things faster; or even because the world is not perfect and not all types implement the same interfaces, so you may have to do something special to make generic functions work for some types. You have to be very careful about that last point; your templated code may depend on properties of the types it operates on in subtle ways. All in all, it is hard to do right. But it opens doors for implementing beautiful generic abstractions that are also very efficient. C++ makes heavy use of them; it comes with a standard library of types and functions, many of which are implemented as templates. You get all the standard data structures (vectors, lists, maps…) prepackaged, in headers that you include in your code. All you ever have to do is remember to declare the type of the contained data and off you go. Strings in C++ are also implemented like this.
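A small sketch of specialisation, with a made-up function storage_size that needs a special case for one type:

#include <cstddef>
#include <string>

// the generic version works for any type
template<typename T>
std::size_t storage_size(const T&) { return sizeof(T); }

// a specialisation for std::string, where the interesting size is the length of the text it holds
template<>
std::size_t storage_size<std::string>(const std::string& s) { return s.size(); }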
The important concept of a smart pointer is implemented this way too. Smart pointers are a beautiful idea that lets you use pointers without having to remember to release the memory; it happens automatically. They are a part of the standard library, implemented in code using templates, not a part of the core C++ language definition, but they are so standard that they change the way your code looks; they free it hugely from the burden of having to manually manage memory. Look them up and implement your own version when you start using C++.
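A minimal sketch of the idea behind them, to give a flavour (the real standard ones, std::unique_ptr and std::shared_ptr, do a lot more):

template<typename T>
class smart_pointer {
private:
    T* _raw;
public:
    explicit smart_pointer(T* raw) : _raw(raw) {}
    ~smart_pointer() { delete _raw; }               // memory is released automatically here
    T& operator*() { return *_raw; }                // so it can be used like a plain pointer
    T* operator->() { return _raw; }
    smart_pointer(const smart_pointer&) = delete;   // forbid copying, so the memory is deleted exactly once
    smart_pointer& operator=(const smart_pointer&) = delete;
};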
Few other languages implement templates quite like this. Java and C# have generics, which look similar, but I hear they are horrible. Scala’s are actually quite ok, though somewhat involved. Python doesn’t have them.
Putting it all Together
With all this, let’s have a quick look at what it would mean for our implementation of the linked list. This is much closer to what it would really look like in C++ code.
#include <memory>
#include <vector>

namespace garden {

template<typename T>
class List;   // declared up front so ListNode can name it as a friend

template<typename T>
class ListNode {
private:
    std::shared_ptr<T> _data;
    std::shared_ptr<ListNode> _next;
public:
    explicit ListNode(T* node_data);
    ~ListNode();
    T& get_data();
    const T& get_data() const;
    bool is_last() const;
    friend class List<T>;   // List is allowed to follow our private _next pointer
};

template<typename T>
class List {
private:
    std::shared_ptr<ListNode<T>> head;
public:
    List();
    void append(T* new_data);
    void remove_node(ListNode<T>& node);
    ListNode<T>& get_head();
    List<T> get_tail();
};

template<typename T>
List<T> createListFromVector(std::vector<T>& data_vector);

}
It takes a bit of getting used to before you can read this easily, and in real use cases there would be more methods and more versions of them (for the various cases of constness). It is hard to write a properly generic template, but this illustrates a lot.
We have all this in our own namespace, so if we are using some other library that also implements their own version of a list, that’s cool, each lives in their own namespace.
We properly separated ListNode from List, they are different things and our interface shows this. List needs to access the next pointer in the list node, so they are friends now.
Both are templated – this works no matter what the type of contained data is. Because we fenced it all in a namespace, we can also implement functions that operate on top of the list separately, as just functions, not static members; we grouped it all together nicely.
Because we are using templates, the return values from all the functions in the interface are properly typed, we don’t need to be casting things every time we use them.
We don’t even need pointers, references are prettier to use, so we return references instead of pointers.
Where we have to use pointers, we use smart ones; they come with the standard C++ library, and so does vector.
Now, to declare a List of strings, we can just write List<string> ls; and for a list of integers, List<int> li;.
This is called static polymorphism, or compile time polymorphism.
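A slightly fuller usage sketch, with the namespace prefix written out, assuming the declarations above have been given real implementations:

#include <string>
#include <vector>

void demo() {
    garden::List<std::string> flowers;          // a list of strings, fully typed
    flowers.append(new std::string("rose"));    // no casts anywhere, the compiler knows the type

    std::vector<int> numbers = {1, 2, 3};
    garden::List<int> counted = garden::createListFromVector(numbers);   // an entirely separate type, List<int>
}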
The thing is, whenever you declare an instance of a template with actual types as template arguments, the compiler first generates code for those actual types, and then compiles that code as if the types were hardcoded. There is no difference at the machine level; all of this vanishes before machine instructions are generated.
It does mean that your final compiled binary will contain separate code for the implementation of a list of integers and separate code for the implementation of a list of strings. There are ways to make this go crazy, so that your binaries become much bigger than you want them to (it’s called code bloat in geekspeak). But it’s rare, and you have to get quite fancy with how you use templates for it to become a concern.
There is much more to templates, books get written exclusively about all the magic you can do with them. As a general guideline, you would use them only if you are writing something that is meant to be very generic and will be used as such, if you will end up copying and pasting a lot of code without them, or if you think using templates makes you cool.
About Bjarne Stroustrup
I don’t know Bjarne Stroustrup outside of his work. I saw him speak once; Morgan Stanley employs him as a special consultant, he spoke at one of their town halls. He was bored and uncomfortable and went through slides he had prepared for some more interesting C++ conference. I read his books, they are written as technical manuals and only rarely, mostly in footnotes, can you glimpse the love he feels for what he does.
I invented this story about him completely, it seemed appropriate. In his own words, of course he did not come up with all of this on his own and out of nothing; he often cites a language called Simula as the source of a lot of inspiration. Outside of his references, hardly anybody has heard of Simula.
What is certain is that even though many of the ideas he implemented came from somewhere, other languages, academic papers, conferences, and discussions; nobody else put together a coherent framework that was also usable and efficient enough, one that combines such deep thought about every detail with a clear understanding of how important storytelling is.
He, indeed, did not set out to change the world, but in very many ways he did. Which made him a very good template for the hero of this part of our story.
Comments, Formatting, Documentation, Testing
Programmers spend a lot of time reading code, their own or someone else’s. This is how we learn, how we debug, and how we communicate ideas.
Every language provides a way to add comments to code, snippets of text that additionally explain what’s going on in the code.
There is a lot written about the right way to comment, format, or document code. Some say that you should comment everything, basically duplicate your code in a natural language. It’s nonsense, why would you make people read the same thing twice, and interrupt their flow as they are reading code? Some say that you should comment nothing, code should be written to be so expressive you don’t need anything else. It’s nonsense, there is a reason why we don’t write emails in C++. Similarly, there are very strong opinions about formatting code, and exactly the right amount of documentation that should be written.
In the end, none of that is good advice. Help your fellow human being understand what you meant, why you did what you did; lay out your code and comments with only that in mind. Help them use the code you wrote without having to understand all the nuance and pain that you had to go through to write it; lay out your documentation like that.
There are a lot of tools around that will translate comments into documentation automatically. They prescribe a way to comment functions and their arguments, classes and their members, and as long as you follow that, they will generate documentation about each function and each class in a pretty HTML layout. (Doxygen, Javadoc … there are many.)
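A sketch of what such prescribed comments look like, assuming Doxygen-style markers (water_plant is a made-up function):

#include <string>

/**
 * Waters a single plant.
 *
 * @param name   the name of the plant to water
 * @param litres how much water to use
 * @return true if the plant was found and watered
 */
bool water_plant(const std::string& name, double litres);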
I don’t like them, I never found such documentation useful, but most disagree with me.
Python is big on this. Python syntax relies on formatting, you can’t get away from it; whitespace is part of the code, used to delimit blocks. It also comes with pydoc as standard; out of the box it can create documentation for your code from specially placed comments called docstrings, and they are part of every manual or book on Python.
Tests are equally confusing to people. This is because they have a dual purpose. Firstly, and more obviously, a good set of tests that you can run whenever something is changed (code or data) helps in long-lived projects and those developed by big teams; it’s good to catch bugs before you release code.
Secondly, if you write them with a bit of care, they are usage documentation for your code. To test a function, you have to write code that uses it. To test it fully, you have to write code that covers all the use cases your function can handle. Programmers who come on board in a team can then start reading and stepping through test code; they will learn more about it that way than any other way.
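A sketch of a test that doubles as usage documentation, assuming the garden::List from earlier has been implemented (a plain assert stands in for a real test framework):

#include <cassert>
#include <string>

void test_append_and_get_head() {
    garden::List<std::string> flowers;
    flowers.append(new std::string("rose"));

    // reading this tells a newcomer how the list is meant to be used
    assert(flowers.get_head().get_data() == "rose");
    assert(flowers.get_head().is_last());
}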
However, as long as everything works, writing tests doesn’t add anything to a project. To management they are often seen as pure cost: just get people to write code with no bugs and you don’t need tests; sure. On the other extreme is test-driven development: the first thing you write, before anything else, is tests, and then you write code until they pass. The idea is that you should have a clear picture in mind of what you are trying to achieve before you set out to write code for it. Except that most of the time you learn about the domain by probing it, and you come up with creative solutions while trying to solve problems and iterating over your attempts. Nonsense, all of it.
As with documentation, help your fellow human beings by trying to reduce the amount of stress caused by a worry that they will break something with their changes, and by giving them something nice to read and step through while learning about your code.
Agile
Agile is a cult-like management paradigm that purports to be based on the scientific method while it is in fact based on wet dreams about fast delivery for the purpose of spinning money.
It gains credibility by incorporating in its rituals some genuinely useful processes and procedures that can sometimes help; but then it does what every cult does and prescribes them as necessary, the only correct path to Good.
Learn all about it, then pick and choose bits that are useful to your projects; then go on and read about other ways to do it and do the same.