In the beginning, there was machine code. Long lists of pure bit patterns encoding instructions known to the CPU, memory addresses, and numbers.
It was hard, but this is how computers work. To make it easier, we gave names to the instruction bit patterns, and started using numbers in decimal, octal, and hexadecimal notation. We also agreed on how to encode text using ASCII, and added a tiny little bit of syntax to it all; some commas and lines of comments in English that computers ignore. This is assembly code. It maps directly to machine code instructions, it just names them a little better so that we can recognise them. We call these names mnemonics, and the numbers they represent: opcodes. We wrote little programs that can translate assembly code to machine code – assemblers.
We could now write code in a file that is (somewhat) readable, and give that file as input to an assembler, and it would output a pure binary file that contained machine code, instructions that a CPU can execute in order. We added something nice to it too: you could separate this code into a number of files and then reuse some of them over and over again. A simple instruction to the assembler would tell it to include the instructions from another file before it starts assembling. Coding became modular.
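Just to make the idea concrete, here is a toy assembler sketched in Python. The mnemonics and opcodes here are invented purely for illustration; real instruction sets are far richer, but the principle is exactly this thin.

```python
# A toy "assembler": maps made-up mnemonics to made-up opcodes.
# Assembly is just a thin naming layer over machine code; this shows
# the translation and nothing more.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0xFF}

def assemble(source: str) -> bytes:
    machine_code = []
    for line in source.splitlines():
        line = line.split(";")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        mnemonic, args = parts[0], parts[1:]
        machine_code.append(OPCODES[mnemonic])      # the opcode itself
        machine_code.extend(int(a) for a in args)   # operands as raw bytes
    return bytes(machine_code)

program = """
LOAD 10    ; load the value at address 10
ADD 11     ; add the value at address 11
STORE 12   ; store the result at address 12
HALT
"""
print(assemble(program).hex())  # -> 010a020b030cff
```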
Because this mapping between assembly and machine code is so trivial, we also wrote disassemblers – programs to translate in the other direction. We also figured out that we can print the state of memory and registers and messages to ourselves at certain points in the code, and that we can even make the code stop so that we can see the state of affairs at a certain point during the execution – we could now debug code far, far better. It was a painful process still, but we could see what our thoughts look like when they do things.
We then started seeing patterns in how we think about data, and out of that emerged an understanding of abstract data types and data containers. We also figured out efficient ways to invoke parts of code with minimal information about what is going on inside them; we started implementing subroutines and then functions.
We already knew from theory before we even made computers that we can reduce all computation to a small number of constructs that we can keep reusing over and over again, and we now started understanding what a useful set of those would be; and we learned how to implement those abstractions in assembly. We were now thinking in variables, if statements, loops, and functions operating on stacks, arrays, and lists.
This was powerful. We could converse about algorithms without the burden of having to understand the details of the hardware. We had stories now, and the algorithmic constructs we came up with were our metaphors. We could tap into our imagination freely, and we could share it. It felt good. We started seeing patterns, patterns within patterns, structures within structures built upon structures within structures.
We slowly learned how to do better. In the early 1950s we started experimenting with writing code that would translate things that look more like these structured ideas we had into assembly, and then machine code. We called these translation programs compilers. In 1957 FORTRAN was developed (FORmula TRANslation). It had most of the features we still see in programming languages of today – variables, functions, loops, if statements. It still cared a whole lot about how the machine worked, a lot more than about how we think, but a lot less than assembly does.
Incidentally, this was the time when we also experimented with drugs a lot. It was a time of celebrating love and stories, a time when we were more open minded than ever before, or since. We can’t know how many of those who built what came next were on shrooms, LSD, dope; what music they listened to; if they read Dick and Lem; or who they had sex with; but you needed an open mind and that was the zeitgeist.
Some of them came at it from the point of view of linguistics, using theories put forward by a young leftie (Chomsky), some were fueled by their theology and faith, many were pursuing a deep understanding of the human mind, continuing the work of the philosophers of the day, some were obsessed by the elegant beauty of algebra, others were funded by French nationalists to find ways to preserve the advancement of the French language, yet others were engineers who just wanted others to see the beauty they see in making things work. Most had big beards and long hair, and were mathematicians.
It took a while, but we learned how to build parsers – algorithms that break simple languages down into data structures – and how to do it really well. We learned exactly how far we can push them and how far we cannot, we learned a lot about semiotics, and how to map ideas into symbols with syntax and semantics, and then how to map those into simplistic operations that even a machine can do. It was a naive effort in many ways – most hoped it would help us understand what we are, what a mind is. It helped us understand a lot more about what we are not than about what we are, but along the way we built amazing things, and we still do on the back of it.
Procedural Programming Languages
Over time, procedural languages established themselves firmly in the mainstream. These are languages that you would know and recognise immediately; programs formulated as steps of actions and decisions that the computer is to perform. It started with Algol and Fortran, with BASIC as a simplistic version of them. There were others, many, but then came C. It was small and elegant, it hit the right spot with the level of abstraction, and it still let you deal with the underlying abstractions of the operating system directly; it even left room for dropping interesting bits of assembly directly into the C code. It was very fast, very powerful, easy enough to learn as long as you knew a little about computers. It was very portable – it moved you away from the machine just enough, and it was simple to parse, which meant you could write compilers for it relatively easily and make them very efficient. It was succinct, quick to type, but still very readable afterwards. And it integrated really well with operating system concepts.
Along with it came a powerful scripting language for Unix. Unlike C, shell scripts (that’s what they are called) are not compiled – they are interpreted. This means that there is never a binary machine code file created. You pass the script in a file to an interpreter, which parses it, builds data structures in memory to represent the meaning of the code, and then executes it directly by invoking functions that represent the commands of the language within its own code. It understands what you want and does it, rather than creating code that would do it.
It is a lot slower than compiled languages most of the time. To start with, you have to parse the code every time you execute it. Then there is optimisation: compilers can spend a long time optimising the code they generate, and they do it quite well, but it takes time; interpreters can't spend too much time doing that or they would be even slower. But that doesn't always matter – ease of use wins when things don't have to be blindingly fast, or when they are small.
What did matter is that both of these were really good. We had pretty high level abstractions working efficiently. We could write pretty code, and the beauty would not hinder performance. We engineered ways to make things prettier and more understandable. We could talk about it and tell stories about it with ever more people, and still make money.
Then came the understanding that the meaning we give to data is all about what we can do with it: what we can ask about it, what we can transform it into, how it interacts with other data. Few formulate it this way, but that is what object oriented programming is all about. There were many implementations of this, but when C++ added features to C to support object oriented programming better, it became huge. Again, we had even more abstractions at hand, but it was still insanely efficient. From here on, we learned a whole lot about data types; there are branches of algebra developed to try to describe them, and they get applied in very modern languages (e.g. Scala).
After that, we started realising that compiling code for a particular kind of machine or OS was a real pain; and at the time there were a lot more of those. So we made Java. Java was very much like C++, but instead of running on the computer directly, it would run inside a special virtual machine – the JVM (Java Virtual Machine). Java code compiles to assembly-like code (Java bytecode) that runs on the JVM. There are implementations of the JVM for many operating systems and many bare metal platforms. The aim was to make distributed computing easier. Another thing it introduced is “garbage collection”: there is still a heap for dealing with memory, but you don’t have to worry about releasing it – the JVM tracks your usage of memory and once in a while it throws the unused stuff back onto the heap.
Java was meant to run the internet, be the language for it. It was doomed to fail in that intent, but it introduced the idea of bytecode execution and garbage collection, and made them work. It’s still not as efficient as fully compiled languages, but when simplicity and easy portability matter more, it all works quite well. Python does it too – it has its own bytecode, which it generates on the first run of a program and then re-uses on the next run if nothing changed (or recreates it if something did).
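You can actually look at this bytecode yourself, using Python's standard dis module:

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function into Python bytecode -- the assembly-like
# instructions that the Python virtual machine actually executes.
dis.dis(add)
# Prints something like (details vary by Python version):
#   LOAD_FAST    a
#   LOAD_FAST    b
#   BINARY_OP    + (BINARY_ADD on older versions)
#   RETURN_VALUE
```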
And so, we know how to write things that are very abstract and make them run very fast; and we know how to write things that are even more abstract and make them easy when we don’t need so much speed. We can tell all sorts of stories.
Other Kinds of Programming Languages
Mind you, it wasn’t like this journey was linear at all; there was a lot of meandering and there are alternatives to procedural languages.
Firstly, there are functional ones.
Procedural languages are ultimately based on Turing’s view of computations, but we knew even back then that there are representations that are quite different in appearance but have exactly the same expressive and computational power. The most important one of these is Church’s lambda calculus. There everything is a function, even numbers are functions (they don’t do much, they just exist, it is very abstract). You do everything by having functions call each other; there are no loops, you use recursion instead and so on.
There is elegance and purity to this view. It comes more naturally to mathematicians (apparently! I disagree), but most importantly it completely abstracts the machinery away – all you ever write is what you want to compute; you never mention pointers, loops, instructions … none of that. The first implementation of this was LISP (LISt Processing) – it added a concept of a list of things to a pure functional view of the world, and it does work. Thinking about why it does provides lots of insights into the nature of algorithms and their meaning (there’s a weird and beautiful book called “Structure and Interpretation of Computer Programs” completely based on a language derived directly from LISP, https://edge.edx.org/assets/courseware/v1/480fe74fe104d87fc504da0ef24d85af/c4x/uc-berkeley/cs61as-1x/asset/sicp.pdf ).
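To get a feel for "even numbers are functions", here is what Church numerals from lambda calculus look like when transcribed into Python:

```python
# Church numerals: the number n is "the function that applies f, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))   # one more application of f
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# To see a Church numeral as an ordinary integer, count the applications:
to_int = lambda n: n(lambda k: k + 1)(0)

two   = succ(succ(zero))
three = succ(two)
print(to_int(add(two)(three)))  # 5 -- no loops, no arithmetic, just functions
```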
Functional languages are still around. Haskell is the most mainstream one, Lisp is still used for scripting add-ons in important applications, and so on; there are many examples, but they are niche. The much bigger impact was in the understanding of the paradigm, the world view. Most programming languages in use support functional language features as a subset, and you can write in a functional style in most popular languages. Often, you end up writing smaller portions of bigger projects using this style. It is now one of the metaphors we use regularly.
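For a taste of that style, here is the same small computation written twice in Python, once procedurally and once functionally:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# Procedural style: an explicit loop and a mutable accumulator.
total = 0
for n in nums:
    if n % 2 == 0:
        total += n * n

# Functional style: no loops, no mutation -- just composed functions.
total_fn = reduce(
    lambda acc, n: acc + n,
    map(lambda n: n * n, filter(lambda n: n % 2 == 0, nums)),
    0,
)

print(total, total_fn)  # 20 20
```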
Another important direction was declarative programming. Here you don’t specify at all how you want things done, just what you want to get out of it. Turns out, using pure mathematical logic, and the mechanical deduction steps used to prove things, you can actually represent all the computations that a Turing machine can. PROLOG (PROgrammation en LOGique) was developed with funding from the French Academy of Science, who were worried that the purity of the French language was being lost and wanted a way to analyse written human language. Correctly, the implementers realised that they need AI for this. Incorrectly, they assumed that all you need is a deduction engine that will start from axioms and churn through steps to reach proofs. They were foolish, but the experiment was super valuable. Prolog is still alive and people do write code in it. You code by stating what you want to get out of it (the “proof”) and it executes a deduction loop, breaking the requirements down and building a logical deduction tree from it. Along the way, it does what you want it to. It is weird, not really practical, but beautiful.
Like with functional languages though, the idea of declaring what you want to see in a solution, rather than the steps to get there, and relying on logical steps from basic assumptions to your request being deducible automatically, is in itself useful.
The most successful example by far is SQL – the queries only talk about the result you want to see, not about the way in which it should be done. It is domain specific though, it only works with databases and getting data out of them. Another one is the riskmod risk language.
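You can get a feel for this from Python, using its built-in sqlite3 module; the table and the data below are made up for illustration:

```python
import sqlite3

# An in-memory database with a made-up table, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ada", 36), ("Alan", 41), ("Grace", 85)])

# The query is declarative: it says WHAT we want (names of people over 40,
# oldest first), not HOW to scan, filter, or sort -- the engine decides that.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 40 ORDER BY age DESC"
).fetchall()
print(rows)  # [('Grace',), ('Alan',)]
```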
There was also a lot of experimentation with visual programming languages. Some are quite good (e.g. Scratch), but generally it turns out that as soon as things become complex enough, drawing pictures of them doesn’t help much more than writing about them. There is plenty of usage in domain specific languages – all sorts of workflow based programming happens like this.
This may change – we are still learning; I am hopeful that graphical programming languages will become much more popular. They are nice, and they can be very pretty.
There is a lot more in the fringes – google “esoteric programming languages” when bored. Most of them were invented for fun, but quite a lot of experimentation happens there.
We just like our storytelling metaphors and languages being limited chafes us like a bitch.
Where Are We Now?
It continues. New languages keep popping up, old ones get improved. Domain specific languages (DSLs) keep getting written.
Technically, there have been many improvements. We learned a whole lot about how to make all this work faster; how to gel features from different paradigms into new languages targeted at specific use cases; how to draw more complex deductions about memory and CPU usage from code written in abstract languages, both to make things easier to write and to optimise the code on the fly. We got really quite good at that – compilers nowadays produce far more efficient code with far less effort than ever before. At the same time computers are way more powerful, so the quality of the code is often less critical to delivery timelines. Because of that, we now have increasingly better tools and, at the same time, increasing amounts of shit being produced. In short, the industrialisation, and even democratisation, of coding are doing very well.
Alas, there haven’t been any major inspired insights for a while in the development of programming languages. Perhaps it is laziness, or perhaps we peaked and there is little else to be done between what we have and natural languages. ChatGPT writing code on demand certainly seems to be the flavour of the day.
If I was to guess though, a lot of beauty is still to be found in extending the applications of all this knowledge we have; perhaps new domains, and new points of view brought in from exploring those domains, will bring along new inspiration in style and form, even if not in low level technicalities. Like it worked with oil paintings, or poetry, or mathematics.
How Do Compilers and Interpreters Actually Work?
There is a whole lot of theory behind this, linguistics, logic, mathematics, blah blah. It is cool stuff and fun if you have time for it, but also lots of detail, and lots of unnecessary obfuscation. At the end of the day, mostly it goes something like this:
- First you tokenize the code – you go through it and recognise what each separate token is intended to be: is it a language keyword, is it just a number, a variable name, an operator, etc.
- Then you go through the list of tokens and figure out if they follow the rules of your syntax, and if they do, you arrange them into a data structure called an abstract syntax tree (see the sketch after this list). Basically, if you have something like 2+2, you build an object that represents addition and you store the left and right arguments (2 and 2 in this case) inside that object. If you have something like 2+2+2, that’s actually (2+2)+2, so the content of the bracket is an object like the one above, and then there is another “addition” object that has the first addition object as its first parameter, and just 2 as its second.
- Then if your language is clever, you go through all that and do all sorts of type analyses – are you trying to assign a number to a string variable or something like that. You may also do any type conversion of data there (strings to dates, integers to floats etc), if your language allows it. And you report errors.
- Then you have a complete representation of your code in a data structure where every node is annotated with its meaning and its arguments.
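To make those steps concrete, here is a minimal sketch in Python of tokenizing and parsing a toy language that only knows integers and '+'. Representing tree nodes as tuples is just one convenient choice; real compilers use richer structures.

```python
import re

# Step 1: tokenize -- recognise what each piece of the text is meant to be.
def tokenize(code):
    tokens = []
    for match in re.finditer(r"\d+|\+|\S", code):
        text = match.group()
        if text.isdigit():
            tokens.append(("NUMBER", int(text)))
        elif text == "+":
            tokens.append(("PLUS", text))
        else:
            raise SyntaxError(f"unexpected character: {text!r}")
    return tokens

# Step 2: parse -- check the tokens follow the grammar and build a tree.
# 2+2+2 becomes ('add', ('add', 2, 2), 2), exactly the nesting described above.
def parse(tokens):
    kind, value = tokens.pop(0)
    if kind != "NUMBER":
        raise SyntaxError("expected a number")
    node = value
    while tokens:
        kind, _ = tokens.pop(0)
        if kind != "PLUS":
            raise SyntaxError("expected '+'")
        kind, value = tokens.pop(0)
        if kind != "NUMBER":
            raise SyntaxError("expected a number")
        node = ("add", node, value)   # left-associative, like (2+2)+2
    return node

print(parse(tokenize("2+2+2")))  # ('add', ('add', 2, 2), 2)
```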
You can now make another pass and see if you can optimise things. For example, you can replace the complicated tree for 2+2+2 with a very simple one that just represents number 6. There’s a whole lot of stuff you can do here, but it does become slow and tiresome, only worth doing if it’s really worth doing. Modern compilers are superb at this though, by the time they are done with it you can barely recognise your code. They drop entire functions, drop variables, move code around, all sorts.
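Sticking with the toy tuple trees from the sketch above, a constant-folding pass might look like this:

```python
# A toy constant-folding pass over ('add', left, right) tuple trees:
# if both sides are already numbers, do the addition now, at "compile
# time", instead of every time the code runs.
def fold(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if op == "add" and isinstance(left, int) and isinstance(right, int):
        return left + right
    return (op, left, right)

print(fold(("add", ("add", 2, 2), 2)))  # 6 -- the whole tree collapses
```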
Then you have a choice – either you traverse the tree structure you built and execute code that corresponds to each node (e.g., for the addition node, add the two parameters), like interpreters do; OR you generate assembly (or bytecode) that would perform these actions, like compilers do. Then you are done, though some languages (like Python) do a mixture: they generate an intermediate representation, “bytecode”, and then automatically execute it. While this is still interpreted, it allows them to use compiler optimisation techniques, and to store the optimised version of your code, which can then be executed faster.
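The interpreter half of that choice, for the same toy trees, is just a walk over the structure:

```python
# A toy tree-walking interpreter: visit each node of the
# ('add', left, right) trees from the sketches above and perform
# the operation it represents, instead of generating code for it.
def evaluate(node):
    if isinstance(node, int):
        return node             # a leaf: just a number
    op, left, right = node
    if op == "add":
        return evaluate(left) + evaluate(right)
    raise ValueError(f"unknown operation: {op}")

print(evaluate(("add", ("add", 2, 2), 2)))  # 6
```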
Tokenization is most often done using regular expressions; they are just great for matching patterns and are used all over the place, not just for this (you should learn about them, this looks good https://www.sitepoint.com/learn-regex/). The whole step is so well researched, it can be automated fairly easily. Many tools do it for you.
Step 2, often called parsing, can also be generalised and there are several algorithms that will do it for you.
You can specify the entire grammar of your language using a meta language (the most popular one of those is called Backus-Naur Form, BNF) and regular expressions, and then pass these specifications to third-party software which will generate the tokenization and parsing code for you in pretty much any language you like. With a couple of days of playing you can then have language parsers generated for you in Python, Java, C, C++ … whatever. The most popular tool for this is ANTLR.
Many languages have libraries that do this for you too – you feed them BNF and you get back a function or an object that parses the language for you.
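In Python, for instance, the third-party lark library works roughly like this (a sketch assuming lark is installed; its grammar notation is an EBNF dialect rather than strict BNF):

```python
# pip install lark -- a sketch assuming the third-party lark library.
from lark import Lark

# The grammar is data: change the string and you have a different language.
grammar = """
    ?expr: expr "+" NUMBER  -> add
         | NUMBER
    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="expr")
print(parser.parse("2 + 2 + 2").pretty())  # prints the parse tree
```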
Basically, getting a simple language going is pretty easy.
I like writing the parsers myself though. It gives me more freedom in the design (all these tools have limitations on what the grammars can look like, and they can’t quite beat handwritten code for speed). And once you’ve done one, it’s not really that hard.
The hard part is the design of the language, that is the invention bit, and then the optimisations if you do them, that needs thinking outside of the box. Both need an understanding about how programmers will represent meaning, and how to translate that meaning to something that executes. The rest is just coding.
Development Environment
The actual details of how you go from writing a program to executing it vary quite a bit from language to language.
There are some basic expectations that you can have.
- For pretty much all of them, the code is written in text files. They will have extensions that generally signify the language used, but that’s just convenience and convention in most cases.
- For compiled languages, the text files are input to the compiler and the output is a binary file: either an executable you can just run, or a binary library that other programs can use (static or dynamic). Executables will have a function where the execution starts (in most languages it’s called ‘main’). Libraries will have some annotation, either in the code itself or in a separate file, that tells the build tools which functions or classes are to be “exported”, made visible to other binaries that load them.
- For interpreted languages, the files are the input to an interpreter, which will then parse them and immediately execute them. They don’t always have a ‘main’ function; the interpreter just executes the code as it finds it in the file.
- For languages like Java or Scala, you need both steps. First the compiler generates bytecode in separate files (.class files in Java), then you run another executable that takes those files as input and executes them.
- All these compilers and interpreters are command line tools, and they generally take dozens of optional parameters. You can specify whether you are building an executable or a library, how much you want it optimised, whether to optimise for speed or for memory, and on and on and on.
- It becomes complicated when there is a lot of code split across many files. There are tools that define their own scripting languages just to handle this. They are like recipes for invoking compilers across complicated file structures. “Make” is the most popular one, Ant is popular for Java … there’s loads.
- There are “fully integrated” development environments (IDEs) that help hugely with all this complexity; almost always, this is where programmers live. The first significant one was Microsoft’s Visual Studio for C++, and all the rest are copying it. PyCharm is one, targeted at Python; Eclipse and IntelliJ were first written for Java, but can now handle many languages. Visual Studio Code is a Microsoft one (different from Visual Studio) that can support any language, has thousands of plugins, and works on any platform you can think of; likely the best option around by far. What they do is hide all this complexity (or at least lots of it) from you. You can work across multiple files, they add syntax colouring, they keep track of what files are there and which ones are needed, have GUIs for compiler options, word completion for code, automatic formatting … all sorts. They speed you up hugely.
- Code repositories. This is actually orthogonal to the development environment, but it is always used, and used all the time. These are specialised databases for code. They keep track of versions, they help share code, revert to a previous version, merge code changes made by you and some asshole who worked on the same file as you at the same time. Git (usually hosted on GitHub) is so much more popular than any other option these days that it’s the only one you need to know about. It is super fast and more powerful than any competitor, but it’s odd and idiosyncratic and terribly frustrating at times. It is a fact of life though. Oh, Git was written by Linus Torvalds, the same guy who started Linux.
- Debuggers. Every programming language will have them. They are special programs that run your code, but can stop execution at “breakpoints”. You mark a line in your code and it runs up to there, then stops. Then it lets you look at the values of variables, the call stack (we’ll talk more about this later), sometimes even change the values of variables on the fly. It then lets you step through the code line by line – it executes one line at a time and lets you check what’s going on. Modern ones are integrated into the development environments and the whole thing is very visual and very intuitive. You will not only be debugging, but you will learn much more about coding by stepping through code than you will by writing it. If your language has an optimising compiler, you will want to switch optimisations off for debugging sessions – the optimisation steps often output assembly that bears so little resemblance to your code that you end up stepping through source that no longer matches what is actually executing. Another thing that compilers do just to help debugging is annotate the assembly output with markers that point to the lines of code that produced each line of assembly. That way debuggers know how to step through your source code while actually executing machine code underneath it. Once you are done debugging and ready to release though, all this extra stuff slows you down, so compiled programming languages always have options for building “debug” and “release” versions of the code. The power of abstractions in programming languages is such that all this feels completely natural; you learn how to use powerful debuggers in a matter of hours, if that.
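In Python, for instance, you can drop into the built-in debugger, pdb, from any line of your code:

```python
# Python's built-in debugger, pdb. The breakpoint() call (Python 3.7+)
# stops execution right there and drops you into an interactive prompt
# where you can inspect variables ("p total"), look at the call stack
# ("where"), and step through the code line by line ("next", "step",
# "continue").
def average(values):
    total = sum(values)
    breakpoint()          # execution pauses here
    return total / len(values)

print(average([3, 5, 7]))
```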