Hello my friends! It’s been a while since my last post about Artificial Neural Networks. I’ve been very busy working on an exciting new AI project which I’m looking forward to telling you about. But today I’m here to talk to you a little bit more about some of that elusive foundational knowledge that makes AI in general a little easier to understand and to navigate.
The discussion of Neural Networks can get pretty technical pretty fast, and the challenge I have given myself is to strike the balance between deep enough and too deep. As always, all I am trying to do is share the level of knowledge that I’ve considered foundational in my efforts to transform myself into an AI consultant/technical seller or solution architect. I’ve had some validation on that front lately, as I was asked to deliver a talk at a local AI meetup about developing AI applications. The audience was AI professionals and enthusiasts and my talk was very well received. You can check it out for yourself if you’d like.
Anyway, back to the topic at hand. I had promised that after that introduction which focused on the way a single artificial neuron can be trained to solve actual problems, I would delve a little deeper into the types of topics that come up when you put a bunch of these neurons together into what is called a Neural Network. That’s what this post is about.
Building it Out
Now remember that we had a very high level concept of what an artificial neuron is:
We said that the Neuron took some input and "did something" to that input and produced the output. We later expanded on that and we discovered that in a simple neuron like the one above, the "doing something" could be represented as a mathematical function f(x), where x is the input value and y is the output value. We further showed how the function itself could be expressed as y = wx + b, where w is a parameter called the "weight" and b is another parameter called the "bias". Visually you can think of it like this:
So a value comes in on the left side, which is what the x represents. The neuron multiplies x by the weight assigned to the arm it comes in on, then adds the value of b to that, and that is what produces y.
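If it helps to see that in code, here is a minimal sketch of a single neuron in Python. The weight and bias values are made up purely for illustration (they happen to implement a Celsius-to-Fahrenheit conversion):

```python
def neuron(x, w, b):
    # "Doing something" to the input: scale it by the weight, then shift it by the bias
    return w * x + b

# Hypothetical parameters: w = 1.8 and b = 32 turn Celsius into Fahrenheit
print(neuron(100, w=1.8, b=32))  # 212.0
```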
Now for things more complicated than converting from one simple value to another, you will want to increase the complexity. Let’s say, for example, that you want to be able to recognize colors. As you probably know, colors can be represented by 3 values: a red value, a green value and a blue value. So if you want to identify colors, you really want to identify some particular mix of red, green and blue. Wouldn’t it be nice if we had neurons that specialized in each colour? Like this:
Every time we get a colour value (R,G,B) we send the R to the first neuron, the G to the second neuron and the B to the third neuron. These neurons are what we call the input layer. They are very simple: the function they perform is basically just to pass along the value they got as input. In other words, all three neurons set w to 1 and b to zero.
Notice the Yellow example – here both the Red detector and the Green detector are shouting that they see their colour. Yellow is actually a mix of red and green light. We need to be able to use this information. So let’s say we are interested not only in 3 colors but in all the colors of the rainbow: red, orange, yellow, green, blue, indigo and violet. We need to increase the complexity a bit more. Now we are going to start building out a network of neurons. We will add what are called layers to the neural network. Let me show you what it looks like and I’ll explain:
Whoah that got complicated pretty fast! Don’t worry – it’s not as bad as it seems. First of all, notice the grey Neurons – those are the same colour detectors we were talking about before. We are going to call them “input neurons” or “the input layer”. Generally you have one input neuron for each independent feature of your data. In this case our data is colors expressed as 3 values or 3 features, so we have 3 input neurons.
The next layer, where I colored the neurons in green, is called the “output layer”. Generally you will have one output neuron for each category of output that you are looking to predict. In our case, we want to identify which of the 7 colors of the rainbow is being shown to the input layer, so we have 7 output neurons. This kind of neural network has a special name. It’s called a “perceptron”. It’s considered to be a single-layer neural network. The input layer doesn’t count because it’s basically a pass-through.
How it Works
Remember that when you are dealing with artificial neurons and now, artificial neural networks, you have to train them before you can use them. So at first, this neural network is TERRIBLE at detecting colors. That’s because all the weights and biases are just set to random numbers. So we have to show this network a bunch of colors and then tell it each time whether it got the answer right or wrong.
Imagine that on the far right, in the output boxes, we label them from top to bottom with the colors of the rainbow: red, orange, yellow, green, blue, indigo and violet. What we are looking for is that when the network sees red, the top box should have the highest value. Likewise if we see green, we would expect the 4th from the top box to have the highest value. But if we see yellow, we expect the 3rd from the top box to have the highest value. That’s what we want to train this neural network to do. So let’s say we have this scenario:
We showed the network Blue and the expected outputs are written on the right in red. But the network’s output does not match at all. In fact, if you look for the highest value in the outputs, the network thinks this is Yellow. That’s clearly wrong. How did this happen? Well, the input nodes took the values that they observed (zero, zero and 255 respectively) and passed them on to the output nodes. But the output nodes are artificial neurons. There are weights assigned to each of the connections between the input nodes and the output nodes, and each of the output nodes also has a bias.
Unlike when we had just a single input going into a super simple neuron, these neurons in the output layer have one input per input node. Each of the connections to an input node has its own weight. So instead of f(x) just being y = wx + b, wx gets replaced by a weighted sum (w1x1 + w2x2 + w3x3) so the function looks more like y = (w1x1 + w2x2 + w3x3) + b. So now you can think of the weights as being a kind of score that tells the neuron how much it should “care” about inputs it is getting from a particular source. You can probably imagine that eventually, the “red” output neuron will have weights near 0 for all but the input that is connected to the red input node. Training the model is all about getting to those correct values for the weights and biases.
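Here is a small sketch of what one of those output neurons computes. The weights and bias are invented for illustration; imagine they belong to a "red detector" that has mostly learned to ignore the green and blue inputs:

```python
def output_neuron(inputs, weights, bias):
    # Weighted sum of all the inputs, plus the bias
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical "red detector": it cares a lot about R and almost nothing about G or B
red_score = output_neuron([255, 0, 0], weights=[0.9, 0.01, 0.02], bias=-1.0)
print(red_score)  # 228.5
```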
Back Propagation
Now there is a process that kicks in called “back propagation”. This is where we tell each output neuron just how wrong it was. For example, we can tell the “blue” output neuron that it should have given us a value twice as high as what it gave us. So “your answer should have been the answer you gave me x2”. And we tell all the other neurons “your answer should have been the answer you gave me x0”. The algorithm takes that feedback and makes a tiny tweak to the bias and/or the weights of its inputs to reduce its error next time. But remember, in training a neural network, we do this many, many times, so each tweak is small, but with every tweak we are trying to get closer and closer to good answers. The gradient descent algorithm is what works out the direction and size of each tweak so that, over many iterations, the overall error keeps shrinking.
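To make the "tiny tweak" idea concrete, here is a toy illustration of one gradient-descent update for a single neuron with a single input and a squared-error loss. All the numbers are made up; real training repeats this for every parameter, over many examples:

```python
# A toy gradient-descent step for one neuron with one input (illustrative values only)
w, b = 0.5, 0.0          # current parameters
x, target = 2.0, 10.0    # one training example
lr = 0.01                # learning rate: keeps each tweak tiny

y = w * x + b            # the neuron's guess: 1.0
error = y - target       # how wrong it was: -9.0

# Gradients of the squared error with respect to w and b
grad_w = 2 * error * x   # -36.0
grad_b = 2 * error       # -18.0

# Nudge the parameters a small step in the direction that reduces the error
w -= lr * grad_w         # 0.5 -> 0.86
b -= lr * grad_b         # 0.0 -> 0.18
```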
Just like when we played around with that single-neuron last time, we can see here how the weights and biases can eventually be tuned so that the right output neuron gets the high score.
Back propagation is not something you NEED to understand very well to work in the AI field unless you are very much a data scientist working on training models. But it’s useful to get a basic sense of it because it helps you understand how neural networks LEARN. The best resource I have found to explain how it works in a way that won’t make your brain explode is this YouTube Video on the absolutely amazing 3Blue1Brown channel. The important thing is to realize that the correction starts at the output layer which then tells the preceding layer to correct itself which then tells the layer before that to correct itself and so on.
Activation Functions
So we are really starting to understand the basic idea of neural networks now. You have these artificial neurons that receive one or more input values and produce a single output value. The relationship between input and output is controlled by a function which takes a weighted sum of the inputs, and adds a bias. But that is not the entire story. There is something else we need to talk about and that’s the activation function.
The way I think about activation functions is that it’s a way the model tells a neuron how to treat its own output. You can think of the activation function as something that sits just to the right of a neuron, between the place where the output of the neuron comes out, and the place where that output is received. It intercepts the neuron’s output and messes with it.
The default is a linear function. It’s pretty much the implied default activation function we have been working with so far. It just says that the neuron should treat its output as-is. Just pass it along. But for a neural network to be useful across use cases, it needs to be able to model non-linear relationships; if every neuron only ever applies a linear function, then stacking layer after layer still collapses into one big linear function. This is why we have activation functions. They let the model instruct neurons to treat their outputs non-linearly.
There are a ton of different activation functions you can associate with the neurons in your network, and I won’t even try to talk about all of them. You can read about some of the most popular ones at this site. But I am going to cover 4 of them because in my experience they come up the most.
In these next few sections about activation functions, x will refer to the neuron’s output (which we have been calling y so far), so keep that in mind. The f(x) below is not the weighted-sum function performed by the neuron itself; it’s an activation function that kicks in once the neuron has produced an output.
ReLU
f(x) = max(0,x)
ReLU (Rectified Linear Unit) is just a fancy name for a function that returns zero for output values below zero and returns the actual output value at or above zero. So if my output value is -2, ReLU will turn that into a zero. But if my output is 2, ReLU will keep it as a 2. So when a ReLU activation function is attached to a neuron (actually usually an entire layer in the neural network) then it’s basically saying that only positive signals should make it through and anything negative should be treated as zero. It’s basic noise reduction. This is a very popular function to use inside a neural network.
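In code, ReLU is about as small as a function gets. A quick sketch:

```python
def relu(x):
    # Zero for anything negative, the value itself otherwise
    return max(0, x)

print(relu(-2))  # 0 -- the negative signal is silenced
print(relu(2))   # 2 -- the positive signal passes through unchanged
```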
Sigmoid
f(x) = 1 / (1 + e^(-x))
The sigmoid function maps the output value of a neuron to a range between 0 and 1, creating a smooth, S-shaped curve. This means lower values get pushed closer to 0, and higher values approach 1. Sigmoid is particularly useful for binary classification tasks, where the goal is to separate data into two categories, such as “True or False” or “Big or Small.” The output of the sigmoid function can be interpreted as the probability of belonging to one category. For instance, if you’re classifying something as “Big” or “Small,” an output of 0.9 indicates a high probability that the input is “Big,” whereas 0.1 would suggest a low probability of “Big,” making “Small” the more likely outcome.
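Here is a quick sketch of sigmoid in Python, with a couple of hypothetical neuron outputs run through it:

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

print(sigmoid(-4))  # ~0.018 -> very likely the "Small" category
print(sigmoid(3))   # ~0.953 -> very likely the "Big" category
```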
Tanh
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
The tanh function also produces an S-shaped curve, but it maps the output value of a neuron to a range between -1 and 1. It is typically used when you want to capture a spectrum between two opposing values. For example, if you’re working with a grayscale image where 0 represents black and 255 represents white, tanh can scale the output so black becomes -1, white becomes +1, and grey becomes 0. This makes tanh useful when you want to emphasize contrast between opposites (like black and white) while also accounting for middle-ground values, such as shades of grey.
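And a quick sketch of tanh doing its squashing, using Python's built-in implementation:

```python
import math

for x in (-2, -1, 0, 1, 2):
    print(round(math.tanh(x), 3))
# -0.964, -0.762, 0.0, 0.762, 0.964 -- everything lands between -1 and 1
```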
Softmax
f(x_i) = e^(x_i) / sum(e^(x_j)) for all j
We’ve actually already used softmax, I just snuck it past you. In our colour detector example above, we used softmax on the outputs. This function is used on an entire layer at once – so it’s different in that it does not affect only a single neuron but a set of them. It basically says: normalize all the outputs into a probability distribution, where the sum of the probabilities equals 1. It’s perhaps best to give an example.
Let’s say I have 3 values: 5, 8 and 15. If I used softmax on these, they would become: 0.00004536, 0.00091101, 0.99904363. You can see that this function is trying to convert the outputs of the neurons into something that helps you easily identify the biggest outputs. It’s for when you are looking for an answer among many. LLMs like ChatGPT actually use softmax in their output layers to determine the probability of the next word (or token). They pick from the most probable next tokens in the list.
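Here is a small sketch of softmax that reproduces those numbers:

```python
import math

def softmax(values):
    # Exponentiate each value, then divide by the total so everything sums to 1
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([5, 8, 15]))
# [4.536e-05, 0.000911, 0.99904] -- the largest input grabs almost all the probability
```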
Deep Learning and Multi-layer Neural Networks
So what we have been doing is building a neural network that is able to detect a feature called colour. But as I’m sure you suspect, neural networks can be used to detect much more complex sets of features. The truth is that what we have already covered tells you most of what you need to know to understand how they pull it off. It’s always the same idea: you feed some input into a layer of neurons, each neuron turns its inputs into an output using a function that takes weights and a bias into account, and then an activation function decides whether that output should be altered further. Anything we talk about after this is basically just a slightly more complex version of the same thing. So at this point I am only going to touch on things quickly so that you get the main gist.
OK – let’s say you want to detect features that are somewhat more complicated than simple colors. Let’s say you want to detect shapes, for example. Well, it turns out that all you really need to do is add layers to your neural network. It’s important to remember the key ideas we have talked about so far – the way neurons work, the role of weights and biases and activation functions, and the process of tuning those parameters until a neuron generates the right output for the right signal. It’s important to remember how back-propagation teaches the neurons how to change their parameters based on how far from the ideal output they were, and that this process is iterative and gradual and guided by an algorithm called gradient descent. These are the same rules that apply no matter how big a neural network gets.
But what happens when you add layers to a neural network is that you can think of each layer as a kind of specialist feature detector. Each layer will identify one kind of feature and then pass the data on to the next layer which identifies another kind of feature, until by the time you reach the output layer, the values you are seeing are going to be different from each other for all sorts of different combinations of features.
The layers that you are adding in between the input and output layers are called “hidden layers”. Above I am showing a deep neural network with 1 hidden layer (the blue nodes). They are called “hidden” simply because you never observe their values directly: they are neither the inputs you feed in nor the outputs you read out.
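To give you a feel for how little code this takes, here is a minimal sketch of what a rainbow-colour classifier with one hidden layer might look like in TensorFlow's Keras API. The layer sizes and settings are arbitrary choices for illustration, not a recipe:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),               # 3 input features: R, G, B
    tf.keras.layers.Dense(8, activation="relu"),     # one hidden layer of 8 neurons
    tf.keras.layers.Dense(7, activation="softmax"),  # 7 outputs, one per rainbow colour
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(colours, labels, epochs=...) would then run the training loop for you
```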
When building neural networks to classify data, engineers play around with how many hidden layers there are and how many nodes are in each layer and they even play around with how many connections there are between layers. You can play around with this as well using this great interactive tool.
For example check out this before and after:
Here you can see that on the left, I selected a data set where I have 2 categories and the data happens to be arranged in quadrants. My neural network’s job is to correctly classify the data points into their quadrants. I have set the “Activation Function” to ReLU – I have 3 input neurons, 2 hidden layers with 4 neurons each. Before I hit play, you can see that the dotted lines connecting the neurons to each other are basically neutral. But when I hit play, the training starts to modify the parameters using back propagation, and after about 129 Epochs (or training iterations), my data looks pretty well sorted out and I can see which inter-neuron links are weighted most heavily, and I can also see which feature is being detected by which neuron.
You will have to experiment a little with this tool to really wrap your head around it – but I found it invaluable. Just change things up – see what difference the activation function makes – see if you can play around with the number of neurons or hidden layers – can you make this neural network find the correct classification faster than 129 epochs?
If you enjoyed playing with the above Tensorflow Playground tool, you might also really enjoy messing around with this interactive tool by Adam Harley. It lets you see how various features are detected by different layers in a network – in this case for a character recognition task. It’s loads of fun.
Convolutional Neural Networks
A term you will run into when you start learning about Neural Networks is “Convolutional Neural Networks” or CNNs. This is a deep topic and I am not going to bother you with the details of it because, you know what? You don’t need to know it deeply. I spent a long time trying to understand it and it basically comes down to this:
When you build a neural network, you are architecting it to very carefully detect detailed features of the data that you are passing into it. The neural networks become extremely sensitive to even the tiniest variations in features, which is why these neural networks have been performing so well at doing things like detecting tumors from X-rays and such. They are very sensitive. But sometimes, that sensitivity means that you miss the forest for the trees. The NN can become stuck in the land of details and miss the big picture. Take this example:
Have you seen this before? It looks like 3 girls enjoying coffee on a street corner. But try to squint at the image. Do you see anything else? Let me try to help you.
Do you see it now? You should have noticed an image that is reminiscent of common depictions of Jesus. This image was there the whole time, but it was difficult to see because we were focused on the details and missing some of the other, more macro-level structures in the data. In neural networks, some layers are added whose job it is to detect these. They take the data which has already been through a layer of detailed feature detection and pass it through a “squint” layer where a lot of this information is compressed and averaged. This helps macro features stand out. I’m kind of mixing two concepts together here – the Convolutional Layer itself as well as something called the Pooling layer. But I don’t think it matters very much because, as far as I can tell, you always pool after a convolution, so I’m using the term Convolution as shorthand for the entire process.
After convolution and pooling, your neural network will typically have fewer neurons than the layers right before. Shrinking the data down means that you have to combine nearby data points together using some sort of a function. Often it’s a Max or a Mean (average) function, where each value coming out of the pooling layer is some combination of neighbouring outputs from the previous layer. Here are some examples of how that might work:
On the left side of each image is the data you start with and on the right side is the data once you’ve done the convolution and pooling. You are losing some detail, but you are extracting some essential information about the data that you might have missed otherwise. For example, if you shrink that matrix of white, grey and black boxes down, it will probably look grey. Especially if other nearby grids have similar mixes of black, white and grey.
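Here is a tiny sketch of that "squinting" idea, using 2x2 average pooling on a made-up patch of greyscale pixels:

```python
import numpy as np

# A hypothetical 4x4 patch of greyscale pixels (0 = black, 255 = white)
patch = np.array([
    [  0, 255,   0, 255],
    [255,   0, 255,   0],
    [  0, 255,   0, 255],
    [255,   0, 255,   0],
])

# 2x2 average pooling: every 2x2 block collapses into its mean value
pooled = patch.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(pooled)
# [[127.5 127.5]
#  [127.5 127.5]] -- squinted at, the busy black-and-white checkerboard just looks grey
```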
OK – Enough
If at this point you are thinking “OK Rached, you are getting in the weeds a little here and you’re losing me,” I’ve got you. We really don’t need to go much further than this. I would, of course, encourage you to dig into neural networks on your own, because it’s surprisingly easy to build and train them. There are two well known frameworks for working with neural networks: PyTorch and Tensorflow. Which one you choose is really up to you. I think PyTorch is more popular, but I personally opted to use Tensorflow. I took this course on Udemy and not only did it force me to sharpen my data engineering skills, it taught me the basics of how to build and train neural networks to do mind blowing things.
Why is it important? Because all that we hear about today are Large Language Models and other types of generative AI platforms. These, at the root, are neural networks. When you hear that, for example, Llama 3.1 has a 405 Billion parameter model, you now know what they mean by that. It means that if you add up all the weights and biases that need to be adjusted to train the model, there are 405 Billion of them (roughly speaking). When you hear that a model is fine-tuned for a purpose, you now understand that it means the model has been exposed to specific training data that will make even more tweaks to the weights and biases so that the neurons output more of the kinds of outputs you need them to.
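If you want to see where a parameter count comes from, you can do the bookkeeping yourself for our tiny rainbow perceptron:

```python
# Counting the parameters in the rainbow perceptron (3 inputs, 7 output neurons)
weights = 3 * 7   # one weight per connection between an input and an output neuron
biases = 7        # one bias per output neuron
print(weights + biases)  # 28 -- the same bookkeeping, scaled up across huge layers, gets you to 405 billion
```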
And that’s all I really wanted to know when I learned about neural networks. I wanted to get a handle on the jargon and get a better sense of what it meant to build and train an AI. In the process, I actually picked up some pretty cool skills and if you have the time I encourage you to do the same. But my hope is that this blog is at least giving you the tools you need to wade into the conversation and get involved.
Where to from here?
I think that this blog, from this point forward, will become a little less like a curriculum and a little bit more random. I really wanted to share with you some of the very fundamental things that I felt were necessary for someone to know before entering the AI landscape. Now that we’ve done that, I think I want to share what I’ve actually been able to do with this knowledge. Since I started writing this blog, I’ve built several AI-based applications and I’ve actually taken on some work as a data scientist. My journey continues and I hope you’ll continue to accompany me as I stumble my way through this new world.
Thanks for reading.