What is the future of artificial intelligence?


I worked for a decade at NVIDIA as a CUDA Devtech Engineer and Solution Architect, researching deep learning techniques, presenting solutions to customers' problems, and helping to implement those solutions. For the past 3 years I have been working on what comes next after DNNs and Deep Learning. I will cover both, showing why it is very difficult to scale DNNs to AGI, and what a better approach would be.

Here is a quick graph of Deep Learning and AGI, and how I am estimating they might grow:

What we usually think of as Artificial Intelligence (AI) today, when we see human-like robots and holograms in our fiction, talking and acting like real people and having human-level or even superhuman intelligence and capabilities, is actually called Artificial General Intelligence (AGI), and it does NOT exist anywhere on earth yet. What we actually have for AI today is much simpler and much more narrow Deep Learning (DL) that can only do some very specific tasks better than people. It has fundamental limitations that will not allow it to become AGI, so if that is our goal, we need to innovate and come up with better networks and better methods for shaping them into an artificial intelligence.

Let's look at:

  1. Where Deep Learning and Reinforcement Learning are today
  2. What their limitations are – what can they do and not do?
  3. The neuroscience of human intelligence
  4. A possible architecture to achieve artificial general intelligence

Today’s Deep Learning and Reinforcement Learning

Let me write down some extremely simplistic definitions of what we do have today, and then go on to explain what they are in more detail, where they fall short, and some steps towards creating more fully capable ‘AI’ with new architectures.

Machine Learning – Fitting functions to data, and using the functions to group it or predict things about future data. (Sorry, greatly oversimplified)

Deep Learning – Fitting functions to data as above, where those functions are layers of nodes that are connected (densely or otherwise) to the nodes before and after them, and the parameters being fitted are the weights of those connections.

Deep Learning is what usually gets called AI today, but it is really just very elaborate pattern recognition and statistical modelling. The most common techniques / algorithms are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Reinforcement Learning (RL).

Convolutional Neural Networks (CNNs) have a hierarchical structure (which is usually 2D for images), where an image is sampled by (trained) convolution filters into a lower resolution map that represents the value of the convolution operation at each point. In images it goes from high-res pixels, to fine features (edges, circles, ...), to coarse features (noses, eyes, lips, ... on faces), then to the fully connected layers that can identify what is in the image. The cool part of CNNs is that the convolutional filters are randomly initialized, then when you train the network, you are actually training the convolution filters. For decades, computer vision researchers had hand-crafted filters like this, but could never get results as accurate as CNNs can get. Additionally, the output of a CNN can be a 2D map instead of a single value, giving us an image segmentation. CNNs can also be used on many other types of 1D, 2D and even 3D data.
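As a rough illustration of the convolution step described above, here is a minimal pure-Python sketch. The image, the filter values and the function name `conv2d` are all made up for this example. The filter here is hand-crafted to respond to a vertical edge, whereas a real CNN would start from random values and learn them during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products at each point."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

# Illustrative 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-crafted vertical-edge filter (an assumption for this sketch, not a trained one).
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [0.0, 2.0, 0.0, 0.0]: the filter responds only at the edge
```

The output map is smaller than the input (3x4 instead of 4x5) and is non-zero only where the dark-to-bright edge sits, which is the "fine feature" stage of the hierarchy.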

Recurrent Neural Networks (RNNs) work well for sequential or time series data. Basically each ‘neural’ node in an RNN is kind of a memory gate, often an LSTM or Long Short-Term Memory cell. When these are linked up in layers of a neural net, these cells/nodes also have recurrent connections looping back into themselves and so tend to hold onto information that passes through them, retaining a memory and allowing processing not only of current information, but past information in the network as well. As such, RNNs are good for time sequential operations like language processing or translation, as well as signal processing, Text To Speech, Speech To Text, and so on.
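To make the "looping back into themselves" idea concrete, here is a minimal sketch of a single plain recurrent unit. It is deliberately not an LSTM (which adds gates on top of this), and the weights `w_in` and `w_rec` are arbitrary values chosen for illustration rather than trained ones.

```python
import math

def run_recurrent_cell(inputs, w_in=1.0, w_rec=0.9):
    """One recurrent unit: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        # The previous state h feeds back in, so past inputs keep influencing the present.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# A single input pulse followed by silence.
states = run_recurrent_cell([1.0, 0.0, 0.0, 0.0])
print(states)
```

Even though the input is zero after the first step, the state stays positive and fades gradually, which is the simplest form of the memory the paragraph above describes.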

Reinforcement Learning is a third main ML method, where you train a learning agent to solve a complex problem by simply taking the best actions given a state, with the probability of taking each action at each state defined by a policy. An example is running a maze, where the position of each cell is the ‘state’, the 4 possible directions to move are the actions, and the probability of moving each direction, at each cell (state) forms the policy.

You train by repeatedly running through the states and possible actions, rewarding the sequence of actions that gave a good result (by increasing the probabilities of those actions in the policy), and penalizing the actions that gave a negative result (by decreasing the probabilities of those actions in the policy). In time you arrive at an optimal policy, which has the highest probability of a successful outcome. Usually while training, you discount the penalties/rewards for actions further back in time.

In our maze example, this means allowing an agent to go through the maze, choosing a direction to move from each cell by using the probabilities in the policy, and when it reaches a dead-end, penalizing the series of choices that got it there by reducing the probability of moving that direction from each cell again. If the agent finds the exit, we go back and reward the choices that got it there by increasing probabilities of moving that direction from each cell. In time the agent learns the fastest way through the maze to the exit, or the optimal policy. Variations of Reinforcement learning are at the core of the AlphaGo AI and the Atari Video Game playing AI.
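The maze procedure above can be sketched in a few dozen lines. This is a toy under stated assumptions, not the algorithm behind AlphaGo or the Atari agents: the 3x3 maze layout, the softmax over per-cell preferences, and the values of `lr`, `gamma` and `max_steps` are all invented for illustration, and running out of steps is treated as a failure.

```python
import math
import random

# Hypothetical maze: S = start, E = exit, X = dead end, # = wall, . = open cell.
MAZE = ["S.X",
        "#.#",
        "X.E"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
NAMES = list(ACTIONS)

def step(state, action):
    """Move one cell; bumping into a wall or the border leaves the agent in place."""
    r, c = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if 0 <= r < 3 and 0 <= c < 3 and MAZE[r][c] != "#":
        return (r, c)
    return state

def action_probs(prefs, state):
    """The policy: a softmax turns the stored preferences for a cell into probabilities."""
    top = max(prefs[state].values())
    exps = {a: math.exp(p - top) for a, p in prefs[state].items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def train(episodes=500, lr=0.2, gamma=0.8, max_steps=30, seed=0):
    rng = random.Random(seed)
    cells = [(r, c) for r in range(3) for c in range(3) if MAZE[r][c] != "#"]
    prefs = {s: {a: 0.0 for a in NAMES} for s in cells}  # all zero = uniform policy
    for _ in range(episodes):
        state, history, outcome = (0, 0), [], -1.0  # timeout counts as failure
        for _ in range(max_steps):
            probs = action_probs(prefs, state)
            action = rng.choices(NAMES, weights=[probs[a] for a in NAMES])[0]
            history.append((state, action))
            state = step(state, action)
            cell = MAZE[state[0]][state[1]]
            if cell == "E":
                outcome = 1.0
                break
            if cell == "X":
                break
        # Reward or penalise the choices made, discounting those further back in time.
        for k, (s, a) in enumerate(reversed(history)):
            prefs[s][a] += lr * outcome * gamma ** k
    return prefs

prefs = train()
policy = {s: action_probs(prefs, s) for s in prefs}
print(policy[(2, 1)])  # probabilities for the cell next to the exit
```

Inspecting `policy` after training shows how the probabilities in each cell have shifted away from the uniform starting point; in the cell next to the exit, moving right is only ever rewarded and moving left is only ever penalised.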

But all these methods just find a statistical fit – DNNs find a narrow fit of outputs to inputs that does not usually extrapolate outside the training data set. Reinforcement learning finds a pattern that works for the specific problem (as we all did vs 1980s Atari games), but not beyond it. With today’s ML and deep learning, the problem is there is no true perception, memory, prediction, cognition, or complex planning involved. There is no actual intelligence in today’s AI.

The Neuroscience of Human Intelligence

To go beyond where we are today with AI, to pass the threshold of human intelligence, and create an artificial general intelligence requires an AI to have the ability to see, hear, and experience its environment. It needs to be able to learn that environment, to organize its memory non-locally and store abstract concepts in a distributed architecture so it can model its environment, and people in it. It needs to be able to speak conversationally and interact verbally like a human, and be able to understand the experiences, events, and concepts behind the words and sentences of language so it can compose language at a human level. It needs to be able to solve all the problems that a human can, using flexible memory recall, analogy, metaphor, imagination, intuition, logic and deduction from sparse information. It needs to be capable at the tasks and jobs humans are and express the results in human language in order to be able to do those tasks and professions as well as or better than a human.

The human brain underwent a very complicated evolution starting 1 billion years ago from the first multi-cellular animals with a couple of neurons, through the Cambrian explosion where eyes, ears and other sensory systems, motor systems, and intelligence exploded in an arms race (along with armor, teeth, and claws). Evolution of brains then followed the needs of fish, reptiles, dinosaurs, mammals, and finally the hominid lineage starting about 5-10 million years ago.

Many of the older parts of the human brain evolved for the first billion years of violence and competition, not the last thousands of years of human civilization, so in many ways our brain is maladapted for our modern life in the information age, and not very efficient at many of the tasks we use it for in advanced professions like law, medicine, finance, and administration. A synthetic brain, focused on doing these tasks optimally, can probably end up doing them much better, so we do not seek to re-create the biological human brain, but to imbue ours with the core functionality that makes the human brain so flexible, adaptable and powerful, then augment that with CS database and computing capabilities to take it far beyond human.

Because deep learning DNNs are so limited in function and can only train to do narrow tasks with pre-formatted and labelled data, we need better neurons and neural networks with temporal spatial processing and dynamic learning. The human brain is a very sophisticated bio-chemical-electrical computer with around 100 billion neurons and 100 trillion connections (and synapses) between them. I will describe two decades of neuroscience in the next two paragraphs, but there are two good videos about the biological Neuron and Synapse from ‘2-Minute Neuroscience’ on YouTube that will also help.

Each neuron takes in spikes of electrical charge from its dendrites and performs a very complicated integration in time and space, resulting in the charge accumulating in the neuron and (once it exceeds a threshold potential) causing the neuron to fire spikes of electricity out along its axon, moving in time and space as that axon branches and re-amplifies the signal, carrying it to thousands of synapses, where it is absorbed by each synapse. This process causes neurotransmitters to be emitted into the synaptic cleft, where they are chemically integrated (with ambient neurochemistry contributing). These neurotransmitters migrate across the cleft to the post-synaptic side, where their accumulation in various receptors eventually causes the post-synaptic side to fire a spike down along the dendrite to the next neuron. When two connected neurons fire sequentially within a certain time, the synapse between them becomes more sensitive, or potentiated, and then fires more easily. We call this Hebbian learning, which is constantly occurring as we move around and interact with our environment.
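A heavily simplified sketch of the two ideas in this paragraph, charge integration up to a firing threshold and Hebbian potentiation, is a leaky integrate-and-fire neuron driven through one plastic synapse. Real neurons are far richer than this; the function name `simulate` and every parameter value (`weight`, `leak`, `threshold`, `window`, `learn_rate`) are assumptions chosen only to make the behaviour visible.

```python
def simulate(pre_spikes, weight=0.4, leak=0.9, threshold=1.0,
             window=3, learn_rate=0.05):
    """Drive one leaky integrate-and-fire neuron through a single plastic synapse.

    pre_spikes is a list of 0/1 values, one per time step.
    Returns (post_spikes, final_weight).
    """
    potential = 0.0
    last_pre = None  # time step of the most recent presynaptic spike
    post_spikes = []
    for t, pre in enumerate(pre_spikes):
        potential *= leak              # accumulated charge leaks away over time
        if pre:
            potential += weight        # an incoming spike adds charge via the synapse
            last_pre = t
        fired = potential >= threshold
        if fired:
            potential = 0.0            # reset after firing
            # Hebbian rule: pre fired shortly before post, so potentiate the synapse.
            if last_pre is not None and t - last_pre <= window:
                weight += learn_rate
        post_spikes.append(1 if fired else 0)
    return post_spikes, weight

# A steady train of presynaptic spikes.
post, w = simulate([1] * 20)
print(post)
print(round(w, 2))
```

With these illustrative settings the neuron first needs three input spikes to fire, and after a few potentiation steps it fires on every second input spike: the synapse has become more sensitive, as described above.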

The brain is organized into cortices for processing sensory inputs, motor control, language understanding, speaking, cognition, planning, and logic. Each of these cortices evolved to have networks with very sophisticated space and time signal processing, including feedback loops and bidirectional networks, so visual input is processed into abstractions or ‘thoughts’ by one directional network, and then those thoughts are processed back out to a recreation of the expected visual representation by another, complementary network in the opposite direction, and they are fed back into each other throughout. Miguel Nicolelis is one of the top neuroscientists to measure and study this bidirectionality of the sensory cortices.

For an example, picture a ‘fire truck’ with your eyes closed and you will see the feedback network of your visual cortex at work, allowing you to visualize the ‘thought’ of a fire truck into an image of one. You could probably even draw it if you wanted. Try looking at clouds, and you will see shapes that your brain is feeding back to your vision as thoughts of what to look for and to see. Visualize shapes and objects in a dark room when you are sleepy, and you will be able to make them take form, with your eyes open.

These feedback loops not only allow us to selectively focus our senses, but also train our sensory cortices to encode the information from our senses into compact ‘thoughts’ or Engrams that are stored in the hippocampus short term memory. Each sensory cortex has the ability to decode them again and to provide a perceptual filter by comparing what we are seeing to what we expect to see, so our visual cortex can focus on what we are looking for and screen the rest out as we stated in the previous paragraph.

The frontal and pre-frontal cortex are thought to have tighter, more specialized feedback loops that can store state (short-term memory), operate on it, and perform logic and planning at the macroscale. All our cortices (and brain) work together and can learn associatively and store long-term memories by Hebbian learning, with the hippocampus being a central controller for memory, planning, and prediction.

Human long-term memory is less well understood. We do know that it is non-local, as injuries to specific areas of the brain don’t remove specific memories, and neither does a hemispherectomy, which removes half the brain. Rather, any given memory appears to be distributed through the brain, stored like a hologram or fractal, spread out over a wide area in thin slices. We know that global injury to the brain, like Alzheimer’s, causes a progressive global loss of all memories, which all degrade together, but no structure in the brain seems to contribute more to this long-term memory loss than another.

However, specific injury to the hippocampus causes the inability to transfer memory between short-term and long-term memory. Coincidentally, it also causes the inability to predict and plan, along with other cognitive deficits, showing that all these processes are similar. This area is the specialty of prominent memory neuroscientist Eleanor Maguire, who states that the reason for memory in the brain is not to recall an accurate record of the past, but to predict the future and reconstruct the past from the scenes and events we experienced, using the same stored information and process in the brain that we use to look into the future to predict what will happen, or to plan what to do. Therefore the underlying storage of human memories must be structured in an abstracted representation, in such a way that memories can be reconstructed for the purpose at hand, be it reconstructing the past, predicting the future, planning, or imagining stories and narratives – all hallmarks of human intelligence.

Replicating all of the brain’s capabilities seems daunting when seen through the tools of deep learning – image recognition, vision, speech, natural language understanding, written composition, solving mazes, playing games, planning, problem solving, creativity, imagination – because deep learning uses single-purpose components that cannot generalize. Each of the DNN/RNN tools is a one-off, a specialization for a specific task, and there is no way we can specialize and combine them all to accomplish all these tasks.

But, the human brain is simpler, more elegant, using fewer, more powerful, general-purpose building blocks – the biological neuron, and connecting them by using the instructions of a mere 8000 genes, so nature has, through a billion years of evolution, come up with an elegant and easy to specify architecture for the brain and its neural network structures that is able to solve all the problems we met with during this evolution. We are going to start by just copying as much about the human brain’s functionality as we can, then using evolution to solve the harder design problems.

So now we know more about the human brain, and how the neurons and neural networks in it are completely different from the DNNs that deep learning is using, and how much more sophisticated our simulated neurons, neural networks, and cortices would have to be to even begin attempting to build something on par with, or superior to, the human brain.

Here is the video about neuroscience and AGI that I submitted to the NVIDIA GTC 2021 Conference

How can we build an Artificial General Intelligence?

To build an AGI, we need better neural networks, with more powerful processing in both space and time and the ability to form complex circuits with them, including feedback. We will pick spiking neural networks, which have signals that travel between neurons, gated by synapses.

With these, we can build bidirectional neural network autoencoders that take sensory input data and encode it to compact engrams with the unique input data, keeping the common data in the autoencoder. This allows us to process all the sensory inputs – vision, speech, and many others into consolidated, usable chunks of data called engrams, stored to short-term memory.

Now to store it to long term memory, we process a set of input engrams to reside in a multi-layered, hierarchical, fragmented long-term memory. First we sort the engrams into clusters based on the most important information axis, then autoencode those clusters further with bidirectional networks to create engrams that highlight the next most important information, and so on. At each layer, the bidirectional autoencoder is like a sieve, straining out the common data or features in the cluster, leaving the unique identifying information in each engram, allowing them to then be sorted on the next most important identifying information. Our AI basically divides the world it perceives by distinguishing features, getting more specific as it goes down each level, with the lowest level engram containing the key for how to reconstruct the engram from the features stored in the hierarchy. This leaves it with a differential, non-local, distributed, Hierarchical Fragmented Memory (HFM), containing an abstracted model of the world, similar to how human memory is thought to work.
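To show the shape of this idea, here is a toy sketch that swaps in much simpler machinery: instead of bidirectional spiking autoencoders, each level just subtracts the cluster mean (standing in for the "common data" kept in the hierarchy) and splits the leftovers along the axis with the largest spread (standing in for the "most important information axis"). The names `build_memory` and `recall` and the four feature vectors are hypothetical, and this is in no way the actual HFM implementation.

```python
def build_memory(items, depth):
    """items: list of (key, vector). Returns a nested dict representing the hierarchy."""
    dims = len(items[0][1])
    # The shared part of this cluster is kept at this level of the hierarchy.
    common = [sum(v[d] for _, v in items) / len(items) for d in range(dims)]
    # Only what is unique to each item is passed further down.
    residuals = [(k, [x - c for x, c in zip(v, common)]) for k, v in items]
    # Stand-in for "most important axis": the dimension with the largest spread.
    axis = max(range(dims), key=lambda d: sum(r[d] ** 2 for _, r in residuals))
    low = [(k, r) for k, r in residuals if r[axis] < 0]
    high = [(k, r) for k, r in residuals if r[axis] >= 0]
    if depth == 0 or not low or not high:
        # Lowest level: store the small leftover engram under its key.
        return {"common": common, "engrams": dict(residuals), "children": []}
    return {"common": common, "engrams": {},
            "children": [build_memory(low, depth - 1), build_memory(high, depth - 1)]}

def recall(node, key):
    """Rebuild a stored vector by adding back the common parts along its path."""
    if key in node["engrams"]:
        rest = node["engrams"][key]
    else:
        rest = None
        for child in node["children"]:
            rest = recall(child, key)
            if rest is not None:
                break
        if rest is None:
            return None
    return [c + r for c, r in zip(node["common"], rest)]

# Hypothetical feature vectors (for example, already-encoded faces).
faces = [("Ann", [0.9, 0.1, 0.3]), ("Bob", [0.8, 0.2, 0.9]),
         ("Cy", [0.1, 0.9, 0.4]), ("Dee", [0.2, 0.8, 0.7])]
memory = build_memory(faces, depth=2)
print(recall(memory, "Bob"))  # close to the original [0.8, 0.2, 0.9], up to rounding
```

No single node holds a complete item: each stored vector is spread across the common parts on its path plus a small leaf residual, and reconstruction walks back up from the leaf. That mirrors the non-local, differential storage described above, though only in the loosest sense.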

An example of our encoding process is processing faces. We encode the pictures of faces using the process above. Then we apply alternating layers of autoencoding and clustering to keep sorting those faces and encoding them by implicit features that could be eye color, hair style, hair color, nose shape, and other features (implicitly determined by the layers of autoencoding, and with bins for different classes of features overlapping) – to create a facial recognition system that just by looking at people, autoencodes their face and its features and can assign the associated name that was heard when they were introduced – to that person’s face. Later when we meet a new person, the memory structure and autoencoders are already there to encode them quickly and compactly.

It also encodes language (spoken and written) along with the input information, turning language into a skeleton embedded in the HFM engrams that is used to reference the data and to mold it, with the HFM giving structure and meaning to the language.

When our AI wants to reconstruct a memory (or create a prediction), it works from the bottom up, using language or other keys to select the elements it wants to propagate upwards, re-creating scenes, events, and people, or creating imagined events and people from the fragments by controlling how it traverses upwards. It is this foundation that all of the rest of our design is based on, as once we can re-create past events and imagine new events, we have the ability to predict the future, and plan possible scenarios, doing cognition and problem solving.

We may be able to build a very simple brain that demonstrates these principles, but to scale, we need Charles Darwin – evolutionary genetic algorithms. Basically we define every neuron, synapse, and neural network parameter and how they are organized into layers and cortices and brains – by a genome.

The human brain is represented by only 8000 genes, and decoded by the growth process during fetal development. We will do the same: we can’t run genetic algorithms directly on 100 billion neurons, but we can run them much more efficiently on a few thousand genes, then expand the cross-bred genes into 100 billion neuron brains.
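The compact-genome idea can be sketched with a tiny genetic algorithm. Everything here is a stand-in: the two-gene genome, the `expand` growth rule, the `TARGET` behaviour and the fitness measure are invented for illustration, and a real system would expand the genome into spiking networks rather than a list of weights.

```python
import random

N_NEURONS = 1000  # size of the expanded "brain" (a tiny stand-in for billions)

def expand(genome):
    """Growth rule: two genes (slope, offset) specify all N_NEURONS weights."""
    slope, offset = genome
    return [slope * i / N_NEURONS + offset for i in range(N_NEURONS)]

TARGET = expand([0.7, -0.3])  # hypothetical behaviour we are selecting for

def fitness(genome):
    """Higher is better: negative squared error of the expanded network."""
    return -sum((w - t) ** 2 for w, t in zip(expand(genome), TARGET))

def evolve(generations=40, pop_size=20, keep=5, seed=1):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:keep]  # selection; the parents also survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # crossover, gene by gene
            child = [g + rng.gauss(0, 0.05) for g in child]   # mutation
            children.append(child)
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history

best_genome, history = evolve()
print(best_genome, history[0], history[-1])
```

Crossover and mutation only ever touch the two genes, while fitness is always measured on the expanded 1000-weight network. That is the point of the design: the search space stays small even as the expanded network grows.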

So as we breed generations of ever more sophisticated artificial brains, with more efficient neural networks specialized for specific purposes, we want to steer it into being human-like, or at least able to act and think like a human. For one, we could apply the same cognitive tests we do for children, starting from age 5 and up, to develop them like a human child. Seems logical.

Then, as the AGI starts to grow up – we can pull a trick from the film VFX animation community – do a motion/performance capture of a person, recording their motion, facial expressions and speech, as they go through everyday routines, then setting our artificial brain to train on that dataset, and keep selecting the ones that perform best every generation till we get a human mimic. It will not be AGI, nor human-level intelligence, but it is the best we can do till we make these things have to think more.

To take that all the way to AGI, I would create multiple such AI mimics and put them to work in different professions, writing some specialty code, and evolving specific AIs for each profession, so they have a broad but shallow layer of being conversation bots, but deep skills in their profession.

Now if we have a network of hundreds of different professions, serving millions of clients at once, all with the same brain architecture, with common language and interaction capabilities, how do we make an AGI? Maybe we just network them and that becomes an AGI?

At the very least, we have a framework and input and output on which to train and evolve an AGI, so that all the specialty skills of each vocation are assumed by a more generalized AGI brain, and in the process, that AGI brain becomes better at all the skills humans excel at.

From there, we just keep going till we have a superintelligence, and beyond. Here is the diagram we started with. Deep neural nets don’t scale past a certain point because no matter how many more layers of neurons we add, no matter how much labelled data we train with, and no matter how much compute we throw at the problem, the underlying network model is too crude, too approximate, and gains no further cognitive or problem solving capability with these increases. It plateaus for simple problems and is incapable of solving more complex problems.

In our AGI design, we have mapped a very powerful, general purpose, analog spiking neural network computer on top of powerful NVIDIA GPU digital computers, and that flexible design not only scales with compute power, but also scales as those neural networks evolve faster and faster to become larger, broader, and more functional and efficient with time, giving an overall exponential increase in capability greater than the Moore’s law increase in the underlying processor power.