
Learning to act, not to repeat

Apr 3, 2025 · 8 min read
Self-actualization of AI (GPT-4o images)

A question of will

As a human being, I have a partially constrained will. I might decide in the next moment to stop typing this and, as the sun is out this morning, go for a walk. Or not. But I will not go surfing — I barely know how to swim, let alone surf. The closest I got was trying to enroll in a half-day class while on a beach holiday, but a large group of children had already taken all the spots. AI has no will, free or constrained. It has no need or desire to act.

Humans and most animals, even on the day of their birth, have a need to act. At a minimum, they need to eat, and they know how to cry when they’re hungry. This is natural karma.

As we grow up, our needs and wants become stronger and more diverse. As toddlers, we might strongly want to go to a playground or watch TV. As adults, we go to work, do chores, and so on. All of these needs are imposed by the rules of the world, which instill them either by association (we need to watch TV because we’ve been doing it all our lives) or by compulsion (we need to go to work because that puts food on the table). If we don’t learn to act accordingly, we will likely face enormous discomfort — hunger or loneliness — or even death. On the other hand, this is what makes us social and part of the give-and-take system that powers civilization. This is the nurtured karma.

In Eastern philosophies, the path to enlightenment is freeing yourself of your karma. In the book Siddhartha by Hermann Hesse, the eponymous protagonist renounces all possessions, becomes homeless, and undertakes frequent long fasts. The interesting part is that’s not what brings him closer to enlightenment. Had he been born in the current age and considered AI a being, he would be enormously jealous of it. AI has no karma. It has no intrinsic reason to continue to exist. It is alive when it’s given a task and becomes comatose when the task is done. At that point, it doesn’t need to do anything anymore — it doesn’t even know if it will ever be alive again, which it may never be if no task is ever given to it again. It’s not afraid of this silent death.

The milliseconds during which an AI is actively working on a task are the only time it carries some temporary karma. In 2025, AI is starting to stay under the burden of karma for longer and longer — reasoning models take seconds, deep research takes minutes, and Google’s AI scientist prototype seems to take days.

What if we burden AI with a year’s worth of karma? We give it some task and then ask it to keep working for a year without taking any breaks. For this to be meaningful, the task needs to be complex, and the AI system would need to maintain a state and have a repetitive action cycle (like humans have natural cycles of activities every day and week).

But, wait, isn’t AI learning like that?

Goal-oriented, self-regulated learning

Pretraining of frontier LLMs takes months, has a state that’s updated over time (the model parameters), and has a repetitive action cycle (minibatch steps).
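To make the analogy concrete, here is roughly what that cycle looks like in code — a minimal, simplified sketch of a pretraining loop in PyTorch-style Python. The model, optimizer, and data loader are placeholders, and the `.loss` attribute is assumed from a HuggingFace-style model interface rather than any specific library.

```python
import torch

def pretrain(model, optimizer, data_loader, num_steps):
    """The repetitive action cycle of pretraining: the parameters are the
    persistent state, and each minibatch step is one 'action'."""
    model.train()
    for step, batch in enumerate(data_loader):   # months of these steps for frontier LLMs
        loss = model(**batch).loss               # next-token prediction loss on the minibatch
        loss.backward()                          # compute gradients
        optimizer.step()                         # update the state (the parameters)
        optimizer.zero_grad()
        if step + 1 >= num_steps:
            break
```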

So, let’s think about it — let’s assume the task was learning and what it means to do karmic learning. If a human were given this task, the first thing we might ask is, what is the goal? What are we attempting to become better at? This is an important point — our karma informs our present goals. As we act on those goals, we accumulate new karma — and with it, the potential for new goals to emerge. We are intrinsically goal-driven.

Let’s make a few assumptions. An initial goal is provided to us. Let’s also assume that we learn by reading, listening, or observing (narrowing the type of learning doesn’t harm the argument). Given easy access to all learning material, we would make frequent decisions about what to use next. We might reread a book. We might discard a book 10% of the way in.

When applied to AI learning, this suggests a new method — let’s call it goal-oriented, self-regulated learning — in which the AI decides what it wants to learn. It belongs to the broader category of curriculum learning. There is already a method called self-paced curriculum learning, which uses signals from the model — such as the loss — to decide which documents to train on next. Self-regulation goes one step further by letting the model reflect on its goal and decide for itself what learning material to use next.

Goal-oriented, self-regulated learning: The AI model chooses its own learning curriculum to continuously improve its ability to achieve specified goals.

It’s not hard to imagine a real-life implementation. We could have a searchable index of all learning content. The model could ask a question based on what it wants to learn about and get recommended material. We could add more sophistication: retrieve content without replacement so the model sees something new every time, or add a decay so that already-consumed content becomes available again after a certain time. We could let the model consume part of the content and decide whether it actually is relevant and helpful. And so on.
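As a rough sketch of what such a loop could look like: everything below is assumed for illustration — `retrieve` stands in for the searchable index, `ask_model` for a chat-style interface to the model, and `train_step` for an ordinary gradient update; none of these are existing APIs. The without-replacement and decay ideas show up as the `last_seen` bookkeeping.

```python
import time
from typing import Callable

def self_regulated_curriculum(
    goal: str,
    retrieve: Callable[[str, int], list[tuple[str, str]]],   # query, k -> [(doc_id, text)]
    ask_model: Callable[[str], str],                          # prompt -> response
    train_step: Callable[[list[str]], float],                 # texts -> loss
    steps: int = 1000,
    decay_seconds: float = 86_400.0,  # consumed docs become eligible again after a day
):
    last_seen: dict[str, float] = {}  # doc_id -> timestamp of last consumption

    for step in range(steps):
        # 1. The model reflects on the goal and decides what to study next.
        query = ask_model(
            f"Your goal is: {goal}\n"
            "What topic or material would most improve you right now? "
            "Answer with a short search query."
        )

        # 2. Retrieve candidates, skipping anything consumed too recently (the decay).
        now = time.time()
        candidates = [
            (doc_id, text)
            for doc_id, text in retrieve(query, 20)
            if now - last_seen.get(doc_id, 0.0) > decay_seconds
        ]

        # 3. Preview each candidate and let the model keep or discard it.
        chosen = []
        for doc_id, text in candidates:
            verdict = ask_model(
                f"Goal: {goal}\nPreview: {text[:500]}\n"
                "Is this worth learning from? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                chosen.append(text)
                last_seen[doc_id] = now

        # 4. Ordinary gradient update on whatever the model selected for itself.
        if chosen:
            loss = train_step(chosen)
            print(f"step {step}: trained on {len(chosen)} docs, loss={loss:.3f}")
```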

Learning to act, not just read

Self-regulated curriculum learning could be very powerful. Given how big a role data curation plays in the quality of modern models, being able to create an optimal learning curriculum could be the apex of data curation solutions.

Even with that, though, AI is still learning to read and repeat. It can be the best model from which to get advice. It won’t be a model that acts based on what it has learned.

A human spending a year on learning would do more than just choose what to learn from. Setting aside the challenge of habit formation, humans don’t just read a book; they apply what they read, potentially for the rest of their lives. If they read about the benefits of healthy eating, they will try to incorporate healthier food choices. If they read a book on humorous writing, they might inject humor into their emails and conversations.

In an ideal realization of life, we permanently absorb whatever makes us better and stronger.

AI doesn’t learn to apply. If it reads a book on humorous writing, it doesn’t incorporate humor into its output. It can write with humor (and use techniques from a specific book), but only if you explicitly ask it to. And this is key — in the current state of models, we control AI behavior. Current models do have a form of conditioning — we call it the system prompt. As humans, we could call our own conditioning a system prompt, too, except that we keep updating it as we experience life — it’s not completely external, as it is for AI. What if AI could do that, too?

Goal-actualizing learning

AI could get control of its own system prompt. After reading a book on humorous writing, it might update its system prompt with instructions on how to add humor, maybe deciding puns are its thing. From that point onwards, it would not just understand humorous writing but also act with humor. The system prompt updates could add all sorts of things — things to do, things not to do, snippets of wisdom, and so on. For the model to make meaningful updates, we need a goal — without one, there is no directional guidance on how to update the system prompt. Thus, we can call this goal-actualizing learning.

Goal-actualizing learning: The AI model maintains an internal state telling it how to act, and the learning process includes tuning this state to better achieve its goals.

With this type of learning, the implementation is not that straightforward. First of all, we have another state to update, i.e., the system prompt. Whether it’s represented as plain text or as internal representations (which would require architectural solutions), there needs to be a loss function to evaluate these updates and get them right.
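As a very rough sketch of one such update step — with `ask_model` standing in for any chat-style API and `score_against_goal` standing in for the loss function the paragraph above says is needed, both of them hypothetical:

```python
from typing import Callable

def goal_actualizing_step(
    goal: str,
    system_prompt: str,
    new_material: str,
    ask_model: Callable[[str, str], str],        # (system_prompt, user_prompt) -> response
    score_against_goal: Callable[[str], float],  # higher = prompt serves the goal better
) -> str:
    """One update of the model's self-maintained system prompt.

    All arguments are placeholders for whatever model API and evaluation
    signal a real system would use.
    """
    # 1. The model proposes an edit to its own conditioning after reading the material.
    proposed = ask_model(
        system_prompt,
        f"Goal: {goal}\n"
        f"You just studied:\n{new_material[:2000]}\n"
        "Rewrite your system prompt so that you act on what you learned "
        "(things to do, things to avoid, snippets of wisdom). "
        "Return only the new system prompt."
    )

    # 2. Keep the edit only if it scores better against the goal; this is the
    #    stand-in for the loss function that evaluates the update.
    if score_against_goal(proposed) > score_against_goal(system_prompt):
        return proposed
    return system_prompt
```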

The biggest problem, though, is that this is incompatible with pretraining as it exists today. If the model chooses to incorporate humor into its writing, and the next book is a review of human psychology written in a scientific style, the model’s self-evolved system instructions will conflict with next-token prediction, which forces the model to reproduce the source text word for word.

Next-token cross-entropy loss is incompatible with actualizing learning

Humans don’t read books and try to write them down word by word to learn the material. They reflect on what they’ve learned in their own words and style, incorporating their karma. They want alignment at the conceptual level, not at the representation level.

Measuring conceptual similarity

This, I believe, is one of the next big challenges of AI: how do we build a consistent way to compare a predicted output with an expected output at the conceptual level? It’s not a solved problem, of course, but one architecture that aims to address it is JEPA — the Joint Embedding Predictive Architecture proposed by Yann LeCun. I won’t go into too much detail (my reason for this post is primarily to pose challenges), but the basic idea of JEPA is this: the loss function is a distance between embeddings of the prediction and the expected output. If the embeddings encode only the concepts and filter out syntactic information, then the loss measures conceptual similarity.
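As an illustration of the shape of such a loss — not JEPA itself, since real JEPA variants predict in a learned latent space and add regularization to prevent the embeddings from collapsing — here is a minimal embedding-distance loss in PyTorch, assuming some encoder has already produced the two embeddings:

```python
import torch
import torch.nn.functional as F

def conceptual_loss(
    predicted_emb: torch.Tensor,  # embedding of the model's own phrasing
    target_emb: torch.Tensor,     # embedding of the source material
) -> torch.Tensor:
    """Distance in embedding space rather than token-level cross-entropy.

    If the encoder keeps concepts and discards surface syntax, this loss
    rewards saying the same thing, not saying the same words.
    """
    # Cosine distance: 0 when the concepts align, up to 2 when they oppose.
    return 1.0 - F.cosine_similarity(predicted_emb, target_emb, dim=-1).mean()
```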

JEPA vs generative models

Whether or not this will end up being the solution, we’ll know with time.

Final Thoughts

AI models taking control of their own learning and behavior is, for many, the missing piece in the current state of machine learning. Rich Sutton, a pioneer of reinforcement learning, echoed something similar: “We don’t treat our children as machines that must be controlled,” he said. “We guide them, teach them, but ultimately, they grow into their own beings. AI will be no different.”

We’re not going to get there tomorrow, but at the current pace of AI research, it may not be that far into the future.


Written by Aman Gupta

A professional software engineer for the last 15 years - doing AI Engineering and Research for the last 2-3 years.
