
Deep Reinforcement Learning: Building a “Self-Driving Car” in the Financial World

Hieu Nguyen

Published June 14, 2025
Updated June 22, 2025
8 min read

Summary:

  • What makes Deep Reinforcement Learning different?

  • Markov Decision Process and its application

  • How to find the Optimal Policy using Q-learning?

  • Building a “Self-Driving Car” in the financial world

When people talk about deep learning, many seem to associate it only with computer science. Today, we will look at a branch of deep learning that draws on many areas of both natural and social science: Deep Reinforcement Learning. Deep Reinforcement Learning is changing our world every day. It helps scientists explore new disease treatments, engineers develop self-driving cars, and investors maximize their profits.

What makes Deep Reinforcement Learning different?

Reinforcement Learning is a machine learning approach in which algorithms are trained through reward and punishment. It is a trial-and-error, self-learning process without a supervisor: the agent receives a reward signal for each of its actions. Moreover, rewards and punishments can be delayed, meaning that the result of an action cannot always be observed instantly. Instead, the effects of an action may become visible only many steps later.

Let’s consider the example of a self-driving car. Suppose that the car’s goal is to get from point A to point B in the shortest amount of time. There are three actions the algorithm can take: accelerate, brake, or do nothing. A decision to accelerate looks very good in the short term, since it shortens the time to reach the destination. In the long run, however, it increases the probability of causing an accident.

In fact, reinforcement learning is a combination of many different fields, ranging from mathematics and economics to neuroscience. For instance, in economics, researchers study game theory. In neuroscience, they study how our brains function and make decisions. Similarly, in engineering, researchers study optimal control. Since all of these subjects try to optimize the reward of a set of actions, they can be considered facets of the same reinforcement learning problem.

(Source: David Silver)

Deep reinforcement learning has multiple applications in real life, such as self-driving cars, game playing, and chatbots. Coming back to the earlier example, a model can learn how to drive a car by trying different sets of actions and analyzing the resulting rewards and punishments. Another example is chatbots, where the program can learn what to communicate and when.

Markov Decision Process and its application

Markov Decision Processes (MDPs) are the decision-making models used in deep reinforcement learning. We will work with five elements: state, action, policy, reward, and discount factor. Let’s consider the following example. You are an investor and need to create an investment strategy for $10 million. Suppose that you have two options: stock A and bond B. The stock will pay you a 30% dividend ($3 million), but there is a 5% chance that company A will go bankrupt and you will lose the whole $10 million. The second option is to invest in bond B, which pays only 10% interest ($1 million), but there is only a 1% chance that company B will go bankrupt, and even if it does, you will lose only $4 million. Assuming that we can only move money between these two assets and have no other option, we have the following characteristics (a code sketch of this setup follows the list):

  • States (S): There are three states: $10 million in Stock A, $10 million in Bond B, and Bankruptcy (C)

  • Actions (a): We have three possible actions for each asset (Buy, Hold, and Sell)

  • Policy (π): Let’s consider two policies: an aggressive and a conservative strategy. The aggressive policy is to invest in the stock, meaning that if you hold the stock, you keep it, and if you hold the bond, you sell the bond and buy the stock. The conservative policy is to invest in the bond, meaning that you sell the stock and buy the bond whenever you hold the stock.

  • Reward (R): The amount of money that you earn as a result of your action

  • Discount factor (γ): Due to the time value of money, profit earned today is worth more than the same amount earned tomorrow. Suppose that the discount rate is 5.3%; the discount factor is then γ = 1/1.053 ≈ 0.95
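Under these assumptions, the whole setup fits in a few lines of Python. The encoding below is a minimal sketch: folding the bankruptcy losses into expected rewards and treating bankruptcy as an absorbing, zero-reward state are simplifying assumptions of mine, so the article’s original figures may have been computed slightly differently.

```python
import numpy as np

# States: 0 = $10M in Stock A, 1 = $10M in Bond B, 2 = Bankruptcy (absorbing).
# Under a fixed policy the MDP reduces to one transition matrix P and one
# expected-reward vector R (in $ millions).

# Aggressive policy: always hold the stock.
# With prob 0.95 collect the $3M dividend; with prob 0.05 go bankrupt (-$10M).
P_aggressive = np.array([
    [0.95, 0.00, 0.05],   # from Stock A
    [0.95, 0.00, 0.05],   # from Bond B (sell the bond, buy the stock)
    [0.00, 0.00, 1.00],   # bankruptcy is absorbing
])
R_aggressive = np.array([0.95 * 3 + 0.05 * (-10),   # = 2.35
                         0.95 * 3 + 0.05 * (-10),
                         0.0])

# Conservative policy: always hold the bond.
# With prob 0.99 collect the $1M interest; with prob 0.01 go bankrupt (-$4M).
P_conservative = np.array([
    [0.00, 0.99, 0.01],
    [0.00, 0.99, 0.01],
    [0.00, 0.00, 1.00],
])
R_conservative = np.array([0.99 * 1 + 0.01 * (-4),  # = 0.95
                           0.99 * 1 + 0.01 * (-4),
                           0.0])
```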

Now the question is: how do we evaluate which policy is better, and should we switch policies depending on the situation? To answer it, we need to compute the expected discounted reward of each policy and compare the resulting values. The value function can be described by the Bellman equation

V^π = R^π + γ · P^π · V^π

where V^π holds the value of each state under policy π, R^π the expected immediate rewards, and P^π is the state-transition probability matrix when applying policy π.
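With the matrices above, the Bellman equation for a fixed policy is just a linear system, (I − γ·P^π)·V^π = R^π, which we can solve directly. The sketch below continues the previous block; because the reward encoding there is an assumption, the exact numbers it prints (and even which policy looks better at a given γ) may differ from the original article’s figures.

```python
def policy_value(P, R, gamma):
    """Solve the Bellman equation V = R + gamma * P @ V for a fixed policy."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

for gamma in (0.95, 0.99):
    v_agg = policy_value(P_aggressive, R_aggressive, gamma)
    v_con = policy_value(P_conservative, R_conservative, gamma)
    print(f"gamma={gamma}: aggressive V={v_agg[:2]}, conservative V={v_con[:2]}")
```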

Applying the Bellman equation to our example, we obtain a value for each state under each policy:

(Value calculation: Aggressive Strategy)

(Value calculation: Conservative Strategy)

The conservative strategy gives a better result in both states. Now, suppose that the discount factor is 0.99, meaning that profit earned in the future carries more weight. Recomputing the values, the aggressive policy now works better.

How to find the Optimal Policy using Q-learning?

The Q-learning Method

However, in the real world our algorithm will not know all of these probabilities, so there is no way to simply “calculate” the optimal policy. To solve this problem, we apply a method called Q-learning. Instead of the value function, we work with the Q-value, which represents the value of taking a particular action in a particular state. We estimate a Q-value by adding the immediate expected reward to the best possible value of the onward states. In a given state, the algorithm exploits the action with the best expected Q-value. At the beginning of the training process, the expected Q-values of all possible actions are equal. After that, for every action taken, the algorithm receives a reward signal from the environment and updates the Q-values.
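Concretely, the standard tabular Q-learning update nudges the current estimate toward the observed reward plus the discounted best onward Q-value, using a small learning rate α:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

where s is the current state, a the chosen action, r the observed reward, and s′ the next state.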

To explain Q-learning, let’s imagine that you want to train your dog to sit. At first, you point your finger at the ground and your dog doesn’t know what to do. She will try several actions, such as barking, spinning, or turning around, but nothing happens: there is no reward for these actions. Only when your dog sits do you give her a cookie. Now she updates the Q-value of the sit action, in the state where you point at the ground, to be higher than that of the other actions. After the training process, your dog will sit whenever you point your finger at the ground.

The same idea applies in the financial world. Let’s consider the example above, but this time the only thing you know is your possible set of actions (buy, sell, and hold). You know nothing about the environment and need to find the best way to invest. In every state, the algorithm decides to buy, sell, or hold one of the two assets and observes the reward. After training, the algorithm can produce a policy that maximizes our goal. A minimal sketch of this training loop is shown below.
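Here is a minimal tabular Q-learning sketch for the two-asset toy problem above. The environment dynamics are the same illustrative assumptions as before (and certainly not I Know First’s actual model); note that this agent always exploits the best-looking action, which leads directly to the drawback discussed next.

```python
import random
import numpy as np

# States: 0 = holding Stock A, 1 = holding Bond B, 2 = bankrupt (terminal).
# Actions: 0 = switch to stock, 1 = switch to bond, 2 = hold.
def step(state, action):
    holding = state if action == 2 else action
    if holding == 0:   # stock: $3M dividend, 5% chance of losing $10M
        return (2, -10.0) if random.random() < 0.05 else (0, 3.0)
    return (2, -4.0) if random.random() < 0.01 else (1, 1.0)   # bond

Q = np.zeros((3, 3))        # Q[state, action]; all estimates start out equal
alpha, gamma = 0.1, 0.95    # learning rate and discount factor

for episode in range(5000):
    state = 0
    for _ in range(50):
        action = int(np.argmax(Q[state]))   # always exploit (see drawback below)
        next_state, reward = step(state, action)
        # Q-learning update: immediate reward + discounted best onward Q-value
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
        if state == 2:      # bankruptcy ends the episode
            break
```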

The Drawback of Q-learning

However, the drawback of the Q-learning method is that you may keep performing one action without ever trying the others. Suppose that you face two rows of boxes. One row holds green boxes, each containing $1; the other holds red boxes, each containing $100; but you don’t know the value inside either. Before you try, the expected value of both kinds of boxes is 0. You first open a green box and get a reward of $1. The expected reward of the red boxes is still 0, and since $1 is greater than 0, the algorithm will tell you to keep opening green boxes.

To overcome this drawback, instead of always following the action with the best expected Q-value, we sometimes perform a random action and observe the reward. Suppose that at the next step we do not open a green box as the algorithm suggests, but take a red box instead. We get a reward of $100, and the expected value of the red boxes is updated accordingly.
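In code, this fix (commonly called ε-greedy exploration) is a small change to how the action is chosen in the sketch above: with a small probability ε we pick a random action, otherwise we exploit as before.

```python
epsilon = 0.1   # exploration rate: how often to try a random action

if random.random() < epsilon:
    action = random.randrange(3)          # explore: pick any action
else:
    action = int(np.argmax(Q[state]))     # exploit: best-looking action
```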

Building a “Self-Driving Car” in the financial world

Deep reinforcement learning may seem complicated and difficult to apply. However, you can be confident that I Know First’s algorithms will do the heavy lifting for you. The AI algorithms of I Know First apply deep reinforcement learning to more than 10,000 assets across 30 markets and provide you with daily market forecasts. Thanks to this advanced technology, I Know First has helped many clients, both institutional and individual, explore the financial world. If Tesla gives us the self-driving car, I Know First gives us its counterpart for the financial market.

Conclusion

Reinforcement learning is a machine learning method in which the agent takes actions and receives reward signals. It contributes to many real-life fields, ranging from computer science to economics. Two important elements of deep reinforcement learning are the Markov Decision Process and Q-learning. I Know First is one of the first companies to apply deep reinforcement learning to daily market forecasts. Keep in mind that deep reinforcement learning is a method that helps you explore the rules of the world in general, and of the financial market in particular.

To subscribe today and receive exclusive AI-based algorithmic predictions, click here