Learning from the Master: Using ChatGPT for Reinforcement Learning – part 2

In the first part of this series, we explored the capabilities of ChatGPT, a state-of-the-art language model developed by OpenAI, in assisting data scientists with tasks such as data cleaning, preprocessing, and code generation. In this second part, we will delve deeper into what ChatGPT generated and why it didn't work. We will discuss the specific challenges that come with using AI-generated code, and how to effectively address these issues to ensure the reliability and accuracy of the final product. Whether you're a data scientist or a developer, this post will provide valuable insights into how to use ChatGPT to improve your workflow and streamline your development process.

In the first instalment of this blog series, we explored the capabilities of ChatGPT in generating boilerplate code from well-defined problem statements. We also discussed the benefits of providing extensive detail on the desired code functionality and the performance of ChatGPT when given a skeleton code to fill.

While the results were impressive and a good starting point for a data science project, it took real effort to make the code function.

In this part of the blog, I will walk you through the bugs and mistakes ChatGPT made. As for why ChatGPT made them, there are multiple reasons; I explained some of them in the first part of the series.

Materials

I had trouble figuring out the best way to present this part of the blog. ChatGPT made plenty of bugs and mistakes, so to make them easier to understand I will explain the smaller pieces (attributes, functions, variables) in more detail. This post is written for developers and assumes that the reader has a basic understanding of Python programming.

I have chosen to explain the major changes I made at the function level: this post goes through each function, explains what it is supposed to do, and describes what ChatGPT got wrong in my opinion. The actual fixes can be found in the code in the repository.

Boilerplate ChatGPT code can be found at: chatGPT_RL_blog1/Boilerplates/chatGPT at main · solita/chatGPT_RL_blog1 (github.com)

The complete finished solution can be found at: solita/chatGPT_RL_blog2: All the material discussed in the second part of the chatGPT for reinforcement learning blog series (github.com)

Fixing the environment boilerplate

Link to the full script: chatGPT_RL_blog2/environment.py at main · solita/chatGPT_RL_blog2 (github.com)

The init script

Docstub: “Initialise your state, define your action space, and state space”

My understanding of the errors that ChatGPT made:

The init for action_space that ChatGPT generated was wrong: it generates a list of all possible pairs of integers between 0 and m-1 (where m=5), including pairs where the two integers are the same, plus an additional pair [0, 0].

The agent can’t travel in a loop from state 1 to 1, i.e. the pickup location and the drop-off location can’t be the same as per the problem description. The only exception is the offline action [0, 0], when the agent chooses to go offline.

I fixed this so that the action_space init generates a list of all possible pairs of integers between 0 and m-1 where the two integers in each pair are distinct, plus the single offline action [0, 0].
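
To illustrate the idea, here is a minimal sketch of what the fixed initialisation could look like, assuming m = 5 locations and that the offline action is stored as the last entry of the list. The variable names are illustrative; the repository code is the authoritative version.

```python
from itertools import permutations

m = 5  # number of locations in the problem definition

# All (pickup, drop-off) pairs with two distinct locations,
# plus the single offline/no-ride action [0, 0] at the end.
action_space = [list(pair) for pair in permutations(range(m), 2)] + [[0, 0]]

# 5 * 4 = 20 ride actions + 1 offline action = 21 actions in total
assert len(action_space) == 21
```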

The requests function

Docstub: Determining the number of requests basis the location. Use the table specified in the MDP and complete it for the rest of the locations.

My understanding of the errors that ChatGPT made:

ChatGPT only handled pickup location 0 when m=5.

I added handling for the remaining locations, using a dictionary instead of the if-else structure ChatGPT suggested. ChatGPT also did not add the index [0] to indicate the no-ride action; the method just returned an empty list when there were no requests.
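
A sketch of that dictionary-based approach is below. The Poisson rates per location are placeholder values standing in for the MDP table, and the function signature and the convention of keeping the offline action as the last element of action_space are my own assumptions, not necessarily what the repository code uses.

```python
import numpy as np

# Average number of ride requests per pickup location (placeholder values;
# the real rates come from the MDP table in the problem statement).
REQUEST_RATES = {0: 2, 1: 12, 2: 4, 3: 7, 4: 8}

def requests(state, action_space):
    """Sample the ride requests available at the agent's current location."""
    location = state[0]
    n_requests = min(np.random.poisson(REQUEST_RATES[location]), 15)

    # Sample which ride actions are on offer (exclude the last entry,
    # which is the offline action).
    possible_idx = list(np.random.choice(len(action_space) - 1,
                                         size=n_requests, replace=False))
    possible_actions = [action_space[i] for i in possible_idx]

    # Always append the no-ride/offline action, which ChatGPT's version left out.
    possible_idx.append(len(action_space) - 1)
    possible_actions.append([0, 0])
    return possible_idx, possible_actions
```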

The reward function

Docstub: Takes in state, action and Time-matrix and returns the reward

My understanding of the errors that ChatGPT made:

  • The no-ride action is not handled correctly. The no-ride action should move the time component forward by one hour, as described in the problem definition. The function was returning reward = -C, which does not correspond to the reward calculation formula: (time * R) - (time * C), where time is the total transit time from the current location through pickup to drop-off (i.e. transitioning from the current state to the next state).
  • ChatGPT calculates the travel time from the current state to the next state and updates the location, but makes a mistake: the hour and day in a state tuple are supposed to be integers.

ChatGPT’s way of calculating the time it takes to transition (for the taxi to drive) from the current state to the next state returns arrays for h and d instead of integers.

This is because ChatGPT slices the 4D Time Matrix in the wrong manner, using two sets of indices, pickup and drop-off, to slice the array.

  • Scalar indices are needed to slice the array correctly. I broke the total transition time calculation into multiple steps for clarity (see the sketch below).
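
Below is a minimal sketch of the idea, assuming a 4D Time Matrix indexed as [from][to][hour][day]. The function names and the reward constants R and C are illustrative, and the offline action needs the separate handling described above; see the repository for the actual implementation.

```python
def total_transit_time(state, action, time_matrix):
    """Total time from the current location through pickup to drop-off,
    with each Time Matrix lookup done using scalar indices."""
    cur_loc, hour, day = state          # hour and day are plain integers
    pickup, drop = action

    # Leg 1: current location -> pickup point.
    t1 = int(time_matrix[cur_loc][pickup][hour][day])

    # Advance the clock by t1 before looking up the second leg.
    hour2 = (hour + t1) % 24
    day2 = (day + (hour + t1) // 24) % 7

    # Leg 2: pickup point -> drop-off point.
    t2 = int(time_matrix[pickup][drop][hour2][day2])
    return t1 + t2

def reward_func(state, action, time_matrix, R=9, C=5):
    """Reward for a ride action using the formula above; R and C are the
    per-hour revenue and cost rates (values here are illustrative).
    The offline action [0, 0] needs separate handling, as discussed."""
    time = total_transit_time(state, action, time_matrix)
    return (time * R) - (time * C)
```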

The next state function

Docstub: Takes state and action as input and returns next state

My understanding of the errors that ChatGPT made:

  • ChatGPT makes the same mistakes here as in the reward function: the hour and day in a state tuple should be integers, but its transition time calculation slices the 4D Time Matrix with two sets of indices (pickup and drop-off) and therefore returns arrays for h and d. The fix is the same: use scalar indices and break the total transition time calculation into multiple steps, as shown above.

As a side note, there is a repeated-code problem in these functions since they perform the same lookup; it should be refactored into a shared helper, but I’ll leave that to the next part of the blog.

What was missing? A function to do a step:

In reinforcement learning (RL), a “step” refers to the process of selecting an action based on the agent’s predictions and transitioning from the current state to the next state. Taking a step typically includes the following:

  1. The agent observes the current state of the environment
  2. The agent selects an action based on its policy and the current state
  3. The agent takes the selected action, which causes the environment to transition to a new state
  4. The agent receives a reward signal, which indicates how well the selected action led to the new state
  5. The agent updates its policy based on the received reward and the new state

At each step, the agent uses its current policy, which is a function that takes the current state as input and produces an action as output, to select the next action. The agent also uses the rewards obtained from the environment to update its policy so as to maximize the cumulative rewards.

Looking at the code that ChatGPT generated, all of the pieces were there. The step function I implemented is just a wrapper that uses the reward and next-state functions. Look at the solution in the repository for details.
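
Below is a minimal sketch of such a wrapper, assuming the fixed reward and next-state functions are called reward_func and next_state_func; the repository version may differ in its exact signature and return values.

```python
def step(state, action, time_matrix):
    """One environment step: compute the reward for the chosen action and
    move the environment to the next state."""
    reward = reward_func(state, action, time_matrix)
    next_state = next_state_func(state, action, time_matrix)
    return next_state, reward
```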

Fixing the Q-Learning agent boilerplate

Link to the full script: chatGPT_RL_blog2/agent_chatGPT.py at main · solita/chatGPT_RL_blog2 (github.com)

The init script

Docstub: “Initialise the DQNAgent class and parameters.”

My understanding of the errors that ChatGPT made:

“Everything was initialized properly; only the variable for the initial exploration rate epsilon was missing, so I added that.”

The build_model method

Docstub: “Build the neural network model for the DQN.”

ChatGPT didn’t make any mistakes here: it builds a simple feed-forward neural network, and the input and output sizes are defined correctly.
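
For reference, a model along these lines could look like the sketch below, assuming Keras; the layer sizes and learning rate are illustrative, not necessarily the ones ChatGPT chose.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_size, action_size, learning_rate=0.001):
    """Simple feed-forward network mapping an encoded state to one Q-value per action."""
    model = Sequential([
        Dense(32, activation="relu", input_dim=state_size),
        Dense(32, activation="relu"),
        Dense(action_size, activation="linear"),   # one output per action
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model
```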

The get_action method

Docstub: “get action from the model using epsilon-greedy policy”

My understanding of the errors that ChatGPT made:

“Transferred the epsilon decay method to the notebook. The ChatGPT-generated function only chooses either a random action or the action with the highest predicted Q-value; it should also consider which actions are actually available in the current state. Additionally, the function only decreases epsilon after each episode, while it should decrease epsilon after each sample. I don’t want to pass the environment class as a parameter just to access the env.requests() function, so we’ll pass the possible action indices and actions instead and rewrite this function.”
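
A sketch of the rewritten selection logic is below, assuming the state is already encoded as a vector and that epsilon is passed in (and decayed per sample) from the notebook; the function name and parameters are illustrative.

```python
import numpy as np

def get_action(model, state, possible_action_indices, epsilon):
    """Epsilon-greedy action selection restricted to the actions that are
    actually available in the current state."""
    if np.random.rand() <= epsilon:
        # Explore: pick one of the currently possible actions at random.
        return int(np.random.choice(possible_action_indices))

    # Exploit: predict Q-values for all actions, but only compare the possible ones.
    q_values = model.predict(np.array([state]), verbose=0)[0]
    return max(possible_action_indices, key=lambda idx: q_values[idx])
```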

The train_model method

Docstub: “Function to train the model on each step run. Picks the random memory events according to batch size and runs it through the network to train it.”

My understanding of the errors that ChatGPT made:

“This boilerplate from ChatGPT won’t quite do. It updates the Q-values one sample at a time instead of using a batch sampled from the memory. Using a batch speeds up training and stabilizes the model.”
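
Below is a minimal sketch of the batched update, assuming each memory entry is an (encoded state, action index, reward, encoded next state) tuple; the batch size, discount factor, and handling of terminal states are illustrative assumptions rather than the repository’s exact choices.

```python
import random
import numpy as np

def train_model(model, memory, batch_size=32, gamma=0.95):
    """Update the network on a random batch sampled from the replay memory
    instead of one transition at a time."""
    if len(memory) < batch_size:
        return

    batch = random.sample(memory, batch_size)
    states = np.array([s for s, a, r, s_next in batch])
    next_states = np.array([s_next for s, a, r, s_next in batch])

    # Predict current and next-state Q-values for the whole batch in one call.
    q_current = model.predict(states, verbose=0)
    q_next = model.predict(next_states, verbose=0)

    # Q-learning target: reward + gamma * max_a' Q(s', a').
    for i, (state, action, reward, next_state) in enumerate(batch):
        q_current[i][action] = reward + gamma * np.max(q_next[i])

    # One gradient update on the whole batch.
    model.fit(states, q_current, batch_size=batch_size, epochs=1, verbose=0)
```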

Summary

Overall, going through the boilerplate code that ChatGPT generated and fixing it took around 10 hours. As a comparison, when I originally solved this coding problem and built a solution with the help of Google, it took around 30 hours. The boilerplate provided as input had a clear impact on the solution, both for ChatGPT and for me.

In the next part of the blog series, I’ll ask ChatGPT to propose optimizations to the fixed code and see if it makes or breaks it. My original solution will be available for comparison.