DEPARTMENT OF APPLIED IT MIXED MEMORY Q-LEARNER An adaptive reinforcement learning algorithm for the Iterated Prisoner’s Dilemma Anna Dollbo Thesis: 15 hp Program: Cognitive Science Level: First Cycle Year: 2021 Supervisor: Robert Lowe Examiner: Alberto Montebelli Report nr: 2021:081 Abstract The success of future societies is likely to depend on cooperative interactions between humans and artificial agents. As such, it is important to investigate how machines can learn to cooperate. By looking at how machines handle complex social situations, so-called social dilemmas, knowledge about the components necessary for cooperation in artificial agents can be acquired. In this study, a reinforcement learning algorithm was used to study the Iterated Prisoner’s Dilemma (IPD), a common social dilemma game. A reinforcement learning algorithm can make decisions in the IPD by considering a given number of its opponent’s last actions, thus representing the agent’s memory. This study investigated the role of different memory lengths on the performance of the agent in the IPD. The results showed that different memory lengths are preferable depending on the opponent. A new algorithm was created called Mixed Memory Q-Learner (MMQL), which could switch memory length during play to adapt to its opponent. It could also recognise its opponent between games, thus continuing its learning over several interactions. MMQL performed better against certain opponents in the IPD but did not learn to cooperate with cooperative players. Further capabilities might therefore be added to the algorithm to invite cooperation, or the environment can be manipulated. The results suggest that flexibility in how a situation is represented and the ability to recognise opponents are important capabilities for artificial agents in social dilemmas. Keywords Machine learning, reinforcement learning, game theory, iterated prisoner’s dilemma, state representation, Q-learning. Sammanfattning Framgången för framtida samhällen kommer säkerligen bero på lyckade sammarbeten mellan människor och artificiella agenter. Därför är det viktigt att undersöka hur maskiner kan lära sig att samarbeta. Genom att studera hur maskiner hanterar komplexa sociala situationer, så kallade sociala dilemman, kan vi få kunskap om vilka komponenter hos artificiella agenter som krävs för att samarbete ska uppstå. I denna studie användes en förstärkningsinlärnings-algoritm i den repeterade versionen av spelet Fångarnas Dilemma. En förstärkningsinlärnings-algoritm tar i spelet beslut baserat på ett givet nummer av sin motståndares senaste drag. Detta fungerar således som minnet för agenten. En studie gjordes som undersökte betydelsen av olika längd på minne för utfall i spelet. Resultaten visade att olika minneslängd är att föredra beroende på motståndarens strategi. Med denna information skapades en ny algorithm kallad Mixed Memory Q-Learner (MMQL). Denna kunde byta minneslängd under spelets gång för att anpassa sig till sin motståndare. Den kunde också minnas och känna igen sin motståndare från tidigare matcher, och fortsätta lära sig där den sist slutade. Resultaten visade att MMQL presterade bättre mot vissa motståndare, men inte lärde sig att samarbeta med samarbetsvilliga motspelare. För att framkalla samarbete kan därför fler förmågor behöva läggas till hos algoritmen, eller så kan miljön manipuleras. Resultaten tyder på att flexibilitet i hur man representerar en situation, och förmåga att minnas sina motspelare är viktiga kompetenser för artificiella agenter i sociala dilemman. Nyckelord Maskininlärning, förstärkningsinlärning, spelteori, fångarnas dilemma, situations-representation, Q-inlärning. Foreword Acknowledgements I would like to express my gratitude to: - my supervisor Robert Lowe for challenging me to develop my ideas further, for his continued correspondence throughout the project and for sparking my interest in Artificial Intelligence. - all teachers at the Cognitive Science program for providing a multifaceted, thorough and deeply enriching education. - my partner Çağın for his loving support and patience with me throughout the project. Table of contents 1. Introduction 1 2. Theory 2 2.1 Game Theory and the Prisoner’s Dilemma 2 2.1.1. Game theory 2 2.1.2. The Prisoner’s Dilemma 3 2.1.3. The Iterated Prisoner’s Dilemma 4 2.1.4. The Axelrod Tournaments 6 2.2. Reinforcement Learning 7 2.2.1. Machine Learning 7 2.2.2. Reinforcement learning 7 2.2.2.2. Elements of reinforcement learning 9 2.2.2.3. Exploration vs. exploitation 9 2.2.3. Q-learning 10 3. Previous studies 12 3.1. Q-learning in the IPD 12 3.2. Memory in the IPD 13 4. Method 14 4.1. Design 14 4.2. Materials 14 4.2.1. Axelrod Python Library 14 4.2.2. Strategies 14 4.2.3. Algorithm choice and adaptations 15 4.3. Evaluation Metrics 16 4.4. Test 1 17 4.4.1 Results 17 4.4.2. Interpretation 19 4.4.3. Extension of Test 1 19 5. Mixed Memory Q-Learner 21 5.1. Algorithm description 21 5.1.1. Concept 21 5.1.2. Architecture 22 5.1.3. Limitations 23 5.1.4. Performance prediction 24 5.2. Test 2 24 5.2.1. Design 24 5.2.2. Parameter settings 24 5.2.3. Additional evaluation methods 24 6. Results 25 7. Discussion 32 7.1. Results analysis 32 7.2. MMQL analysis 33 7.3. Limitations 34 7.4. Future research 35 8. Conclusion 35 9. References 36 Acronyms AI: Artificial Intelligence APL: Axelrod Python Library IPD: Iterated Prisoner’s Dilemma MARL: Multi-Agent Reinforcement Learning MDP: Markov Decision Process MMQL: Mixed Memory Q-Learner ML: Machine Learning MOML: Multi-Objective Meta Learning MORL: Multi-Objective Reinforcement Learning PD: Prisoner’s Dilemma RL: Reinforcement Learning RMSE: Root Mean Square Error TD: Temporal Difference TFT: Tit For Tat 1. Introduction In recent years, research within artificial intelligence (AI) has begun investigating how to teach artificial agents to cooperate. This is deemed an important research area as future societies are likely to require cooperation between artificial systems and artificial systems and humans (Bertino et al., 2020). One way of studying cooperation is by looking at how cooperation emerges from certain tricky situations, so-called social dilemmas. Such dilemmas reveal tensions between collective and individual rationality since cooperation might lead to better outcomes for the group, but freeriding or exploitation can give higher rewards for the individual (Rapoport, 1974). Therefore, studying how AI behaves in such situations is an important measure in ensuring a safe implementation of AI systems into real-world decision-making. By framing the social dilemma as a game, a game-theoretic framework can be used. One of the most common such games is called the Prisoner’s Dilemma (PD). In this game, two players who cannot communicate with each other have to decide whether to cooperate or defect against the other player. If one player chooses to cooperate, but the other does not, the first player receives zero points, and the one who betrayed gets a high payoff. If neither cooperates, they both receive a small payoff, but they both receive a medium high payoff if both cooperate. In this way, choosing to defect against the other player might lead to the highest payoff, but choosing to cooperate is the best choice for both players in the long run. The best strategy in the one-shot PD is always to defect (Axelrod, 1984). However, this is not true in the so-called Iterated Prisoner’s Dilemma (IPD), where the game is extended to include multiple interactions with the same agent. The IPD is therefore used to study the emergence of cooperation between different agents (Glynatsi, 2020). One way of studying AI and cooperation is by testing different machine learning (ML) techniques on games like the IPD to see how these agents learn. Games provide an excellent platform for developing AI as they present complex and interesting problems and realise some of the long term goals of AI, such as creativity, affective interaction and general intelligence (Yannakakis & Togelius, 2018). One ML technique used to study the IPD is called reinforcement learning (RL). In RL, an agent learns to complete a task solely based on the rewards and punishments it receives from the environment (Sutton & Barto, 2018). RL solutions to the IPD are relatively few in the literature (Anastassacos & Musolesi, 2018). However, previous studies on AI in games have uncovered many difficulties and best practices in teaching an agent to learn in an unfamiliar environment. Understanding how these solutions work specifically in non-cooperative games with an asymmetrical payoff matrix is important for AI and cooperation research. In RL, besides the rewards or punishments received when taking actions in the game, the agent also makes decisions depending on the information available to them in each situation. Therefore, the state-representation (the way a situation is represented at a given time point) of the RL problem can significantly affect how the agent learns to play the game (Yannakakis & 1 Togelius, 2018). In the IPD, the state representation for an RL algorithm is often some number of the past actions taken by the opponent in the game (Sandholm & Crites, 1996b). This can be likened to the agent’s memory, and the agent can thus have a shorter memory (only remember the past action) or a longer memory (for example, the last five actions taken). Previous research on the IPD often uses tournaments to see which strategies are best in the game (Glynatsi, 2020). The first such tournaments showed that only considering the past action can be sufficient to win the tournament (Axelrod, 1984). However, since then, others have argued that a longer memory is more evolutionarily stable in the game (Li & Kendall, 2014). A study was conducted to answer the research question: how does different memory lengths affect the performance of an RL algorithm in the IPD? The research was done to inspire a new RL approach to the IPD. 2. Theory 2.1. Game Theory and the Prisoner’s Dilemma 2.1.1. Game theory Game theory is a field that analyses situations of conflict, cooperation and competition by mathematical tools and logic (Glynatsi, 2020). Interactions between self-interested agents (so-called games) are analysed where the outcome of the interactions depends on what the agents (players) think will be the best action for them to take depending on the action they expect the other players will take. The field was first formulated in 1944 by John von Neuman and Oskar Morgenstern and has since been widely used in economics, social sciences and biology (Ross, 2019). A game-theoretic analysis of an interaction asks who the players are, which actions are available to those players, what information different players have when they choose an action and what the payoffs to the different players are as a consequence of the interaction? This frame is then used to reason about the optimal strategy for each player in the game (Jackson, 2011). Game theory makes two critical assumptions about players in a game: they act to maximise their utility and are perfectly rational agents. Game theoretical analysis supposes that an economic agent has preferences, a concept called utility in game theory. The goal of an agent in a game then is to maximise that utility. In its simplest form, that an agent maximises their utility means that an agent acts in a certain way so as to make the something that they tried to bring about by their act more probable (Samuelson, 1938). An agent’s utility then can be understood as whichever states of the world that an agent prefers (which can include pure self-interested states or states that also benefit others) and the agent’s willingness to act to bring about these states (Leyton-Brown, 2008). Game theory also supposes that players in a game are perfectly rational agents (Poundstone, 1993). Rationality here refers to an agent's ability to process information. The agent’s expectations of other players’ behaviours in the game are rational as they are supposedly derived from a correctly weighted and statistically 2 sound use of all the information accessible to the agent (Ross, 2019). This conception of rationality has been challenged by Herbert Simon as unrealistic, who argues that rationality is bounded (Simon, 1982). This means that there are constraints to the information, time and processing capacity available to any agent. Still, the notion of perfect rationality is an essential assumption in game theory in order to find how such agents would act if they had perfect knowledge, memory and understanding of the game (Poundstone, 1993). A standard representation of a game is called a normal form game (Jackson, 2011). Normal form games are usually represented via a table or matrix. Each row corresponds to one possible action for player 1, and each column represents a possible action for player 2. Each of the cells contains the payoff or outcome to both players as a consequence of taking those actions. Player 1’s payoff is listed first. In the next section, an example of such a game matrix can be seen which describes the PD. 2.1.2. The Prisoner’s Dilemma PD is one of the most famous games in game theory and an example of a normal form game. It was created in 1950 by RAND scientists Merrill Flood and Melvin Dresher, and was later named “the Prisoner’s Dilemma” by their consultant Albert.W. Tucker for the story he told to illustrate it (Poundstone, 1993). The story goes that two people (person A and person B) have been arrested on suspicion of jointly having committed a crime. They are taken to the police station and put in separate rooms with no means of speaking or exchanging messages with each other. The prosecutor comes to each of them and admits that they do not have enough evidence to convict both of them on the principal charge. However, the prosecutor gives both of them a choice: if person A confesses to the crime and testifies against person B in custody, person A will go free. However, if person A does not confess, but person B confesses and testifies against person A, person A will be sentenced to three years in prison. If person A and person B confess to the crime, they will both go to prison for two years. But if none of them confesses, the prosecutor will still charge them for a lighter crime for which there is evidence, and they will both receive one year in prison (Jackson, 2011; Poundstone, 1993). To cooperate in the PD means to stay silent in the hope that the other player does the same so that both players will get a lighter sentence. To defect refers to confessing against the other player, which can lead to either escaping punishment altogether, or if the other player does the same, going to prison for two years. The game matrix for the PD can be seen in Table 1, where the payoffs represent the time lost to the player while in prison. 3 Table 1 Prisoner’s Dilemma game matrix Prisoner B Stays Silent (cooperates) Betrays (defects) Prisoner A Stays silent (cooperates) Each serves one year Prisoner A: gets three years Prisoner B: goes free Betrays (defects) Prisoner A: goes free Prisoner B: gets three years Each serves two years The PD is a non-cooperative, non-zero-sum game. In a non-zero-sum game one person’s loss is not necessarily another person’s gain and vice versa. This can be seen in Table 1’s payoff matrix because when both players defect or both players cooperate, the payoff is not zero. The PD is a non-cooperative game because the players cannot make any pact with each other about how they should play. This makes the PD a dilemma because if it was a cooperative game, the players could agree on which strategies to play and wouldn’t have to guess the other player’s chosen strategy (Rapoport & Chammah, 1965). The PD is considered the most challenging dilemma for cooperation to arise (Rand & Nowak, 2013). This is because the optimal strategy for the PD is to defect. This is easy to prove: if you are player A and think that player B will cooperate, then the best action to choose is to defect since this means that you will go free. However, if you believe player B will defect, then your best choice is also to defect because if you cooperate, you will get three years in prison, but if you defect, that sentence will be reduced by one year. No matter what you think the other player will do, the best action for player A is always to defect. Defection, therefore, is a dominant strategy of the PD, even though cooperation would give a higher cumulative reward for both players (Axelrod, 1984). The PD can be thought of as an abstract formulation of many situations in life where the best outcome for an individual rational agent causes mutual defection when everyone involved would instead benefit from cooperation. 2.1.3. The Iterated Prisoner’s Dilemma The IPD is an extension of the PD where several PD games are repeated (iterated) after one another. The IPD is considered a more realistic implementation of a social dilemma from a human perspective since people often have continuous interactions. When meeting the same people continuously, agents come to know each other and can remember the other agents’ previous actions. This way, a person can come to label another person as either “predictable”, “selfish”, or “cooperative”, for example. Thus, the possibility of meeting again enables the 4 emergence of cooperation since even a selfish player might want to cooperate in the present in order to receive a higher payoff in the future (Axelrod, 1984). In the IPD, the structure of the traditional PD can be generalised from its original setting to make it more suitable for general-purpose testing. This is done by extrapolating the payoffs from punishments in prison years to rewards while still maintaining the symmetrical relationship. The payoff matrix for the IPD can be seen in Table 2 where R refers to the Reward for mutual cooperation, P stands for the Punishment for mutual defection, T refers to the Temptation to defect, and S stands for Sucker’s payoff. Table 2 Prisoner’s Dilemma game matrix with numerical payoffs Player 2 C D Player 1 C R = 3, R = 3 S = 0, T = 5 D T = 5, S = 0 P = 1, P = 1 The PD assumes that two players who cooperate will fare better than two players who defect. For this condition to hold, the payoffs are constrained as seen in (1). T > R > P > S (1) The Temptation must be the highest reward, followed by the Reward, then Punishment and lastly, the Sucker’s payoff. To make this payoff matrix a social dilemma, another rule must hold, as seen in (2). 2R > T + S (2) This rule ensures that two Rewards have a higher payoff than Temptation and Sucker’s payoff combined. Applying this condition to the game ensures that even if two players are alternating between cooperation (C) and defection (D) between each round, this behaviour will not give a higher reward than choosing to cooperate on both rounds (Glynatsi, 2020). 5 In the IPD, there is no dominant strategy, and the choice of strategy can in many cases be a non-trivial problem. This is because optimal play in the IPD depends on the strategy played by the opponent (Axelrod, 1984). Therefore, being able to expect the opponent’s behaviour is a useful skill for any intelligent agent in the IPD. 2.1.4. The Axelrod Tournaments Even though research on the PD had been plentiful since the 1950s, it was political scientist Robert Axelrod who popularised the game in 1980 by creating computer tournaments of the IPD as a way of studying conflict. Axelrod (1980) wanted to find the best way to play the game and asked researchers from different fields to submit strategies that were to compete in a round-robin tournament. 13 strategies were submitted that all competed against the other strategies, themselves and a strategy called Random that defects or cooperates with a 50% probability. The tournament consisted of 200 rounds repeated five times to get a balanced estimate of the scores. The strategy TitForTat (TFT) was the strategy with the highest average score and thus the winner of the tournament. This was a surprising result since TFT was the simplest strategy presented, which cooperated on the first turn and then mimicked the opponent’s behaviour for the rest of the game. This meant that TFT could never beat any of its competitors but could end up even many times (Axelrod, 1984). To confirm the results, Axelrod ran a second tournament which consisted of 64 entries. This time also, TFT came out as the winner despite being surrounded by many cleverly designed and complex strategies. Axelrod analysed the strategy and concluded that the success of TFT was due to four properties, which he shaped into four suggestions on how to play the game (Axelrod, 1984): ● Don't be envious: don’t strive for a larger payoff than your opponent’s. ● Be nice: don’t be the first to defect. ● Reciprocate both cooperation and defection: be sensitive both to retaliation and apologies. ● Be clear: be predictable so the other player can learn to trust you. 6 2.2. Reinforcement Learning 2.2.1. Machine Learning ML is a subfield of (AI) consisting of techniques for making a machine construct a program from example data that can learn and improve automatically over time (Pumperla & Ferguson, 2019). AI, in turn, is a science and engineering practice in which different techniques are used to make a computer mimic human behaviour and cognition. In traditional computer programming, clear rules and instructions are written in a program that specifies how the computer should manipulate data and what the output should be. In ML, however, the computer is fed data and an indication of the expected outcome but is not given any instructions on how to create the output. Instead, the computer learns to match and manipulate the data to the expected outcome using an algorithm (Pumperla & Ferguson, 2019). An algorithm is a step by step procedure on how to solve a problem or reach a certain goal, which often includes repetitions of actions until that goal is reached (Merriam-Webster, n.d.). Algorithms are used both by traditional programming and ML but are in ML combined with an element of automaticity. Today, ML techniques and traditional programming are often combined, leveraging the interpretability and predictability of traditional programming with the automaticity and flexibility of ML methods (Pumperla & Ferguson, 2019). 2.2.2. Reinforcement learning RL is a class of ML- techniques initially inspired by biological learning systems where a computer agent learns from interaction with its environment. Unlike supervised learning, where the machine learns through being given explicit goals and feedback on their performance to these goals, RL is based on a reward signal provided to the agent as a response to its previous action. The reward signal the agent is given is numerical, and the agent tries to maximise this reward signal while interacting with its environment. This means that there is no prior knowledge of what constitutes good behaviour in a specific situation. Thus, the agent has to try different actions to learn which ones get the highest reward for each situation it experiences. This trial and error procedure and the fact that the agent receives the feedback signal (the reward) after its action are the two features which distinguish RL from other ML approaches (Sutton & Barto, 2018). RL is considered both a problem and a class of methods used to solve that problem. 2.2.2.1. Agent and Environment The two primary entities in an RL problem are the agent and the environment. These two entities communicate through actions, rewards, and observations (Lapan, 2018). An illustration of the interaction between agent and environment in RL can be seen in Figure 1. An agent is a goal-directed person or thing that interacts with its environment. The agent can sense the environment (make observations) and can execute actions that influence the environment. After each action the agent performs in the environment, it receives a reward as feedback on its previous action. This reward helps the agent learn about the environment and 7 how its actions influence it. Thus, the agent acts in the environment despite not knowing how that environment looks like or will respond to its actions. The agent’s primary goal is to maximise the reward it receives from the environment over time (Sutton & Barto, 2018). The environment is everything that exists outside of the agent. The agent only knows and come to know the environment through its interactions with it. Therefore, its knowledge of the environment will always be limited to what it learns through the rewards it receives, the actions it takes, and the observations it makes. Observations of the environment include any information besides the rewards that an agent gains from the environment. It is different from the state of the environment, which is the information of how the environment really is at that moment and which changes as the agent interacts with it. Observations are thus the information of the environment’s state that is available to the agent (Lapan, 2018). Figure 1 Illustration of the interaction between agent and the environment Note. Adapted from “A study of Q-learning considering negative rewards”, by Fuchida, T., Aung, K.T., & Sakuragi, A, 2010, Artificial Life and Robotics, 15, p. 352. Copyright 2010 by ISAROB. 8 2.2.2.2. Elements of reinforcement learning Sutton & Barto (2018) identifies some essential subelements of RL systems beyond the agent and the environment. These are policy, reward signal, and value function. A policy can be conceived of as the rulebook for the agent on how to behave at a specific time. This rulebook is updated as the agent learns, describing which action the agent should take in a certain state. Thus, it is akin to a set of stimulus-response associations in psychology (Sutton & Barto, 2018). Most policies are probabilistic, and will only tell you that a certain action is more probable to lead to a higher reward than another in a certain state. Hence, a policy is correctly described as a probability distribution over actions for every possible state (Lapan, 2018). The policy is the most critical part of an RL agent as it is the sole component that guides the agent’s actions (Sutton & Barto, 2018). The agent’s policy is obtained by framing an RL task as a Markov Decision Processes (MDPs), a framework from probability theory used to model sequential decision problems (Seiffertt et al., 2009). A detailed description of MDPs can be found in Appendix B. The reward signal is the most important information received by an RL agent, as the agent’s objective is to achieve the largest accumulated reward over time. The reward is a single scalar value that is received from the environment. The agent obtains the reward on a fixed schedule, which could be at each timestep or perhaps only once in its lifetime. Most commonly, a reward is received after each interaction with the environment (Lapan, 2018). The reward teaches the agent what actions are good and what actions are bad to perform in the environment. An action that leads to a less desirable state of the environment (from the agent’s point of view) would receive a negative or low reward. An action that leads to a preferable state of the environment would receive a positive or high reward. The value of the reward signal is informative of how well the policy is working and is thus the primary motivation for a policy update (Sutton & Barto, 2018). A value function informs the agent what is good for the agent long term (Sutton & Barto, 2018). A value function attaches a value to each state, which describes the total amount of reward the agent can expect to accrue in the future from the outset of that state. Thus, a value function is trying to calculate how much reward will be received from each of the consecutive states that are likely to follow the present state. Establishing the value of a state is a non-trivial but critical problem, as it is the value of the state which determines the best action to take in that state. Unlike rewards, the value of a state needs to be constantly re-evaluated and updated as the agent explores and learns from its environment. Finding a good method for state evaluation is one of the main challenges of RL. 2.2.2.3. Exploration vs. exploitation Another challenge in RL is managing the balance between exploration and exploitation (Lapan, 2018). The agent’s job is to maximise its accumulated reward over time, and so it should prefer to perform the actions that in the past led to high reward (exploitation). However, the agent doesn’t know the reward of these actions and must try actions randomly 9 first (exploration). If the agent only exploits previous efforts, it can get stuck in a local minimum; a suboptimal solution. However, if the agent only explores, it is unlikely that the agent will fulfil its task. A policy that only chooses the actions with the highest expected rewards is called a greedy policy (Kaelbling et al., 1996). Sometimes the exploitative policy is combined with a parameter epsilon ϵ, which manages the probability of taking a random action other than the one suggested by the greedy policy. This parameter ensures that the agent doesn’t get stuck following a policy that might have learned faulty action-state values prematurely in the task by providing new learning experiences. This parameter is often set high to begin with, which leads the agent to explore early in the task, and then slowly decreases so that the last part of the game is played with a greedy policy. The epsilon-greedy policy is one way of managing the exploration vs exploitation problem, but the problem is still an unsolved dilemma in RL (Sutton & Barto, 2018). 2.2.3. Q-learning Q-learning is arguably one of the most important algorithms in RL history (Sutton & Barto, 2018). It was created by Christopher Watkins (1989) and is a model-free approach to RL, meaning it has no preconceptions about how the environment works. Q-learning works by incrementally improving its evaluations of the merits of taking a certain action at a certain state (Watkins & Dayan, 1992). To learn an optimal policy for a problem, simply having information of the value of a state, V(s), is not informative of which action to take in that state. Q-learning introduces another value: Q-value, Q(s,a), which is the value of taking an action a in state s (a state-action value). The algorithm calculates the expected return for each action a that can be taken in state s. The value of the state can be reduced to the highest Q-value, as V(s) is the highest expected return that you can get from a state (Lapan, 2018). If Q-values can be calculated for each state then deciding on an optimal policy becomes a trivial problem (Watkins & Dayan, 1992). Q-learning uses a tabular approach, meaning it stores the Q-values in look-up tables that are updated and added to as the agent visits more states. This approach is ineffective when the state-space is very large, as the look-up tables require a large memory space. Therefore, function approximators, usually neural networks, are the preferred method for more complex problems (Yannakakis & Togelius, 2018). In Q-learning, the agent learns by using temporal difference (TD) learning. This means that the agent learns without a model of the environment and without waiting until the episode is over to discover the correct values for each state. Instead, the agent learns and behaves based on its own estimations of the state-values during run-time. At any time t, the agent performs an action a in state s, which takes it to a new state s’ at time t+1. Once it has arrived in the new state s’, it immediately receives the reward r of that state. Since the action the agent took was linked to a value estimation of the new state, the agent now has access to the actual value of that state. This means that the agent can now update the Q-value of state s in which the agent started. This is TD learning: the estimate of what is the best action to take in the next state is used to update the estimate in the current state. This way of incrementally updating the value of a state based on estimates is called bootstrapping (Sutton & Barto, 2018). 10 Q-learning is a so-called off-policy method, which means that even though the learned Q-values are used to update the policy, the agent doesn’t necessarily use this policy to decide which action to take (Sutton & Barto, 2018). Q-learning uses the estimate of the best action (the highest Q-value) to update the policy but can then select an action independent of the policy (Kaisers, 2012). This exploration of actions that diverge from the policy suggestions is often reduced over time so that by the end, Q-learning follows the optimal policy that it has been updating. Since Q-learning is still updating its Q-values based on the best actions in each state the algorithm is still able to converge to an optimal policy (Watkins & Dayan, 1992). The Q-value function that the agent uses to update its Q-values can be seen in (4). (4)𝑄(𝑆𝑡, 𝐴𝑡) ← 𝑄(𝑆𝑡, 𝐴𝑡) + α 𝑅𝑡 + 1 + γ𝑚𝑎𝑥𝑄(𝑆𝑡 + 1, 𝑎) − 𝑄(𝑆𝑡, 𝐴𝑡)[ ] This shows how the Q-value of taking action A in state S at time t is equal to: the Q-value for that state plus the reward received at the next state added together with the discounted highest Q-value for the new state minus the current Q-value, with the last three terms being weighted by the learning rate ⍺ (Sutton & Barto, 2018). The learning rate decides how much the agent should update the Q-values based on what it has learnt before. With a learning rate close to 0 the Q-values are updated slowly whereas a learning rate close to 1 means the Q-values are updated significantly based on the current sample (Lapan, 2018). The discount rate γ determines how much the agent should consider future rewards. With a high discount rate, the agent considers future rewards to be important, whereas a low discount rate means the agent mainly cares about short term rewards (Sutton & Barto, 2018). Q-learning is a flexible algorithm as it doesn’t need a model of its environment and is bound to converge (find the optimal policy) if it gets enough samples of its environment. This means Q-learning can easily be applied to stochastic games, where the reward received in each state is conditioned on the actions chosen by other players (Kaisers, 2012). A formal description of Q-learning can be seen in Algorithm 1. 11 Algorithm 1 Q-learning For each episode: 1. Initialise all states S 2. Loop for each time step of episode: - Choose action a from state s using policy derived from Q - Take action a, observe reward r and new state s’ - Update Q-value of previous state - State s’ is now the current state s 3. Repeat until terminal state 3. Previous studies 3.1. Q-learning in the IPD Q-learning has historically been used in single-agent use-cases where it has been shown to perform well in zero-sum games (games with a clear winner and loser). However, research using Q-learning and other RL approaches to investigate non-zero-sum games have been relatively few (Anastassacos & Musolesi, 2018). One early study that looked at the performance of Q-learning in the IPD was done by Sandholm & Crites in 1996. In this study, a Q-learning algorithm was pitched against TFT and itself. The Q-learning algorithms varied on three dimensions: the length of memory used as state representation, how they stored Q-values, and their exploration schemes (Sandholm & Crites, 1996a). The results showed that the tabular approach could learn optimal play against TFT much quicker than the neural network approach. When pitched against itself, a tabular approach gained a higher score than the neural network one. The results also showed that exploration was a crucial factor for not being exploited by other exploring strategies. A longer time of exploration in each game against two learning agents lead to fewer cases of consecutive defections and simultaneously fewer iterations of sequential cooperations. The mean payoff per game also decreased monotonically with longer exploration schedules. One study of the IPD used a decentralised probing method with Q-learning to gain insight into how other learning agents' strategies change as the game goes along (Anastassacos & Musolesi, 2018). This enabled the agent to take actions based on its understanding of the changes in its opponents’ behaviour. The results showed that the modified Q-learning algorithm could learn to cooperate against itself, thus avoiding the otherwise strong tendency for two Q-learning agents to fall into loops of defection. As the algorithm was combined with a function approximator, training was first needed to initialise the neural network. The study 12 showed how modifying the environment during training increased cooperation for Q-learning in the IPD. By creating an environment with different probabilities of meeting opponents and then introducing cooperation-maintaining strategies into this environment (such as TFT), the agents' cooperation rate rose, demonstrating how norms provided by other players can influence the behaviour of the algorithm. 3.2. Memory in the IPD Sandholm & Crites (1996a) tested the impact of different memory lengths in the IPD for Q-learning algorithms when playing against themselves. Learning against other learning agents is called Multi-Agent Reinforcement Learning (MARL) and is a more complex problem than single-agent RL. Throughout the game or task, the agents’ behaviours change as they learn, as does the environment, making it difficult to determine the value of a state since the reward received for one state-action pair might not be the same the second time tried (Hernandez-Leal et al., 2017). This is often referred to as the non-stationarity problem of the environment in MARL (Nguyen et al., 2020). The algorithms in the Sandholm & Crites study (1996a) were based on lookup tables and varied by three different lengths: the last one, two, or three of the opponents’ actions. The results showed that for memory length three, longer and more varied patterns of play developed. This led to the long memory players fairing slightly better than shorter memory ones, although only marginally. The results also showed that matches between two memory-one players gave rise to more cooperation. Glynatsi (2020) has also shown that strategies with longer memory have an advantage over memory-one strategies in the IPD. O'Riordan et al. (2003) argue that the disadvantage of a memory-one strategy like TFT is their tendency to fall into a spiral of mutual defection. This tendency can be dissuaded by using a longer memory of the game. This has been proved by Li & Kendall (2014) that showed how well designed longer memory strategies statistically receive a higher payoff than shorter memory strategies in interactions with more arbitrary opponents. This suggests that longer memory strategies are more evolutionarily stable than shorter memory strategies in the IPD. This aligns with the results seen in Lin et al.’s (2020) experiment using RL algorithms in the IPD. They argue that the benefit of having longer state representations is related to the fact that the learning systems in the two-player IPD undergo multiple states. Thus, without a more complex understanding of the context, the reward-mappings can be oversimplified, possibly leading to suboptimal learning of the problem. 13 4. Method 4.1. Design To test the performance of an RL algorithm with different memory lengths in the IPD, a round-robin tournament was run using a Q-learning algorithm against 9 default strategies. The tournament was chosen to evaluate the algorithm’s performance since it provides a game theoretical framework to study the IPD (Glynatsi, 2020). The aim of the study was to provide information on how state representations affected the overall performance in the tournament and the outcome against individual opponent strategies. The tournament was run with the standard 200 turns per game but was repeated 100 times instead of the original five. This was done to ensure a good estimate of the algorithm’s performance by taking into account the inherent stochasticity of the tournaments (Harper et al., 2017). The payoff matrix seen in Table 2 was used for the tournament. Each step of the game represented one PD game, where the actions available were Cooperation (C) or Defection (D). The players had no knowledge of the length of each game in the tournament. At the end of each round, the results were reset, and any memory the agents had stored were wiped. At the end of each tournament, the results were summarised, and the players were ranked based on the total average score. The run of the tournament was named Test 1. 4.2. Materials 4.2.1. Axelrod Python Library To perform the test, the Axelrod Python Library (APL) was used. The APL is an open-source library created with the principles and goal of easy reproduction of previous research on the IPD. Other goals of the library are to create a useful tool for future research and make it simple for anyone to contribute to the library of IPD strategies. The library hosts over 230 strategies of different categories and dimensions. Round-robin tournaments and matches with different variables can be created and analysed with the library (Axelrod Python Library, 2015). 4.2.2. Strategies The strategies chosen to test the algorithm are nine default strategies for the IPD identified by Jurišić et al. (2012). They differ in their initial actions, levels of cooperation, and in their retaliatory behaviours. The nine strategies were chosen for their simplicity as optimal play against some of the strategies can be easily discerned. They were also chosen based on some of their previous success in the original Axelrod tournament. The strategies were selected with the limitation of the RL algorithm in mind, all being static opponents that creates a stationary environment. These strategies consequently worked as good benchmark strategies to test the RL agent’s learning capability and the benefit of different memory lengths. The nine strategies are described in Table 3. 14 Table 3 The nine default strategies used in the tournament. Name Description Cooperator Always cooperates Defector Always defects TitForTat (TFT) Starts by cooperating then mimics the moves of the opponent Grudger Cooperates until the opponent defect, then defects the rest of the game Random Plays cooperation or defection with a probability of 50% TitForTwoTats (TF2T) As TFT but defects after two successive defections SuspiciousTitForTat As TFT but starts with defection TwoTitsForTat (2TFT) As TFT but retaliates with two defections for every defection Win-Stay Lose-Shift Results for an action are divided into 2 groups: positive actions (T and R), and negative actions (P and S). If the previous action’s result belongs to the first group, the action is repeated, and if the result belongs in the second group, the action is changed. This strategy is also called Pavlov. 4.2.3. Algorithm choice and adaptations Q-learning was chosen for an algorithm in the study based on its status as a baseline RL algorithm, its applicability on stochastic games and its ready availability within the Axelrod Python library. Based on the results of previous studies, a tabular approach was chosen (Sandholm & Crites, 1996a). The known limitations of Q-learning in MARL motivated a choice of opponent strategies that would contribute to a stationary environment. These could thus highlight the learning ability of the algorithm for more static and predictable opponents while still giving some indication of its performance in a non-stationary setting as it also plays itself in the tournament (learning agent vs learning agent). The Q-learning algorithm used in the experiments was adapted from the Risky Q-Learner algorithm in the APL. The Risky Q-Learner was named based on its high learning rate setting, meaning it makes quick assumptions, which can be risky as it can lead to premature convergence and a suboptimal solution to the problem (Yannakakis & Togelius, 2018). The Risky Q-Learner has a state representation containing the last N moves of the opponent as an adjustable parameter together with the total cooperation count over the opponent’s history. An example state representation would be ‘CCDDD:34’, informing the agent of the opponent’s last five actions and that the opponent had cooperated 34 times in the game so far. A pre-run of the tournament showed that this state representation was suboptimal since the additive element of the cooperation count made the agent experience every state as new given 15 the opponent played more cooperations. An evaluation of the code revealed that the Q-function of the Risky Q-Learner was not corresponding to the original description of the algorithm. Changes were made to the algorithm so that the state representation only consisted of the N number of states of the opponent’s last actions. This state representation was in line with previous research of RL in the IPD (Lin et al., 2020; Sandholm & Crites, 1996b). The Q-function was adjusted to correspond to the original version. The option of annealing epsilon value was also added to the algorithm, ensuring that the agent could start with a certain probability of choosing random actions that then diminished over time. The new version of the algorithm was named Regular Q-Learner. 4.2.4. Algorithm parameter settings To test the implications of different memory lengths for an RL algorithm in the IPD, the Q-learning algorithm was run with three different memory lengths: 1, 5, and 10 (Mem1, Mem5 and Mem10). This corresponds to the state representation of the algorithm as the last N actions that the opponent played. Mem1 and Mem5 were used to represent the agent’s short and mid-length memory capacity and were motivated by results in previous studies (Lin et al., 2020; Sandholm & Crites, 1996b). Mem10 was chosen experimentally to represent a longer memory as test runs showed a higher cooperation rate and total score than Mem1 or Mem5. Hyperparameters alpha (learning rate) and gamma (discount rate) were held constant at ⍺ = 0.5 and γ = 0.95. The learning rate of alpha 0.5 was chosen experimentally as a mix between learning rate 0.2 as seen in Sandholm & Crites (1996b) work and a learning rate of 1 that slowly decreases as seen in Lin et al.’s (2020) work. A high discount rate was chosen as cooperation can only occur in the IPD if the future is considered important enough for the present action selection (Axelrod, 1984). This has also been proved experimentally (Sandholm & Crites, 1996b). Exploration was managed with an epsilon greedy policy with an epsilon value of 0.1 that slowly decayed so that the last third of the game was played with a deterministic policy. 4.3. Evaluation Metrics 4.3.1. Choice and motivation The performance of the algorithm in the tournament was evaluated with four metrics: total average score per tournament, rank, normalised payoffs per opponent per turn and cooperation rate. As there exists a variety of metrics to analyse strategy behaviour and results in the IPD, these metrics were chosen as they could easily depict the overall performance of the agent in the tournament as well as how well the agent was fairing against individual strategies. The cooperation rate was added as the IPD is often used to study the evolution of cooperation, and as such, it is considered an important metric to report (Glynatsi, 2020). As the focus was to find general patterns that could be used as inspiration for a new model, an analysis of statistical significance in the results was omitted. However, to better visualise the difference in results, 10 percent error bar plots were created. 16 4.3.2. Total average score per tournament & rank The total average score determines the winner of the tournament, which corresponds to the rank of the players. Thus, measuring the total average score and rank is a good indication of how well the algorithm performs compared to other strategies. 4.3.3. Normalised payoffs per opponent and turn This metric gives a good indication of which strategy the algorithm chooses to play against each opponent. Based on knowledge of the opponent strategy, the normalised payoffs per opponent and turn can tell whether the agent has learnt mainly to cooperate or defect against the opponent. The normalised payoffs per opponent and turn are bounded by the payoff matrix and will therefore be a number between 0 and 5 (Sucker’s Payoff and Temptation). 4.3.4. Cooperation rate Cooperation rate is a morality metric created by Singer-Clark (2014), which describes how many turns the agent cooperated over the whole tournament out of all repetitions of the tournament. The cooperation rate can be informative of the agents’ performance in relation to its total average score and normalised payoffs per opponent and turn. Cooperation rate values range from 1.0 (always cooperates) to 0.0 (always defects). 4.4. Test 1 4.4.1. Results The results of Test 1 can be seen in Table 4. Mem10 obtains the highest score of the three algorithms at 4021, with Mem5 coming second at 4009 and Mem1 doing the worst at 3815. Despite this, it is Mem10 that has the lowest rankings of the three algorithms, which indicates high stochasticity in the tournament. Mem10 performs best against the strategies from the TFT-family but has the lowest score for both Defector and Grudger, which are exploiting strategies (plays majority D). It is instead Mem1 that performs best against the Defector, and Mem5 that performs best against Grudger. Mem1 also performs best of the three against Win-Stay Lose-Shift. The results show that Mem1 has the lowest cooperation rate at 0.48, with Mem5 and Mem10 having higher cooperation rates at 0.67 and 0.71 respectively. The difference in normalised payoffs per opponent and turn for the three different memory lengths is minimal for some opponents and quite considerable for others, as can be seen in Figure 2. Against 2TFT for example, Mem10 scores 0.82 more than Mem1, but for Grudger the difference in payoff for the two is only 0.03. 17 Table 4 Results for Test 1 Mem1 Mem5 Mem10 Average score per tournament 3815 4009 4021 Rank 7/10 7/10 8/10 Cooperation rate 0.48 0.67 0.71 Mean payoff per opponent and turn Cooperator 3.80 3.81 3.70 Defector 0.87 0.85 0.83 TitForTat 2.22 2.74 2.83 Grudger 0.97 0.98 0.95 Random 2.00 1.71 1.57 2TitsForTat 1.74 2.38 2.56 SuspiciousTitForTat 2.15 2.75 2.83 TitFor2Tats 2.98 3.07 3.08 Win-Stay Lose-Shift 2.35 1.87 1.77 Regular Q-Learner 2.38 2.68 2.77 Note. The bold face indicates that the player in column i has the highest mean payoff per opponent and turn among the players. 18 Figure 2 Differences in mean payoff per opponent and turn for the three memory lengths 4.4.2. Interpretation Since the TFT-strategies are cooperation-maintaining strategies (they will punish you for defection and reward you for cooperation), it is not surprising that Mem10 performs best against them as it has the ability to represent more complex state-reward mappings, leading to a higher cooperation rate. Likewise, the lower cooperation rating of Mem1 seems to indicate that the agent has learnt better (or faster) how not to be taken advantage of by exploiting strategies (like Defector and Win-Stay Lose-Shift) and how to better exploit naively cooperating strategies (like Cooperator). Since Mem10 has more state-representations to evaluate, it takes the algorithm longer to decide on good Q-values for each state. This delay in learning means that Mem10 learns optimal play against exploiting strategies much slower than Mem1 and Mem5. This seems to be a disadvantage of a longer memory for exploiting opponents but an advantage against opponents with whom it is better to cooperate. This raises the question of whether a longer memory means higher cooperation in general or whether this is simply because 200 turns per game were not sufficient for the algorithm to find optimal play against certain opponents. 4.4.3. Extension of Test 1 To determine whether a prolonged learning time would give better results, the tournament was rerun but with 2000 turns per game and 50 repetitions. The results can be seen in Table 5. Again, it is Mem10 that obtains the highest score at 42234 followed by Mem5 at 41321 and 19 Mem1 at 39795. The trends seen in the first tests are mainly still present. However, the ranking has now changed so that both Mem5 and Mem10 place 6th in the tournament. This means that Mem10 has advanced two ranks compared to Test 1. The cooperation rate of all algorithms have decreased given more time to learn and it is still Mem1 that performs best against exploiting strategies, and Mem10 that performs best against the TFT-family. The results indicate that a longer run time has improved the algorithms’ ability to avoid being exploited and to exploit naively cooperating strategies, while mainly maintaining the scores of Test 1 for more complex opponents (TFT-family). The differences in mean payoff per opponent and turn is visualised in Figure 3. Table 5 Results for the extended Test 1 Mem1 Mem5 Mem10 Average score per tournament 39 795 41 321 42234 Rank 7/10 6/10 6/10 Cooperation rate 0.24 0.51 0.59 Mean payoff per opponent and turn Cooperator 4.77 4.76 4.75 Defector 0.95 0.95 0.94 TitForTat 1.87 2.58 2.76 Grudger 0.97 0.96 0.96 Random 2.67 1.84 1.60 2TitsForTat 1.20 1.85 2.30 SuspiciousTitForTat 1.71 2.60 2.75 TitFor2Tats 2.91 3.04 3.06 Win-Stay Lose-Shift 2.85 2.09 2.00 Regular Q-Learner 1.64 2.52 2.68 Note. The bold face indicates that the player in column i has the highest mean payoff per opponent and turn among the players. 20 Figure 3 Differences in mean payoff per opponent and turn for the three memory lengths 5. Mixed Memory Q-Learner 5.1. Algorithm description 5.1.1. Concept Given the results of Test 1, a new strategy was proposed: Mixed Memory Q-Learner (MMQL). This strategy leverages the two insights gained from Test 1: that different memory lengths (state-representations) are preferable depending on the opponent's strategy. And secondly, that a longer learning time could lead to better results in the tournament. The first distinguishing feature of the MMQL is the ability to recognise its opponents between tournament repetitions, thus starting its learning where it ended the last time it played the same opponent. This means that the agent could participate in a standard Axelrod tournament with 200 turns over five repetitions and still learn over 1000 turns. Axelrod (1984) claims that the ability to discriminate between individuals is an essential skill as this enables agents to treat various individuals differently. In combination with memory, this skill can be used to punish defectors from the start in the subsequent interaction while rewarding cooperators with immediate trust and cooperation. The second distinguishing feature of the MMQL is the ability to switch memory length during play. This means the algorithm 21 performs an evaluation during run-time of which memory length gives the highest payoff and then switches if another memory length is better than the current one used. This enables the strategy to adapt to its opponents and to be self-reflective about its performance. 5.1.2. Architecture MMQL is a meta-strategy that uses a team of three different Q-learning algorithms simultaneously to find the best policy. The Q-learning algorithms are the Regular Q-Learner with three different memory lengths as analysed in Test 1 (Mem1, Mem5 and Mem10). The algorithm works by letting all three players play the opponent at the same time. During play, there is always one player which has the status as the current policy. This means that the action played at each turn of the game, and the reward signal received, is based on this player’s policy. This approach, where each agent has its own individual policy but optimises it from a joint reward signal, is called concurrent learning (Gupta et al., 2017). If MMQL has not met an opponent before, the current policy player is chosen randomly. The first 40 turns of each game are used to collect knowledge and initialise Q-values (similar to the concept of experience replay buffer, see for example Lapan (2018)). At turn 40 and up till turn 180, an evaluation will take place at every 20th step (turn). This evaluation is done by looking at the last 20 steps played by each of the three players and measuring the average reward received over those 20 steps. If one player has a higher average reward than the current policy, this player is chosen as the new current policy (similar to the concept of training two policies at once and then switching if one policy outperforms the other by a certain threshold in Silver et al. (2017)). If two policies have the highest average reward, the current policy is randomly selected from the two. This is true unless the current policy is within that list, in which case the current policy is chosen to continue. To balance the variance in Q-values created by switching policies, the probability of changing the current policy gets smaller as the game goes on. At the end of the game, the Q-values of all the players are stored away and remembered based on the opponent’s name. The next time MMQL meets the opponent, the Q-values from the previous match are loaded, and the player that was the current policy at the end of the last game begins in that position again. MMQL is similar to the model used by Hoang et al. (2018), where a multi-agent system is used to generate best-response strategies to particular opponent behaviours, that are then selected from by a high-level planner and adjusted as the opponent behaviour changes. A formulation of MMQL can be seen in Algorithm 2. 22 Algorithm 2 Mixed Memory Q-learning For every step in game do: 1. If beginning of game: Check if has met opponent before: if True: - load previous Q-values - set current policy to previous best policy if False: - initialise all states S - randomly select from the 3 policies 2. For all 3 players do: - play opponents 3. Check if at evaluation point: if True: - perform policy evaluation - update current policy 4. Check if end of game: if True: - unload Q-values 5. Play action chosen by current policy 6. Update Q-values based on reward signal 7. Repeat until game ends 5.1.3. Limitations To account for the exploration in the algorithms, the evaluation window was set to 20 steps to ensure that there was enough possibility for variance in all players’ results to make a just assessment. Due to more random actions being taken early in the game, one player might by chance get a higher reward than another player. However, as that player is then chosen as the current policy, it will become clear whether its policy has benefits, and if not, it will be sorted out at the next evaluation point. As the exploration rate and the probability of switching policy get smaller throughout the game, the evaluation should become more fair later in each game. This instability in the evaluations at the beginning of each game is deemed to be a necessary tradeoff to ensure that MMQL does not get stuck in a local minimum and continues to explore new ways of acting. Another potential problem with MMQL is that the players all update their Q-values based on the state they find themselves in, which is a direct result of the actions of the current policy. If a player is not the current policy, it might form a conception of the opponent’s behaviour that is not based on its own actions but of the actions of other players. If the players do not get a 23 chance to play and explore as the current policy, their Q-values may mainly resemble other players’, thus creating a blending effect. However, the high exploration rate at the beginning of each game, and the difference in state-representations for the three players, is expected to make sure that their strategies will still be distinguishable from one another. 5.1.4. Performance prediction The idea of MMQL is to combine the best play of each memory length to get a higher long-term return. MMQL is expected to sense which memory length to use as state-representation for each opponent and to prioritise the use of that memory length as its current policy. By doing so, it should be able to both avoid being exploited by exploitative opponents and cooperate with cooperative opponents. By continuous learning over tournament repetitions, the agent should be able to settle on a good policy for each opponent. This will enable the agent to take part in the standard Axelrod tournament without losing all its memory between rounds, thus performing better overall. 5.2. Test 2 5.2.1. Design To test the performance of MMQL, a standard Axelrod tournament was run with 200 turns per game over five repetitions. The MMQL played the same nine default strategies used in Test 1. A tournament was also run for the three Q-learning algorithms from Test 1 but with added ability to recognise and continue to learn over the whole tournament (same as MMQL). This was done to discern what effect MMQL’s policy switching ability had on the result, and whether it had any benefits to using the same memory length throughout. 5.2.2. Parameter settings The parameters for the three Q-learners used individually and in the MMQL were the same as in Test 1. Learning and discount rates were held constant at ⍺ = 0.5 and γ = 0.95. The players used an epsilon greedy policy starting at 0.1 which slowly decreased over the game. The exploration parameter was reset at the beginning of each game. 5.2.3. Additional evaluation methods To evaluate the performance of MMQL, two additional methods were used. One method checked for convergence of state-values over the games, thus showing if an extended learning time had enabled the players to settle on a good policy for each opponent. This was done by measuring the difference in state-values at every five steps of each game and calculating the root mean squared error (RMSE) between the current and last evaluation point. The RMSE is a good evaluation metric for model performance when the error distribution is assumed to be Gaussian (Chai & Draxler, 2014). These results were then plotted to see how the values changed over the tournament and whether the state-value updates were small enough to indicate that the algorithms had settled on a policy at the end of the tournament. 24 The second evaluation method measured how much of each game (in percentage) each player played as the current policy. This was done to indicate whether the algorithm had leveraged the findings of Test 1, thus choosing the memory length that had performed best against each opponent as its current policy for the majority of the tournament. These measurements were then plotted to give a visual indication of the results. 6. Results The results from Test 2 can be seen in Figure 4 and Table 6. It is still Mem 10 that gets the highest average score per tournament at 3911.2, with MMQL coming second at 3826, Mem5 third at 3660 and Mem1 last at 3521. Despite the predictions, none of the algorithms placed higher in the tournament than rank 7. Compared to the extended Test 1, the cooperation rate has decreased significantly for all algorithms. Mem10 has the highest cooperation rate at 0.4, followed by MMQL at 0.23, Mem5 at 0.18 and Mem1 at 0.08. Regarding mean payoff per opponent and turn, MMQL gets the highest score against Defector at 0.975, with Mem5 coming a close second at 0.970. It also has the highest score for Grudger at 1.01, same as Mem1. Besides this, the highest mean payoff per opponent and turn follows the same pattern as the results in the extended Test 1. This means that it is still Mem1 that performs best against Cooperator, Random, and Win-Stay Lose-Shift and Mem10 that performs best against the TFT-family and itself. However, MMQL comes in as second best for Cooperator, TFT, SuspiciousTFT, TF2T, Win-Stay Shift-Lose and against itself. The differences in mean payoff per opponent and turn are visualised in Figure 5. Figure 4 The tournament rankings of each strategy with the winner, Tit For Tat, to the left. Y-axis shows mean payoff per opponent and turn. 25 Table 6 Results for Test 2 Mem1 Mem5 Mem10 MMQL Average score per tournament 3521 3660 3911 3826 Rank 7/10 7/10 7/10 7/10 Cooperation rate 0.08 0.18 0.40 0.23 Mean payoff per opponent and turn Cooperator 4.94 4.91 4.89 4.93 Defector 0.97 0.97 0.96 0.98 TitForTat 1.16 1.58 2.28 1.99 Grudger 1.01 0.97 1.00 1.01 Random 2.63 2.37 1.69 2.31 2TitsForTat 1.02 1.08 1.22 1.05 SuspiciousTitForTat 1.16 1.57 2.51 1.89 TitFor2Tats 1.87 2.41 3.08 2.41 Win-Stay Lose-Shift 2.85 2.42 1.93 2.58 Regular Q-Learner/ MMQL 1.24 1.34 2.47 1.79 Note. The bold face indicates that the player in column i has the highest mean payoff per opponent and turn among the players. 26 Figure 5 Differences in mean payoff per opponent per turn When looking at which player in MMQL is used most as the current policy for each opponent, we can see that MMQL does leverage some of the results from Test 1. According to those findings, Mem1 should give the best results for Cooperator, Defector, Grudger, Random and Win-Stay Shift-Lose. Mem1 is used as the current policy most out of the three players for Defector (70% of the tournament), Grudger (48% of the tournament), Win-Stay Shift-Lose (60% of the tournament) but not for Cooperator (32% of the tournament), where instead Mem5 is used the most (36% of the tournament). Mem5 is also used the most for Random (42% of the tournament) compared to Mem1 (32% of the tournament). If using the results from Test 1 as a guide, MMQL should play Mem10 for the majority of the game against the TFT-family and against itself. However, Mem10 is used most times only against TF2T (44% of the tournament). Instead, Mem5 is used most against TFT (48% of the tournament), 2TFT (50% of the tournament), and MMQL (42% of the tournament). Mem1 is used most against SuspiciousTFT (50% of the tournament). Calculating the distribution of use of each player as the current policy over the whole tournament for all strategies, we can see that Mem1 is used most (36,6% of the time), closely followed by Mem5 (34,8%) and Mem10 is used the least (28,6%). Table 7 shows how many percent each player played as the current policy against each opponent over the five tournaments. 27 Table 7 Percentage of the game played with each player of MMQL against each opponent Players of MMQL Mem1 Mem5 Mem10 Opponents Cooperator 32% 36% 32% Defector 70% 28% 2% TitForTat 12% 48% 40% Grudger 48% 16% 36% Random 32% 42% 26% TwoTitsForTat 10% 50% 40% SuspiciousTitForTat 50% 40% 10% TitFor2Tats 30% 26% 44% Win-Stay Lose-Shift 60% 20% 20% MMQL 22% 42% 36% Note. The bold face indicates that the player in column i has been used for most of the time against the opponent. 28 Figure 6 Percentage played with each player for a) Cooperator, b) Defector, c) Tit For Tat, d) Tit For 2 Tats The convergence plots show that the state-values for both MMQL and the Regular Q-Learners get disturbed by the off-and-on loading of Q-values between rounds. This can be discerned from the four abrupt spikes of change in RSME seen at the end of each game. This effect is more apparent for MMQL than the Regular Q-Learners and depends on the opponent. In general, Mem1 has the highest variance of RSME throughout the games, followed by Mem5 and, lastly, Mem10. This is true for both MMQL and the Regular Q-Learners. The plots show how all three players in MMQL converge on a policy for Cooperator, Defector and Grudger quickly in the game, while the RSME fluctuation for the other opponents is more volatile. This is especially true when playing against itself. In general, it appears that introducing the ability of policy switching creates more instability in the state-value updates than simply using the same state representation throughout the game. However, this does not appear to be true for SuspiciousTFT and Grudger, where MMQL instead seems to find a more stable policy than the Regular Q-Learners (see Appendix C). Figures 7 & 8 shows the convergence plots for all players for opponents split into two groups: TFT-family and Cooperator & MMQL. 29 Figure 7 Convergence plots for the three players in MMQL (left column) and the three Regular Q-Learners (right column) against the TFT-family 30 Figure 8 Convergence plots for the three players in MMQL (left column) and the three Regular Q-Learners (right column) against MMQL & Cooperator Note. The Regular Q-Learners play against themselves, not MMQL. Therefore, the label “MMQL” in the right column is supposed to be “Regular Q-Learner”. 31 7. Discussion 7.1. Results analysis The result of Test 2 shows that MMQL does leverage some of the effects seen in Test 1. For example, it chooses to use the previously established best memory length for most of the time against certain opponents (for instance, against Defector and Win-Stay Shift-Lose). It also appears that a longer learning time has made the agent better at playing certain opponents (like Cooperator, Grudger & Defector). However, MMQL does not perform better in the tournament than Mem10 with continuous learning ability, and it did not place higher in the tournament as was predicted. Instead, TFT won, same as in the original Axelrod tournament (Axelrod, 1984). The low ranking of MMQL in the tournament seems to be because it did not learn to cooperate with cooperative agents. The ability to recognise its opponent and have a longer learning period only appears to have made the agent better at exploiting naively cooperating agents and avoiding being exploited itself. This is in line with the work done by Leibo et al. (2017) that showed that either cooperation or defection policies might be easier to learn depending on the problem itself. The PD being the most severe social dilemma, it is perhaps not surprising that the agent was quicker to develop a defection policy. Why didn’t MMQL develop more cooperation? One possible answer is that MMQL didn’t use Mem10 most of the time against cooperation-maintaining strategies (TFT family), as Mem10 had the highest cooperation rate out of the three players. But it could also be a result of the value function itself. By primarily focusing on maximising reward, it is difficult to learn to cooperate in the long run. The algorithm will always be biased towards the short term high reward of Temptation rather than the long term reward of continuous cooperation. It therefore takes time to learn a cooperating policy. It is possible that adjusting the learning and exploration rate might impact what policy the agent learns and how fast it learns it, potentially leading to more cooperation. However, the bias toward maximum reward will still assert a solid pull to defect for the agent. Another option is to change the payoff matrix, making Sucker’s payoff and Punishment into zero or negative values. This could lead the agents to cooperate more, as Punishment would no longer be a positive reward, potentially decreasing defection. Still, a maximising agent can develop cooperation. As described in chapter 3.1, a Q-learning agent can be taught to cooperate if first trained in an environment with cooperation-maintaining strategies (like TFT). This initialises the state-values to make cooperation a more desirable choice. But since MMQL starts with state-values as zero, it is easily biased towards the higher reward of Temptation, and it seems like only Mem10 can fully represent cooperation as a good choice. This follows Li & Kendall’s (2014) insight that longer memory strategies are more evolutionarily stable in the IPD. However, since Mem10 takes longer to settle on an optimal policy (since it has to visit more states several times to find the correct state-values), MMQL is biased toward using Mem5 and Mem1 since they 32 occasionally get a high reward by exploiting the good nature of its opponent. It’s possible that a longer run time still might lead MMQL to select Mem10 more as its current policy when playing cooperation-maintaining opponents (like the TFT-family). Yet, there is also a possibility that a longer run time will lower the cooperation rate of all players further and the solution should therefore focus on making Mem10 perform better sooner in the game. This could perhaps be done by setting a higher learning rate for Mem10 than the other players or simply changing the approach and using a function approximator instead. To incite cooperation in the MMQL, the agent might need additional capabilities, make use of a neural network or the problem itself could be reframed. The PD could, for example, be constructed as an n-person game where agents in a society can punish non-cooperating agents (Axelrod, 1997). This is done by adding further actions and payoffs so that players that see others defect can choose to punish those players (payoff ₋9), but at a certain cost for themselves (payoff ₋2). This could significantly affect the value function, as the player would have to adhere to social norms. Reframing the problem could be a useful step to get a self-interested agent to cooperate more. Still, if the objective is to play optimally in the IPD, it might be best to consider how well MMQL follows Axelrod’s suggestions on how to succeed in the game. When doing so, it is easy to see that MMQL is the opposite of the suggested best player: it is envious (it strives for a larger payoff than its opponents), it is not nice (it might be the first to defect), it does not reciprocate both cooperation and defection (it is biased towards defection), and it is not clear (it changes its behaviour throughout the game). RL might be a valuable tool to train other strategies in the IPD (see Harper et al., 2017), but it seems unlikely that a simple RL algorithm like Q-learning will ever place high in the standard tournament. However, the results show that having the ability to represent the problem with different state-representations, in this case, different memory lengths, could be beneficial when meeting heterogenous opponents. This is likely to remain true, even if the problem is reframed or if other capabilities are added to the algorithm. 7.2. MMQL analysis One problem with MMQL is the instability in the state-value changes, making it more difficult for the players to converge on a good policy. Adjusting two things in the architecture of the algorithm might solve this. The first fix is to have the three players play the opponent independently, meaning their Q-value updates would be based on their own actions and not those of the current policy. This might lead to each player developing a more distinct strategy and is likely to ease convergence. The second fix is to refine the policy evaluation of MMQL by adding a threshold for the difference in the average reward necessary for the current policy to switch. As of now, the policy switches, even if the difference in average reward is 0.1, which might be too small a difference to justify the instability that comes with policy-switching. Since more random actions are taken at the beginning of each game, the threshold could be set at a lower value at first, enabling more policy switching at the beginning of the game, and could then increase later in the game to ensure that only policies 33 that play considerably better are chosen as the current policy. This parameter could also be set to continue over the different rounds and not be reset for each game, creating more stability. The two solutions are expected to work well together, as the independent policies will be more representative of each player’s actual strategy, thus being a better model of the original idea of MMQL. Even if MMQL does not perform well in the IPD, having the ability to create different strategies and sense which strategy to use, depending on the environment, is bound to be an important function for future AIs in social dilemmas. Thus, the idea behind MMQL can be scaled to fit other problems and situations. The concept of MMQL can be used to create a Multi-Objective Reinforcement Learning (MORL) algorithm. MORL is based on the idea that many tasks have more than one goal or objective and that explicitly accounting for that when implementing the algorithm is better than simply representing the objective as a scalar value (Horie et al., 2019). By having more value functions that all have different goals, specific team players could be instructed to optimise for a cooperative opponent, and others could be instructed to optimise for an exploitative opponent. As shown in this study, different state representations are preferable depending on the opponent’s strategy. Having a team of agents with different objectives that can try out and change between different state representations to maximise those objectives might be a successful approach to MARL for social dilemmas. 7.3. Limitations It is worth mentioning that MMQL’s ability to learn over tournament repetitions might in some respects be seen as cheating. This is because none of the other players can recognise their opponent or remember their last interaction. However, this ability was added because of the perception that RL algorithms have an unfair disadvantage in the standard tournament. The nine default strategies used in this study all have a clear strategy to adhere to at every point of the game. Other non-static strategies in the IPD have often had training before participation. In contrast, all RL algorithms in the APL have to learn their strategy from scratch at the beginning of each game. This could be seen as unfair, and the idea was thus to see how an RL algorithm would perform in the tournament if it could keep its memory between games. 7.4. Future research The ability to change and try out state representations is akin to meta-RL, which deals with how agents learn how to learn (Hospedales et al., 2021). Combining this with multi-objective RL, it becomes Multi-Objective Meta-Learning (MOML) (Ye et al., 2021). Future research could investigate the best way to learn which state representation to use for specific problems and how this evaluation could occur within a meta-strategy with team members with different goals. Representing the state of the world is a complex problem and is described by Sutton & Barto (2018) as “presently more art than science”. This shows a knowledge gap, and future 34 studies should therefore be devoted to filling this gap so that the AI research community can develop best practices for safe and efficient state representations for different problems. 8. Conclusion This study contributed to new understanding of the role of memory length for an RL algorithm in the IPD by showing how different memory lengths are preferable depending on the opponent. Based on the findings, a new algorithm presenting a novel approach to the IPD was created, called Mixed Memory Q-Learner. MMQL had the ability to switch state representation (memory length in the IPD) during play and to recognise its opponents between interactions. By comparing MMQL to other algorithms with fixed state representations and the ability to learn across several games, the results showed that the ability to flexibly switch between different memory lengths gave the algorithm an advantage against certain opponents. However, despite the predictions, MMQL did not perform better in the tournament overall. A likely explanation for the low ranking is that the algorithm only learned to exploit and defend itself but not cooperate. Future research could therefore extend the concept of MMQL to a Multi-Objective Meta Learning (MOML) problem, using different objectives and state representations depending on the strategy of the opponent, thus inciting more cooperation when advisable. The results hint at the importance of flexibility in the state representations for an RL-algorithm when partaking in a social dilemma. 35 9. References Anastassacos, N., & Musolesi, M. (2018). Learning through Probing: a decentralized reinforcement learning architecture for social dilemmas. ArXiv, abs/1809.10007. Axelrod Python library. (2015). Axelrod Documentation Page. https://axelrod.readthedocs.io/en/stable/ Axelrod, R. (1980). Effective choice in the prisoner’s dilemma. The Journal of Conflict Resolution, 24(1):3–25. Axelrod, R. (1984). Evolution Of Cooperation. Basic Books. Axelrod, R. (1997). The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press. Bertino, E., Doshi-Velez, F., Gini, M.L., Lopresti, D., & Parkes, D. (2020). Artificial Intelligence & Cooperation. ArXiv, abs/2012.06034. Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Development, 7(3), 1247–1250. https://doi.org/10.5194/gmd-7-1247-2014 Glynatsi, N. (2020) Understanding responses to environments for the Prisoner’s Dilemma: A meta analysis, multidimensional optimisation and machine learning approach (Doctoral dissertation, Cardiff University, Cardiff, United Kingdom). Retrieved from http://orca.cf.ac.uk/id/eprint/135221 Gupta, J.K., Egorov, M., & Kochenderfer, M.J. (2017). Cooperative Multi-agent Control Using Deep Reinforcement Learning. AAMAS Workshops. Harper, M., Knight, V., Jones, M., Koutsovoulos, G., Glynatsi, N. E., & Campbell, O. (2017). Reinforcement learning produces dominant strategies for the Iterated Prisoner’s Dilemma. PLOS ONE, 12(12), e0188046. https://doi.org/10.1371/journal.pone.0188046 Hernandez-Leal, P., Kaisers, M., Baarslag, T., & Cote, E.M. (2017). A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity. ArXiv, abs/1707.09183. Hoang, T.N., Xiao, Y., Sivakumar, K., Amato, C., & How, J. (2018). Near-Optimal Adversarial Policy Switching for Decentralized Asynchronous Multi-Agent Systems. 2018 IEEE International Conference on Robotics and Automation (ICRA), 6373-6380. 36 Horie, N., Matsui, T., Moriyama, K., Mutoh, A., & Inuzuka, N. (2019). Multi-objective safe reinforcement learning. Artificial Life and Robotics. Published. https://doi.org/10.1007/s10015-019-00524-2 Hospedales, T.M., Antoniou, A., Micaelli, P., & Storkey, A. (2021). Meta-Learning in Neural Networks: A Survey. IEEE transactions on pattern analysis and machine intelligence, PP. Jackson, M. O. (2011). A Brief Introduction to the Basics of Game Theory. SSRN Electronic Journal, 1–21. https://doi.org/10.2139/ssrn.1968579 Jurišić, M., Kermek, D., & Konecki, M. (2012). A review of iterated prisoner's dilemma strategies. Proceedings of the 35th International Convention MIPRO, Opatija, Croatia, 2012, pp. 1093-1097. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237–285. https://doi.org/10.1613/jair.301 Kaisers, M. (2012). Learning against learning: evolutionary dynamics of reinforcement learning algorithms in strategic interactions. (Doctoral dissertation, Maastricht University, Maastricht). Retrieved from: http://michaelkaisers.com/publications/2012_PhD_Kaisers.pdf Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more (2nd edition). Birmingham: Packt Publishing. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., & Graepel, T. (2017). Multi-agent Reinforcement Learning in Sequential Social Dilemmas. ArXiv, abs/1702.03037. Leyton-Brown, K. (2008). Essentials of Game Theory: A Concise, Multidisciplinary Introduction (Synthesis Lectures on Artificial Intelligence and Machine Learning) (Illustrated ed.). Morgan and Claypool Publishers. Li, J., & Kendall, G. (2014). The Effect of Memory Size on the Evolutionary Stability of Strategies in Iterated Prisoner's Dilemma. IEEE Transactions on Evolutionary Computation, 18, 819-826. Lin, B., Bouneffouf, D., & Cecchi, G. (2020). Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior. ArXiv, abs/2006.06580. Merriam-Webster. (n.d.). Algorithm. In Merriam-Webster.com dictionary. Retrieved May 1, 2021, from https://www.merriam-webster.com/dictionary/algorithm 37 Nguyen, T. T., Nguyen, N. D., & Nahavandi, S. (2020). Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Transactions on Cybernetics, 50(9), 3826–3839. https://doi.org/10.1109/tcyb.2020.2977374 O'Riordan, C., Griffith, J., & Sorensen, H. (2003). Forgiveness in Strategies in Noisy Multi-agent Environments. CEEMAS. Poundstone, W. (1993). Prisoner’s Dilemma: John von Neumann, Game Theory, and the Puzzle of the Bomb (First Edition Thus ed.). Anchor.Markov Pumperla, M., & Ferguson, K. (2019). Deep Learning and the Game of Go (1st ed.). Manning Publications. Rand, D. G., & Nowak, M. A. (2013). Human cooperation. Trends in Cognitive Sciences, 17(8), 413–425. https://doi.org/10.1016/j.tics.2013.06.003 Rapoport, A., & Chammah, A. M. (1965). Prisoner’s Dilemma (First Edition). University of Michigan Press. Rapoport, A. (1974). Prisoner’s Dilemma — Recollections and Observations. In: Rapoport A. (eds) Game Theory as a Theory of a Conflict Resolution. Theory and Decision Library (An International Series in the Philosophy and Methodology of the Social and Behavioral Sciences), vol 2. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-2161-6_2 Ross, D., (2019). Game Theory. In E. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2019 ed.). Stanford University (Winter 2019 ed.). https://plato.stanford.edu/entries/game-theory/ Samuelson, P. A. (1938). A Note on the Pure Theory of Consumer’s Behaviour. Economica, 5(17), 61. https://doi.org/10.2307/2548836 Sandholm, T. W., & Crites, R. H. (1996a). Multiagent reinforcement learning in the Iterated Prisoner’s Dilemma. Biosystems, 37(1–2), 147–166. https://doi.org/10.1016/0303-2647(95)01551-5 Sandholm, T. W., & Crites, R. H. (1996b). On multiagent Q-learning in a semi-competitive domain. Lecture Notes in Computer Science Adaption and Learning in Multi-Agent Systems, 191-205. doi:10.1007/3-540-60923-7_28 Seiffertt, J., Mulder, S., Dua, R., & Wunsch, D. C. (2009). Neural networks and Markov models for the iterated prisoners dilemma. 2009 International Joint Conference on Neural Networks. doi:10.1109/ijcnn.2009.5178800 38 Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270 Simon, H. A. (1982). Models of Bounded Rationality, Volume 1: Economic Analysis and Public Policy (1st ed.). The MIT Press. Singer-Clark, T. (2014). Morality Metrics On Iterated Prisoner’s Dilemma Players. Retrieved from: https://www.scottaaronson.com/morality.pdf Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning, second edition: An Introduction (Adaptive Computation and Machine Learning series) (second edition). Bradford Books. Watkins, C. (1989). Learning from delayed reward (Doctoral dissertation, King’s College, Cambridge). Retrieved from: https://www.researchgate.net/publication/33784417_Learning_From_Delayed_Rewa rds Watkins, C. J., & Dayan, P. (1992). Technical Note. Q-Learning. Machine Learning, 8(3-4), 279-292. doi:10.1007/BF00992698 Yannakakis, G. N., & Togelius, J. (2018). Artificial Intelligence and Games (1st ed. 2018 ed.). Springer. Ye, F., Lin, B., Yue, Z., Guo, P., Xiao, Q., & Zhang, Y. (2021). Multi-Objective Meta Learning. ArXiv, abs/2102.07121. 39 Appendix A Code The code created to realise MMQL can be found on the author’s Github via this link: https://github.com/dollbo/Thesis_project To run the program, please contact the author for further instructions. 40 Appendix B Markov Decision Processes description A Markov Decision Process (MDP) is a framework from probability theory that is used to model sequential decision problems (Seiffertt et al., 2009). To find a policy for how an agent should behave in a certain environment, the RL problem is modelled as an MDP (Lapan, 2018). An MDP is an extension of a Markov Process (MP) and a Markov Reward Process (MRP). An MP, also called Markov chain, describes a system that has different states that change over time according to some dynamic rules. The MP cannot be influenced, but only observed as the state changes. The state space is the name of the set of all possible states of a system, and the MP or Markov chain is constituted by a number of observations of different states of a system as it happens. To qualify as an MP the system must have the Markov property. This means that each state must contain information of all history of interaction that is relevant for the future (Sutton & Barto, 2018). This is because all the future states of the system must depend only on this state, which means that every state has to be distinguishable from other states (Lapan, 2018). An MP thus contains a set of states and a transition matrix that describes the probability of one state shifting to another state. An MRP extends the MP framework by adding a value to the transitions to different states. This value is the reward. In the MRP, at a certain time point, the rewards from ensuing states are calculated and summed up as the return at that particular time point. However, how much the system considers the rewards gained at distant time steps is adjusted for by the parameter gamma 𝛾. With a low gamma (value close to 0), only the immediate rewards are considered as important information about the state at that time. With a higher gamma (close to 1) distant rewards are also considered in the evaluation. This return quantity for each timestep can be calculated for the whole state space, which will give the state value. This means that every state has a value attached to it which is the expected return to get from that state given we follow the MRP. To make an MRP into an MDP, actions are added to the process. This affects both the transition matrix and the reward matrix. The transition matrix changes as taking different actions lead to different states and therefore different actions have different probabilities of leading to certain states. The reward matrix is affected by this as the reward the agent receives is not just conditioned on the state it is in anymore, but also on the action that resulted in that state. As the goal of an RL agent is to obtain as high a reward as possible, the solution to an MDP is the policy that informs the agent which is the optimal actions to take in each state to obtain the highest return. An MDP is therefore a useful framework as it captures important aspects of the problem facing any learning agent that is trying to reach a goal by interacting with its environment. The MDP framework captures typical features of learning dynamics such as a sense of cause and effect, the uncertainty of which action will lead to 41 what state, and explicit goals of the agent (Sutton & Barto, 2018). Below is a formulated description of an MDP: MDP is a 5-tuple (S, A, P, R, 𝛾), where ● S is a set of states ● A is a set of actions ● P(s, a, s’) is the probability that action a in state s at time t will lead to state s’ at time t+1 ● R(s, a, s’) is the immediate reward obtained by taking action a in state s which lead to state s’ ● 𝛾 is the discount factor used to create a discounted reward 42 Appendix C Figures from Test 2 Figure C1 Percentage played with each player for a) Suspicious TitForTat, b) Win-Stay Shift-Lose, c) Random, d) Grudger, e) 2TitsForTat, and f) MMQL 43 Figure C2 Convergence plots for the three players in MMQL (left column) and the three Regular Q-Learners (right column) against Cooperator and Defector 44 Figure C3 Convergence plots for the three players in MMQL (left column) and the three Regular Q-Learners (right column) against Grudger, Random & Win-Stay Shift-Lose. 45